[jira] [Assigned] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
[ https://issues.apache.org/jira/browse/SPARK-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8698: --- Assignee: Apache Spark (was: Reynold Xin) > partitionBy in Python DataFrame reader/writer interface should not default to > empty tuple > - > > Key: SPARK-8698 > URL: https://issues.apache.org/jira/browse/SPARK-8698 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
[ https://issues.apache.org/jira/browse/SPARK-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8698: --- Assignee: Reynold Xin (was: Apache Spark) > partitionBy in Python DataFrame reader/writer interface should not default to > empty tuple > - > > Key: SPARK-8698 > URL: https://issues.apache.org/jira/browse/SPARK-8698 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
[ https://issues.apache.org/jira/browse/SPARK-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605203#comment-14605203 ] Apache Spark commented on SPARK-8698: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7079 > partitionBy in Python DataFrame reader/writer interface should not default to > empty tuple > - > > Key: SPARK-8698 > URL: https://issues.apache.org/jira/browse/SPARK-8698 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
Reynold Xin created SPARK-8698: -- Summary: partitionBy in Python DataFrame reader/writer interface should not default to empty tuple Key: SPARK-8698 URL: https://issues.apache.org/jira/browse/SPARK-8698 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605200#comment-14605200 ] Alok Singh commented on SPARK-5571: --- Just wanted to get more clarification on this. Does this JIRA expect all the components, i.e. tokenizer -> stemmer -> stopword -> runWithPrunedBagOfWords? Or do we assume that the input is already tokenized, stemmed, and stopword-removed? > LDA should handle text as well > -- > > Key: SPARK-5571 > URL: https://issues.apache.org/jira/browse/SPARK-5571 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Latent Dirichlet Allocation (LDA) currently operates only on vectors of word > counts. It should also support training and prediction using text > (Strings). > This plan is sketched in the [original LDA design > doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. > There should be: > * a runWithText() method which takes an RDD with a collection of Strings (bags > of words). This will also index terms and compute a dictionary. > * a dictionary parameter for when LDA is run with word-count vectors > * prediction/feedback methods returning Strings (such as > describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
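A hedged sketch of the proposed runWithText() entry point described above; the method name comes from the issue, while the term indexing and count-vector construction are illustrative assumptions about how the dictionary might be computed:

{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Illustrative sketch: index terms, build a dictionary, and convert each
// document (a bag of words) into the word-count vector LDA already accepts.
def runWithText(docs: RDD[Seq[String]], k: Int) = {
  val dictionary = docs.flatMap(identity).distinct().collect().zipWithIndex.toMap
  val corpus = docs.zipWithIndex().map { case (words, docId) =>
    val counts = words.groupBy(dictionary).mapValues(_.size.toDouble).toSeq
    (docId, Vectors.sparse(dictionary.size, counts))
  }
  new LDA().setK(k).run(corpus)
}
{code}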
[jira] [Assigned] (SPARK-8355) Python DataFrameReader/Writer should mirror Scala
[ https://issues.apache.org/jira/browse/SPARK-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8355: --- Assignee: (was: Apache Spark) > Python DataFrameReader/Writer should mirror Scala > - > > Key: SPARK-8355 > URL: https://issues.apache.org/jira/browse/SPARK-8355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > All the functions that I can run in Scala should also work in Python. At > least {{ctx.read.option}} is missing, but we should also audit to make sure > there aren't others. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8355) Python DataFrameReader/Writer should mirror Scala
[ https://issues.apache.org/jira/browse/SPARK-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605185#comment-14605185 ] Apache Spark commented on SPARK-8355: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/7078 > Python DataFrameReader/Writer should mirror Scala > - > > Key: SPARK-8355 > URL: https://issues.apache.org/jira/browse/SPARK-8355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > All the functions that I can run in Scala should also work in Python. At > least {{ctx.read.option}} is missing, but we should also audit to make sure > there aren't others. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8355) Python DataFrameReader/Writer should mirror Scala
[ https://issues.apache.org/jira/browse/SPARK-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8355: --- Assignee: Apache Spark > Python DataFrameReader/Writer should mirror Scala > - > > Key: SPARK-8355 > URL: https://issues.apache.org/jira/browse/SPARK-8355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > > All the functions that I can run in Scala should also work in Python. At > least {{ctx.read.option}} is missing, but we should also audit to make sure > there aren't others. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
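For reference, a minimal sketch of the Scala call being mirrored, assuming a spark-shell sqlContext; the format, option key, and path are illustrative values. The option call is the one the issue flags as missing from Python:

{code}
// Scala DataFrameReader chain that the Python API should mirror.
val df = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.5")   // illustrative option
  .load("/tmp/people.json")         // illustrative path
{code}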
[jira] [Updated] (SPARK-5562) LDA should handle empty documents
[ https://issues.apache.org/jira/browse/SPARK-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5562: - Shepherd: Joseph K. Bradley > LDA should handle empty documents > - > > Key: SPARK-5562 > URL: https://issues.apache.org/jira/browse/SPARK-5562 > Project: Spark > Issue Type: Test > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > Labels: starter > Original Estimate: 96h > Remaining Estimate: 96h > > Latent Dirichlet Allocation (LDA) could easily be given empty documents when > people select a small vocabulary. We should check to make sure it is robust > to empty documents. > This will hopefully take the form of a unit test, but may require modifying > the LDA implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
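A hedged sketch of what such a unit test might look like (the vocabulary size and k are illustrative, and sc is assumed to be a test SparkContext); the point is simply that an all-zero count vector should not break training:

{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// One normal document and one empty document over a 4-term vocabulary.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 1.0)),
  (1L, Vectors.sparse(4, Seq.empty[(Int, Double)]))))  // empty document

// The test would assert this runs without error and yields k valid topics.
val model = new LDA().setK(2).run(corpus)
{code}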
[jira] [Commented] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605177#comment-14605177 ] Apache Spark commented on SPARK-8592: - User 'xuchenCN' has created a pull request for this issue: https://github.com/apache/spark/pull/7077 > CoarseGrainedExecutorBackend: Cannot register with driver => NPE > > > Key: SPARK-8592 > URL: https://issues.apache.org/jira/browse/SPARK-8592 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Scala 2.11, Java 8, >Reporter: Sjoerd Mulder >Priority: Minor > > I cannot reproduce this consistently but when submitting jobs just after > another finished it will not come up: > {code} > 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with > driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler > java.lang.NullPointerException > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) > at java.lang.String.valueOf(String.java:2982) > at > scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at org.apache.spark.Logging$class.logInfo(Logging.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at 
akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8592: --- Assignee: (was: Apache Spark) > CoarseGrainedExecutorBackend: Cannot register with driver => NPE > > > Key: SPARK-8592 > URL: https://issues.apache.org/jira/browse/SPARK-8592 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Scala 2.11, Java 8, >Reporter: Sjoerd Mulder >Priority: Minor > > I cannot reproduce this consistently but when submitting jobs just after > another finished it will not come up: > {code} > 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with > driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler > java.lang.NullPointerException > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) > at java.lang.String.valueOf(String.java:2982) > at > scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at org.apache.spark.Logging$class.logInfo(Logging.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at 
akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8592: --- Assignee: Apache Spark > CoarseGrainedExecutorBackend: Cannot register with driver => NPE > > > Key: SPARK-8592 > URL: https://issues.apache.org/jira/browse/SPARK-8592 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Scala 2.11, Java 8, >Reporter: Sjoerd Mulder >Assignee: Apache Spark >Priority: Minor > > I cannot reproduce this consistently but when submitting jobs just after > another finished it will not come up: > {code} > 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with > driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler > java.lang.NullPointerException > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) > at java.lang.String.valueOf(String.java:2982) > at > scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at org.apache.spark.Logging$class.logInfo(Logging.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at 
akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8697) MatchIterator not serializable exception in RegexTokenizer
Xiangrui Meng created SPARK-8697: Summary: MatchIterator not serializable exception in RegexTokenizer Key: SPARK-8697 URL: https://issues.apache.org/jira/browse/SPARK-8697 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor I'm not sure whether this is a real bug or not. In the REPL, I saw a MatchIterator not-serializable exception in RegexTokenizer during some ad-hoc testing. However, I couldn't reproduce the issue. Maybe it is a REPL bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
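For context, a minimal sketch of one plausible failure mode (an assumption, not a confirmed reproduction): scala.util.matching.Regex#findAllIn returns a MatchIterator, which is not serializable, so a live iterator captured by a task closure fails at closure serialization, while materializing it first is safe:

{code}
val pattern = "\\w+".r

// MatchIterator is NOT serializable; shipping it inside a task closure
// would throw java.io.NotSerializableException at serialization time.
val risky = pattern.findAllIn("some ad-hoc text")

// Materializing the matches before they escape avoids the problem.
val tokens: Seq[String] = pattern.findAllIn("some ad-hoc text").toSeq
{code}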
[jira] [Updated] (SPARK-7739) Improve ChiSqSelector example code in the user guide
[ https://issues.apache.org/jira/browse/SPARK-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7739: - Target Version/s: 1.5.0 > Improve ChiSqSelector example code in the user guide > > > Key: SPARK-7739 > URL: https://issues.apache.org/jira/browse/SPARK-7739 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 1.3.1, 1.4.0 >Reporter: Xiangrui Meng >Assignee: Seth Hendrickson >Priority: Minor > Labels: starter > > As discussed in > http://apache-spark-user-list.1001560.n3.nabble.com/Discretization-td22811.html > We should mention the values are gray levels (0-255) and change the division > to integer division. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8696) StreamingLDA
yuhao yang created SPARK-8696: - Summary: StreamingLDA Key: SPARK-8696 URL: https://issues.apache.org/jira/browse/SPARK-8696 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Streaming LDA would be a natural extension of online LDA. For now, though, we need to settle on an implementation of LDA prediction in order to support the predictOn method in the streaming version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
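A hypothetical shape for such an API, modeled on MLlib's existing StreamingKMeans; everything below is an assumption sketched for discussion, with predictOn left unimplemented because it depends on the per-document prediction method the issue says must be settled first:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Hypothetical streaming wrapper over online LDA (by analogy with StreamingKMeans).
class StreamingLDA(val k: Int) {
  def trainOn(data: DStream[(Long, Vector)]): Unit = {
    // each mini-batch would feed the existing online (mini-batch) LDA optimizer
  }
  def predictOn(data: DStream[(Long, Vector)]): DStream[Vector] = {
    // needs a settled per-document topic-inference method on the LDA model
    ???
  }
}
{code}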
[jira] [Updated] (SPARK-7739) Improve ChiSqSelector example code in the user guide
[ https://issues.apache.org/jira/browse/SPARK-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7739: - Assignee: Seth Hendrickson > Improve ChiSqSelector example code in the user guide > > > Key: SPARK-7739 > URL: https://issues.apache.org/jira/browse/SPARK-7739 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 1.3.1, 1.4.0 >Reporter: Xiangrui Meng >Assignee: Seth Hendrickson >Priority: Minor > Labels: starter > > As discussed in > http://apache-spark-user-list.1001560.n3.nabble.com/Discretization-td22811.html > We should mention the values are gray levels (0-255) and change the division > to integer division. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
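A hedged sketch of the discretization the fix calls for, assuming MNIST-style gray-level features in [0, 255]; dividing by 16 and flooring (integer division for non-negative values) buckets each pixel into one of 16 levels:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Illustrative: gray levels 0-255 reduced to 16 bins via integer division.
def discretize(lp: LabeledPoint): LabeledPoint =
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map(v => (v / 16).floor)))
{code}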
[jira] [Resolved] (SPARK-8575) Deprecate callUDF in favor of udf
[ https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8575. -- Resolution: Fixed Issue resolved by pull request 6993 [https://github.com/apache/spark/pull/6993] > Deprecate callUDF in favor of udf > - > > Key: SPARK-8575 > URL: https://issues.apache.org/jira/browse/SPARK-8575 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Benjamin Fradet >Assignee: Benjamin Fradet >Priority: Minor > Fix For: 1.5.0 > > > Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to > use {{udf}} in favor of {{callUDF}} wherever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
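For context, a minimal sketch of the preferred pattern after this change; df and the column name are illustrative, and callUDF is the call being deprecated:

{code}
import org.apache.spark.sql.functions.udf

// Preferred: define a typed UDF once and apply it as a Column expression.
val plusOne = udf((x: Int) => x + 1)
val result = df.select(plusOne(df("value")))   // df is an illustrative DataFrame
{code}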
[jira] [Resolved] (SPARK-5962) [MLLIB] Python support for Power Iteration Clustering
[ https://issues.apache.org/jira/browse/SPARK-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5962. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6992 [https://github.com/apache/spark/pull/6992] > [MLLIB] Python support for Power Iteration Clustering > - > > Key: SPARK-5962 > URL: https://issues.apache.org/jira/browse/SPARK-5962 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Stephen Boesch >Assignee: Yanbo Liang > Labels: python > Fix For: 1.5.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Add Python support for the Power Iteration Clustering feature. Here is a > fragment of the Python API as we plan to implement it: > /** >* Java stub for Python mllib PowerIterationClustering.run() >*/ > def trainPowerIterationClusteringModel( > data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)], > k: Int, > maxIterations: Int, > runs: Int, > initializationMode: String, > seed: java.lang.Long): PowerIterationClusteringModel = { > val picAlg = new PowerIterationClustering() > .setK(k) > .setMaxIterations(maxIterations) > try { > picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK)) > } finally { > data.rdd.unpersist(blocking = false) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7212) Frequent pattern mining for sequential item sets
[ https://issues.apache.org/jira/browse/SPARK-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7212. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6997 [https://github.com/apache/spark/pull/6997] > Frequent pattern mining for sequential item sets > > > Key: SPARK-7212 > URL: https://issues.apache.org/jira/browse/SPARK-7212 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Feynman Liang > Fix For: 1.5.0 > > > Currently, FPGrowth handles unordered item sets. It would be great to be > able to handle sequences of items, in which the order matters. This JIRA is > for discussing modifications to FPGrowth and/or new algorithms for handling > sequences. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8695) TreeAggregation shouldn't be triggered for 5 partitions
[ https://issues.apache.org/jira/browse/SPARK-8695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8695: - Priority: Minor (was: Major) > TreeAggregation shouldn't be triggered for 5 partitions > --- > > Key: SPARK-8695 > URL: https://issues.apache.org/jira/browse/SPARK-8695 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > If an RDD has 5 partitions, tree aggregation doesn't reduce the wall-clock > time. Instead, it introduces scheduling and shuffling overhead. We should > update the condition to use tree aggregation (code attached): > {code} > while (numPartitions > scale + numPartitions / scale) { > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8695) TreeAggregation shouldn't be triggered for 5 partitions
Xiangrui Meng created SPARK-8695: Summary: TreeAggregation shouldn't be triggered for 5 partitions Key: SPARK-8695 URL: https://issues.apache.org/jira/browse/SPARK-8695 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng If an RDD has 5 partitions, tree aggregation doesn't reduce the wall-clock time. Instead, it introduces scheduling and shuffling overhead. We should update the condition to use tree aggregation (code attached): {code} while (numPartitions > scale + numPartitions / scale) { {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
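A worked illustration of why 5 partitions trigger an extra level, assuming treeAggregate's default depth of 2, for which the scale factor is ceil(sqrt(5)) = 3: the integer division in the current condition undercounts the post-combine partition count. The ceil-based variant below is one possible tightening, shown as an assumption rather than the merged fix:

{code}
val numPartitions = 5
val scale = math.max(math.ceil(math.sqrt(numPartitions)).toInt, 2)  // 3

// Current condition: 5 > 3 + 5/3 = 3 + 1 -> true, so a tree level is added.
val current = numPartitions > scale + numPartitions / scale

// Ceil variant: 5 > 3 + ceil(5/3) = 3 + 2 -> false, no extra level for 5 partitions.
val tightened = numPartitions > scale + math.ceil(numPartitions.toDouble / scale).toInt
{code}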
[jira] [Commented] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled
[ https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605126#comment-14605126 ] Carson Wang commented on SPARK-8405: Thank you very much, [~hshreedharan]. After I configured yarn.log.server.url in yarn-site.xml, the log URL did redirect to the MR job history server's UI and was able to show the logs. But this seems a little strange, because we need to view Spark's logs on the MR history server. Would it make more sense for the Spark history server to read and show the aggregated logs itself? > Show executor logs on Web UI when Yarn log aggregation is enabled > - > > Key: SPARK-8405 > URL: https://issues.apache.org/jira/browse/SPARK-8405 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Carson Wang > Attachments: SparkLogError.png > > > When running a Spark application in Yarn mode with Yarn log aggregation > enabled, the customer is not able to view executor logs on the history server Web > UI. The only way for the customer to view the logs is through the Yarn command > "yarn logs -applicationId ". > A screenshot of the error is attached. When you click an executor's log link > on the Spark history server, you'll see the error if Yarn log aggregation is > enabled. The log URL redirects the user to the node manager's UI. This works if > the logs are located on that node. But since log aggregation is enabled, the > local logs are deleted once log aggregation is completed. > The logs should be available through the web UIs just like for other Hadoop > components such as MapReduce. For security reasons, end users may not be able to > log into the nodes and run the yarn logs -applicationId command. The web UIs > can be viewable and exposed through the firewall if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
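For readers hitting the same redirect issue, the workaround mentioned above uses the standard Hadoop property shown below (the hostname is illustrative); it tells node managers where to send log requests once aggregation has removed the local files:

{code}
<!-- yarn-site.xml: redirect log links to the MR JobHistory server's log UI -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://historyserver.example.com:19888/jobhistory/logs</value>
</property>
{code}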
[jira] [Assigned] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8694: --- Assignee: (was: Apache Spark) > Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid > freezing the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta > > When there are massive numbers of tasks in the stage page (such as running > sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since the Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so > that the user can view the page while the Event Timeline renders in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605123#comment-14605123 ] Apache Spark commented on SPARK-8694: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7071 > Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid > freezing the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta > > When there are massive numbers of tasks in the stage page (such as running > sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since the Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so > that the user can view the page while the Event Timeline renders in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8694: --- Assignee: Apache Spark > Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid > freezing the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > When there are massive numbers of tasks in the stage page (such as running > sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since the Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so > that the user can view the page while the Event Timeline renders in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605117#comment-14605117 ] Shay Rojansky commented on SPARK-8374: -- Any chance someone can look at this bug, at least to confirm it? This is a pretty serious issue preventing use of Spark 1.4 on YARN where preemption may happen... > Job frequently hangs after YARN preemption > -- > > Key: SPARK-8374 > URL: https://issues.apache.org/jira/browse/SPARK-8374 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.4.0 > Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 >Reporter: Shay Rojansky >Priority: Critical > > After upgrading to Spark 1.4.0, jobs that get preempted very frequently will > not reacquire executors and will therefore hang. To reproduce: > 1. I run Spark job A that acquires all grid resources > 2. I run Spark job B in a higher-priority queue that acquires all grid > resources. Job A is fully preempted. > 3. Kill job B, releasing all resources > 4. Job A should at this point reacquire all grid resources, but occasionally > doesn't. Repeating the preemption scenario makes the bad behavior occur > within a few attempts. > (see logs at bottom). > Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption > issues; the work there may be related to the new issues. > The 1.4.0 preemption situation is considerably worse than in 1.3.1 (we've > downgraded to 1.3.1 just because of this issue). > Logs > -- > When job B (the preemptor) first acquires an application master, the following > is logged by job A (the preemptee): > {noformat} > ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc > client disassociated > INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 > WARN ReliableDeliverySupervisor: Association with remote system > [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address > is now gated for [5000] ms. Reason is: [Disassociated]. > WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, > g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) > INFO DAGScheduler: Executor lost: 447 (epoch 0) > INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from > BlockManagerMaster. > INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, > g023.grid.eaglerd.local, 41406) > INFO BlockManagerMaster: Removed 447 successfully in removeExecutor > {noformat} > (It's strange for errors/warnings to be logged for preemption) > Later, when job B's AM starts requesting its resources, I get lots of the > following in job A: > {noformat} > ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc > client disassociated > INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 > WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, > g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) > WARN ReliableDeliverySupervisor: Association with remote system > [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address > is now gated for [5000] ms. Reason is: [Disassociated]. > {noformat} > Finally, when I kill job B, job A emits lots of the following: > {noformat} > INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 > WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! 
> {noformat} > And finally after some time: > {noformat} > WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: > 165964 ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost an executor 466 (already removed): Executor > heartbeat timed out after 165964 ms > {noformat} > At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
Kousuke Saruta created SPARK-8694: - Summary: Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page Key: SPARK-8694 URL: https://issues.apache.org/jira/browse/SPARK-8694 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0, 1.5.0 Reporter: Kousuke Saruta When there are massive numbers of tasks in the stage page (such as running sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds to render the graph (drawTaskAssignmentTimeline) in my environment. The page is unresponsive until the graph is ready. However, since the Event Timeline is hidden by default, we can defer drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so that the user can view the page while the Event Timeline renders in the background. This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605112#comment-14605112 ] Animesh Baranawal commented on SPARK-8636: -- No, a row with null will not be equal to another row with null if we follow the modification. What do you say, [~smolav]? > CaseKeyWhen has incorrect NULL handling > --- > > Key: SPARK-8636 > URL: https://issues.apache.org/jira/browse/SPARK-8636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Santiago M. Mola > Labels: starter > > The CaseKeyWhen implementation in Spark uses the following equals implementation: > {code} > private def equalNullSafe(l: Any, r: Any) = { > if (l == null && r == null) { > true > } else if (l == null || r == null) { > false > } else { > l == r > } > } > {code} > This is not correct, since in SQL, NULL is never equal to NULL (actually, it > is not unequal either). In this case, a NULL value in a CASE WHEN expression > should never match. > For example, you can execute this in MySQL: > {code} > SELECT CASE NULL WHEN NULL THEN "NULL MATCHES" ELSE "NULL DOES NOT MATCH" END > FROM DUAL; > {code} > And the result will be "NULL DOES NOT MATCH". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
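A minimal sketch of the SQL-conformant comparison being discussed (an illustration of the semantics, not the committed patch): a NULL on either side should simply never match.

{code}
// SQL semantics: NULL = NULL is not true, so a CASE key of NULL matches no branch.
def matchesBranch(key: Any, branchKey: Any): Boolean =
  if (key == null || branchKey == null) false else key == branchKey
{code}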
[jira] [Comment Edited] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605078#comment-14605078 ] Gergely Svigruha edited comment on SPARK-8616 at 6/29/15 3:43 AM: -- Thanks for working on this! FYI, MySQL may allow double quotes if the ANSI_QUOTES mode is enabled: https://dev.mysql.com/doc/refman/5.0/en/identifiers.html was (Author: gsvigruha): FYI, MySQL may allow double quotes if the ANSI_QUOTES mode is enabled: https://dev.mysql.com/doc/refman/5.0/en/identifiers.html > SQLContext doesn't handle tricky column names when loading from JDBC > > > Key: SPARK-8616 > URL: https://issues.apache.org/jira/browse/SPARK-8616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 >Reporter: Gergely Svigruha > > Reproduce: > - create a table in a relational database (in my case SQLite) with a column > name containing a space: > CREATE TABLE my_table (id INTEGER, "tricky column" TEXT); > - try to create a DataFrame using that table: > sqlContext.read.format("jdbc").options(Map( > "url" -> "jdbc:sqlite:...", > "dbtable" -> "my_table")).load() > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > column: tricky) > According to the SQL spec this should be valid: > http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605078#comment-14605078 ] Gergely Svigruha commented on SPARK-8616: - FYI, MySQL may allow double quotes if the ANSI_QUOTES mode is enabled: https://dev.mysql.com/doc/refman/5.0/en/identifiers.html > SQLContext doesn't handle tricky column names when loading from JDBC > > > Key: SPARK-8616 > URL: https://issues.apache.org/jira/browse/SPARK-8616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 >Reporter: Gergely Svigruha > > Reproduce: > - create a table in a relational database (in my case SQLite) with a column > name containing a space: > CREATE TABLE my_table (id INTEGER, "tricky column" TEXT); > - try to create a DataFrame using that table: > sqlContext.read.format("jdbc").options(Map( > "url" -> "jdbc:sqlite:...", > "dbtable" -> "my_table")).load() > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > column: tricky) > According to the SQL spec this should be valid: > http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
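The usual remedy is to emit ANSI delimited identifiers when building the generated SELECT; a minimal sketch of that quoting (illustrative, not the actual JDBC data source code):

{code}
// Quote a column name as an ANSI delimited identifier, escaping embedded quotes.
def quoteIdentifier(name: String): String =
  "\"" + name.replace("\"", "\"\"") + "\""

// e.g. SELECT "id", "tricky column" FROM my_table
val columns = Seq("id", "tricky column").map(quoteIdentifier).mkString(", ")
val sql = s"SELECT $columns FROM my_table"
{code}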
[jira] [Commented] (SPARK-8651) Lasso with SGD not Converging properly
[ https://issues.apache.org/jira/browse/SPARK-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605056#comment-14605056 ] yuhao yang commented on SPARK-8651: --- [~aazout] I know someone who met a similar issue. Usually it's because the weights have not started converging. Try more iterations (2000 or more) or a larger step size (4 or 5). Let me know if that helps. (If not, please share part of your code or data.) > Lasso with SGD not Converging properly > -- > > Key: SPARK-8651 > URL: https://issues.apache.org/jira/browse/SPARK-8651 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Albert Azout > > We are having issues getting Lasso with SGD to converge properly. The weights > output are extremely large values. We have tried multiple miniBatchRatios > and still see the same issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
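A hedged sketch of the suggestion above against the MLlib 1.4 API; the data path and regParam are illustrative, sc is assumed to be an existing SparkContext, and numIterations and stepSize are the values suggested in the comment:

{code}
import org.apache.spark.mllib.regression.LassoWithSGD
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/lasso_points.txt")  // illustrative path

// More iterations and a larger step size, per the comment above.
val model = LassoWithSGD.train(
  data,
  2000,  // numIterations: try 2000 or more
  4.0,   // stepSize: try 4 or 5
  0.01,  // regParam (illustrative)
  1.0)   // miniBatchFraction
{code}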
[jira] [Updated] (SPARK-7845) Bump "Hadoop 1" tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7845: Fix Version/s: 1.5.0 > Bump "Hadoop 1" tests to version 1.2.1 > -- > > Key: SPARK-7845 > URL: https://issues.apache.org/jira/browse/SPARK-7845 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Patrick Wendell >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.4.0, 1.5.0 > > > A small number of APIs were added to Hadoop between 1.0.4 and 1.2.1. It > appears this is one cause of SPARK-7843, since some Hive code relies on newer > Hadoop APIs. My feeling is we should just bump our tested version up to > 1.2.1 (both versions are extremely old). If users are still on < 1.2.1 and > run into some of these corner cases, we can consider doing some engineering > and supporting the older versions. I'd like to bump our test version though > and let this be driven by users, if they exist. > https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7845) Bump "Hadoop 1" tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7845. - Resolution: Fixed Assignee: Cheng Lian (was: Yin Huai) Target Version/s: 1.4.0, 1.5.0 (was: 1.4.0) https://github.com/apache/spark/pull/7062 has fixed the master branch (1.5.0). > Bump "Hadoop 1" tests to version 1.2.1 > -- > > Key: SPARK-7845 > URL: https://issues.apache.org/jira/browse/SPARK-7845 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Patrick Wendell >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.4.0 > > > A small number of APIs were added to Hadoop between 1.0.4 and 1.2.1. It > appears this is one cause of SPARK-7843, since some Hive code relies on newer > Hadoop APIs. My feeling is we should just bump our tested version up to > 1.2.1 (both versions are extremely old). If users are still on < 1.2.1 and > run into some of these corner cases, we can consider doing some engineering > and supporting the older versions. I'd like to bump our test version though > and let this be driven by users, if they exist. > https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605044#comment-14605044 ] Yin Huai commented on SPARK-8693: - [~boyork] Can you take a look at it when you get time? > profiles and goals are not printed in a nice way > > > Key: SPARK-8693 > URL: https://issues.apache.org/jira/browse/SPARK-8693 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Yin Huai >Priority: Minor > > In our master build, I see > {code} > -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these > arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using > SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) > using SBT with these arguments: -Phive-thriftserver[info] Building Spark > (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark > (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark > (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] > Building Spark (w/Hive 0.13.1) using SBT with these arguments: > streaming-kafka-assembly/assembly > {code} > It seems we format the string in the wrong way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8693) profiles and goals are not printed in a nice way
Yin Huai created SPARK-8693: --- Summary: profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Reporter: Yin Huai Priority: Minor In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} It seems we format the string in the wrong way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
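The output above looks like one banner rendered per argument and concatenated with no separator, rather than one banner over the joined argument list. Schematically (the real code lives in dev/run-tests, so this Scala sketch illustrates the bug shape, not the script itself):

{code}
val sbtArgs = Seq("-Phadoop-1", "-Dhadoop.version=1.0.4", "-Pkinesis-asl", "package")
val banner = "[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: "

// Buggy shape: one banner per argument, concatenated with no separator.
val wrong = sbtArgs.map(arg => banner + arg).mkString

// Intended shape: a single banner followed by the space-joined arguments.
val right = banner + sbtArgs.mkString(" ")
{code}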
[jira] [Assigned] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8443: --- Assignee: Apache Spark > GenerateMutableProjection Exceeds JVM Code Size Limits > -- > > Key: SPARK-8443 > URL: https://issues.apache.org/jira/browse/SPARK-8443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Sen Fang >Assignee: Apache Spark > > GenerateMutableProjection puts all expression columns into a single apply > function. When there are a lot of columns, the apply function's code size > exceeds the 64 KB limit, which is a hard JVM limit that cannot be changed. > This came up when we were aggregating about 100 columns using the codegen and > unsafe features. > I wrote a unit test that reproduces this issue. > https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala > This test currently fails at 2048 expressions. It seems the master is more > tolerant than branch-1.4 about this because the generated code is more concise. > While the code on master has changed since branch-1.4, I am able to reproduce > the problem in master. For now I hacked my way in branch-1.4 to work around > this problem by wrapping each expression in a separate function and then calling > those functions sequentially in apply. The proper way is probably to check the > length of the projectCode and break it up as necessary. (This actually seems > easier in master since we build code from strings rather than > quasiquotes.) > Let me know if anyone has additional thoughts on this; I'm happy to > contribute a pull request. > Attaching the stack trace produced by the unit test: > {code} > [info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds) > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class "SC$SpecificProjection" > grows beyond 64 KB > [info] at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263) > [info] at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > [info] at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > [info] at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at scala.collection.immutable.Range.foreach(Range.scala:141) > [info] at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) > [info] at > scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.sc
[jira] [Assigned] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8443: --- Assignee: (was: Apache Spark) > GenerateMutableProjection Exceeds JVM Code Size Limits > -- > > Key: SPARK-8443 > URL: https://issues.apache.org/jira/browse/SPARK-8443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Sen Fang > > GenerateMutableProjection puts all expression columns into a single apply > function. When there are a lot of columns, the apply function's code size > exceeds the 64KB limit, which is a hard JVM limit that cannot be changed. > This came up when we were aggregating about 100 columns using the codegen and > unsafe features. > I wrote a unit test that reproduces this issue: > https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala > This test currently fails at 2048 expressions. Master seems more tolerant than > branch-1.4 here because its generated code is more concise. > While the code on master has changed since branch-1.4, I am able to reproduce > the problem on master. For now I hacked around this in branch-1.4 by wrapping > each expression in a separate function and then calling those functions > sequentially in apply. The proper fix is probably to check the length of the > projectCode and break it up as necessary. (This seems easier on master, > actually, since we build the code from strings rather than quasiquotes.) > Let me know if anyone has additional thoughts on this; I'm happy to > contribute a pull request. > Attaching stack trace produced by unit test > {code} > [info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds) > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class "SC$SpecificProjection" > grows beyond 64 KB > [info] at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263) > [info] at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > [info] at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > [info] at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at scala.collection.immutable.Range.foreach(Range.scala:141) > [info] at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) > [info] at > scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$ano
[jira] [Commented] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605028#comment-14605028 ] Apache Spark commented on SPARK-8443: - User 'saurfang' has created a pull request for this issue: https://github.com/apache/spark/pull/7076 > GenerateMutableProjection Exceeds JVM Code Size Limits > -- > > Key: SPARK-8443 > URL: https://issues.apache.org/jira/browse/SPARK-8443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Sen Fang > > GenerateMutableProjection puts all expression columns into a single apply > function. When there are a lot of columns, the apply function's code size > exceeds the 64KB limit, which is a hard JVM limit that cannot be changed. > This came up when we were aggregating about 100 columns using the codegen and > unsafe features. > I wrote a unit test that reproduces this issue: > https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala > This test currently fails at 2048 expressions. Master seems more tolerant than > branch-1.4 here because its generated code is more concise. > While the code on master has changed since branch-1.4, I am able to reproduce > the problem on master. For now I hacked around this in branch-1.4 by wrapping > each expression in a separate function and then calling those functions > sequentially in apply. The proper fix is probably to check the length of the > projectCode and break it up as necessary. (This seems easier on master, > actually, since we build the code from strings rather than quasiquotes.) > Let me know if anyone has additional thoughts on this; I'm happy to > contribute a pull request. 
> Attaching stack trace produced by unit test > {code} > [info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds) > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class "SC$SpecificProjection" > grows beyond 64 KB > [info] at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263) > [info] at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > [info] at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > [info] at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at scala.collection.immutable.Range.foreach(Range.scala:141) > [info] at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) > [info] at > scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalate
[jira] [Updated] (SPARK-8692) re-order the case statements that handle catalyst data types
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8692: -- Shepherd: Cheng Lian > re-order the case statements that handle catalyst data types > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8424) Add blacklist mechanism for task scheduler and Yarn container allocation
[ https://issues.apache.org/jira/browse/SPARK-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao closed SPARK-8424. -- Resolution: Duplicate > Add blacklist mechanism for task scheduler and Yarn container allocation > > > Key: SPARK-8424 > URL: https://issues.apache.org/jira/browse/SPARK-8424 > Project: Spark > Issue Type: New Feature > Components: Scheduler, YARN >Affects Versions: 1.4.0 >Reporter: Saisai Shao > > MapReduce has long had a blacklist and a graylist to exclude constantly > failing TaskTrackers/nodes; this is important for a large cluster, where the > chance of hardware and software failure keeps growing. > Unfortunately the current version of Spark lacks such a mechanism to blacklist > constantly failing executors/nodes. The only blacklist mechanism in Spark is > to avoid relaunching a task on the same executor where that task previously > failed within a specified time. So here we propose a > new feature to add a blacklist mechanism to Spark; the proposal is divided > into two sub-tasks: > 1. Add a heuristic blacklist algorithm that tracks the status of executors via > the status of finished tasks, and enable the blacklist mechanism in task > scheduling. > 2. Enable the blacklist mechanism in YARN container allocation (avoid allocating > containers on blacklisted hosts). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
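As a rough illustration of sub-task 1, such a heuristic could count recent task failures per executor and exclude executors that cross a threshold within a sliding window. The sketch below is hypothetical; the class, threshold, and window are not from the proposal:
{code}
import scala.collection.mutable

// Blacklist an executor once it accumulates maxFailures failed tasks
// within the last windowMs milliseconds.
class ExecutorBlacklist(maxFailures: Int, windowMs: Long) {
  private val failures = mutable.Map.empty[String, List[Long]]

  def recordFailure(executorId: String, now: Long): Unit = {
    val recent = failures.getOrElse(executorId, Nil).filter(now - _ < windowMs)
    failures(executorId) = now :: recent
  }

  def isBlacklisted(executorId: String, now: Long): Boolean =
    failures.getOrElse(executorId, Nil).count(now - _ < windowMs) >= maxFailures
}
{code}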
[jira] [Commented] (SPARK-8424) Add blacklist mechanism for task scheduler and Yarn container allocation
[ https://issues.apache.org/jira/browse/SPARK-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605001#comment-14605001 ] Saisai Shao commented on SPARK-8424: OK, I will close this JIRA. > Add blacklist mechanism for task scheduler and Yarn container allocation > > > Key: SPARK-8424 > URL: https://issues.apache.org/jira/browse/SPARK-8424 > Project: Spark > Issue Type: New Feature > Components: Scheduler, YARN >Affects Versions: 1.4.0 >Reporter: Saisai Shao > > MapReduce has long had a blacklist and a graylist to exclude constantly > failing TaskTrackers/nodes; this is important for a large cluster, where the > chance of hardware and software failure keeps growing. > Unfortunately the current version of Spark lacks such a mechanism to blacklist > constantly failing executors/nodes. The only blacklist mechanism in Spark is > to avoid relaunching a task on the same executor where that task previously > failed within a specified time. So here we propose a > new feature to add a blacklist mechanism to Spark; the proposal is divided > into two sub-tasks: > 1. Add a heuristic blacklist algorithm that tracks the status of executors via > the status of finished tasks, and enable the blacklist mechanism in task > scheduling. > 2. Enable the blacklist mechanism in YARN container allocation (avoid allocating > containers on blacklisted hosts). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8692) re-order the case statements that handle catalyst data types
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-8692: --- Summary: re-order the case statements that handle catalyst data types (was: refactor the handling of date and timestamp) > re-order the case statements that handle catalyst data types > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8692) refactor the handling of date and timestamp
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-8692: --- Summary: refactor the handling of date and timestamp (was: improve date and timestamp handling) > refactor the handling of date and timestamp > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8577) ScalaReflectionLock.synchronized can cause deadlock
[ https://issues.apache.org/jira/browse/SPARK-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604935#comment-14604935 ] koert kuipers commented on SPARK-8577: -- I do not have a way to reproduce this in non-test code. I do have a thread dump but cannot share all of it; see below. Both threads are blocked by ScalaReflectionLock.synchronized.
Thread 2942: (state = BLOCKED)
- org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(org.apache.spark.sql.catalyst.ScalaReflection, scala.reflect.api.Types$TypeApi) @bci=2611, line=140 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(scala.reflect.api.Types$TypeApi) @bci=2, line=28 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(scala.reflect.api.Symbols$SymbolApi) @bci=21, line=123 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(java.lang.Object) @bci=5, line=121 (Interpreted frame)
- scala.collection.immutable.List.map(scala.Function1, scala.collection.generic.CanBuildFrom) @bci=74, line=277 (Compiled frame)
- org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(org.apache.spark.sql.catalyst.ScalaReflection, scala.reflect.api.Types$TypeApi) @bci=1699, line=121 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(scala.reflect.api.Types$TypeApi) @bci=2, line=28 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(org.apache.spark.sql.catalyst.ScalaReflection, scala.reflect.api.TypeTags$TypeTag) @bci=12, line=59 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(scala.reflect.api.TypeTags$TypeTag) @bci=2, line=28 (Interpreted frame)
Thread 2941: (state = BLOCKED)
- org.apache.spark.sql.types.AtomicType.<init>() @bci=11, line=95 (Interpreted frame)
- org.apache.spark.sql.types.NumericType.<init>() @bci=1, line=107 (Interpreted frame)
- org.apache.spark.sql.types.IntegralType.<init>() @bci=1, line=141 (Interpreted frame)
- org.apache.spark.sql.types.IntegerType.<init>() @bci=1, line=34 (Interpreted frame)
- org.apache.spark.sql.types.IntegerType$.<init>() @bci=1, line=54 (Interpreted frame)
- org.apache.spark.sql.types.IntegerType$.<clinit>() @bci=3 (Interpreted frame)
> ScalaReflectionLock.synchronized can cause deadlock > --- > > Key: SPARK-8577 > URL: https://issues.apache.org/jira/browse/SPARK-8577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: koert kuipers >Priority: Minor > > Just a heads up: I was doing some basic coding using DataFrame, Row, > StructType, etc. in my own project and I ended up with deadlocks in my sbt > tests due to the usage of ScalaReflectionLock.synchronized in the Spark SQL > code. > The issue went away when I changed my build to have: > parallelExecution in Test := false > so that the tests run consecutively... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
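The dump above matches a classic initialization-order deadlock: one thread holds a global lock while forcing a class's static initializer, while another thread is already inside that initializer and waiting for the same lock. A contrived reproduction of the pattern (not Spark code; Scala 2.12 syntax) looks like this:
{code}
object GlobalLock

object Holder {
  // Evaluated inside Holder's <clinit>, which acquires GlobalLock.
  val value: Int = GlobalLock.synchronized { 42 }
}

object DeadlockDemo extends App {
  // If thread b enters Holder's static initializer first and thread a
  // takes GlobalLock first, each waits on the other forever.
  val a = new Thread(() => GlobalLock.synchronized { println(Holder.value) })
  val b = new Thread(() => println(Holder.value))
  a.start(); b.start()
}
{code}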
[jira] [Updated] (SPARK-8677) Decimal divide operation throws ArithmeticException
[ https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8677: -- Assignee: Liang-Chi Hsieh > Decimal divide operation throws ArithmeticException > --- > > Key: SPARK-8677 > URL: https://issues.apache.org/jira/browse/SPARK-8677 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.5.0 > > > Please refer to the [BigDecimal > doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]: > {quote} > ... the rounding mode setting of a MathContext object with a precision > setting of 0 is not used and thus irrelevant. In the case of divide, the > exact quotient could have an infinitely long decimal expansion; for example, > 1 divided by 3. > {quote} > Because we use MathContext.UNLIMITED in toBigDecimal, the Decimal divide > operation throws the following exception: > {code} > val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3) > [info] java.lang.ArithmeticException: Non-terminating decimal expansion; no > exact representable decimal result. > [info] at java.math.BigDecimal.divide(BigDecimal.java:1690) > [info] at java.math.BigDecimal.divide(BigDecimal.java:1723) > [info] at scala.math.BigDecimal.$div(BigDecimal.scala:256) > [info] at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
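For reference, the failure is inherent to java.math.BigDecimal rather than to Spark: dividing with unlimited precision throws exactly this exception for quotients like 1/3, while any bounded MathContext terminates. A standalone illustration (DECIMAL128 is just an example context, not necessarily what the fix uses):
{code}
import java.math.{BigDecimal => JBigDecimal, MathContext}

val one   = new JBigDecimal("1.000")
val three = new JBigDecimal("3.000")

// Unlimited precision: 1/3 has no finite decimal expansion, so this throws
// java.lang.ArithmeticException: Non-terminating decimal expansion.
// one.divide(three, MathContext.UNLIMITED)

// Bounded precision rounds the quotient and terminates normally.
val q = one.divide(three, MathContext.DECIMAL128)  // 0.3333333333333333...
{code}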
[jira] [Resolved] (SPARK-8677) Decimal divide operation throws ArithmeticException
[ https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8677. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7056 [https://github.com/apache/spark/pull/7056] > Decimal divide operation throws ArithmeticException > --- > > Key: SPARK-8677 > URL: https://issues.apache.org/jira/browse/SPARK-8677 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 1.5.0 > > > Please refer to the [BigDecimal > doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]: > {quote} > ... the rounding mode setting of a MathContext object with a precision > setting of 0 is not used and thus irrelevant. In the case of divide, the > exact quotient could have an infinitely long decimal expansion; for example, > 1 divided by 3. > {quote} > Because we use MathContext.UNLIMITED in toBigDecimal, the Decimal divide > operation throws the following exception: > {code} > val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3) > [info] java.lang.ArithmeticException: Non-terminating decimal expansion; no > exact representable decimal result. > [info] at java.math.BigDecimal.divide(BigDecimal.java:1690) > [info] at java.math.BigDecimal.divide(BigDecimal.java:1723) > [info] at scala.math.BigDecimal.$div(BigDecimal.scala:256) > [info] at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8674: --- Assignee: Apache Spark > 2-sample, 2-sided Kolmogorov Smirnov Test Implementation > > > Key: SPARK-8674 > URL: https://issues.apache.org/jira/browse/SPARK-8674 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Assignee: Apache Spark >Priority: Minor > > We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov > test for two RDD[Double]s. The calculation provides a test of the null > hypothesis that both samples come from the same probability distribution. The > implementation seeks to minimize the number of shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
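For context, the 2-sample, 2-sided statistic is the largest absolute gap between the two empirical CDFs. A minimal local (non-distributed) sketch of the statistic, which is not the pull-request implementation and handles ties only crudely:
{code}
// D = sup_x |F1(x) - F2(x)|, computed by sweeping the two sorted samples.
def ksStatistic(xs: Array[Double], ys: Array[Double]): Double = {
  val sx = xs.sorted; val sy = ys.sorted
  var i = 0; var j = 0; var d = 0.0
  while (i < sx.length && j < sy.length) {
    if (sx(i) <= sy(j)) i += 1 else j += 1
    val gap = math.abs(i.toDouble / sx.length - j.toDouble / sy.length)
    if (gap > d) d = gap
  }
  d
}
{code}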
[jira] [Assigned] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8674: --- Assignee: (was: Apache Spark) > 2-sample, 2-sided Kolmogorov Smirnov Test Implementation > > > Key: SPARK-8674 > URL: https://issues.apache.org/jira/browse/SPARK-8674 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Priority: Minor > > We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov > test for two RDD[Double]s. The calculation provides a test of the null > hypothesis that both samples come from the same probability distribution. The > implementation seeks to minimize the number of shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604898#comment-14604898 ] Apache Spark commented on SPARK-8674: - User 'josepablocam' has created a pull request for this issue: https://github.com/apache/spark/pull/7075 > 2-sample, 2-sided Kolmogorov Smirnov Test Implementation > > > Key: SPARK-8674 > URL: https://issues.apache.org/jira/browse/SPARK-8674 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Priority: Minor > > We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov > test for two RDD[Double]s. The calculation provides a test of the null > hypothesis that both samples come from the same probability distribution. The > implementation seeks to minimize the number of shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604897#comment-14604897 ] Shivaram Venkataraman commented on SPARK-8596: -- Merged https://github.com/apache/spark/pull/7068 to open the RStudio port. I'm keeping the JIRA open till we fix the second part of this issue too. > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7894: --- Assignee: Apache Spark > Graph Union Operator > > > Key: SPARK-7894 > URL: https://issues.apache.org/jira/browse/SPARK-7894 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Andy Huang >Assignee: Apache Spark > Labels: graph, union > Attachments: union_operator.png > > > This operator aims to union two graphs and generate a new graph directly. The > union of two graphs is the union of their vertex sets and their edge > families. Vertices and edges that are included in either graph will be part > of the new graph. > bq. G ∪ H = (VG ∪ VH, EG ∪ EH). > The image below shows a union of graph G and graph H > !union_operator.png|width=600px,align=center! > A simple interface would be: > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] > However, overlapping vertices and edges between the borders of the two graphs > are inevitable. For vertices, it's quite natural to just take the union and > remove the duplicates. But for edges, a mergeEdges function seems > more reasonable. > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: > (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
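Under the stated semantics, a sketch built from plain GraphX and RDD primitives might look like the following; this is illustrative only and not the pull-request implementation (duplicate vertices arbitrarily keep the first attribute seen):
{code}
import org.apache.spark.graphx._
import scala.reflect.ClassTag

def union[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  // Union of vertex sets; on duplicates keep one attribute.
  val vertices = g.vertices.union(h.vertices).reduceByKey((a, _) => a)
  // Union of edge families; merge attributes of edges present in both graphs.
  val edges = g.edges.union(h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}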
[jira] [Assigned] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7894: --- Assignee: (was: Apache Spark) > Graph Union Operator > > > Key: SPARK-7894 > URL: https://issues.apache.org/jira/browse/SPARK-7894 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Andy Huang > Labels: graph, union > Attachments: union_operator.png > > > This operator aims to union two graphs and generate a new graph directly. The > union of two graphs is the union of their vertex sets and their edge > families. Vertices and edges that are included in either graph will be part > of the new graph. > bq. G ∪ H = (VG ∪ VH, EG ∪ EH). > The image below shows a union of graph G and graph H > !union_operator.png|width=600px,align=center! > A simple interface would be: > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] > However, overlapping vertices and edges between the borders of the two graphs > are inevitable. For vertices, it's quite natural to just take the union and > remove the duplicates. But for edges, a mergeEdges function seems > more reasonable. > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: > (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604830#comment-14604830 ] Apache Spark commented on SPARK-7894: - User 'arnabguin' has created a pull request for this issue: https://github.com/apache/spark/pull/7074 > Graph Union Operator > > > Key: SPARK-7894 > URL: https://issues.apache.org/jira/browse/SPARK-7894 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Andy Huang > Labels: graph, union > Attachments: union_operator.png > > > This operator aims to union two graphs and generate a new graph directly. The > union of two graphs is the union of their vertex sets and their edge > families. Vertices and edges that are included in either graph will be part > of the new graph. > bq. G ∪ H = (VG ∪ VH, EG ∪ EH). > The image below shows a union of graph G and graph H > !union_operator.png|width=600px,align=center! > A simple interface would be: > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] > However, overlapping vertices and edges between the borders of the two graphs > are inevitable. For vertices, it's quite natural to just take the union and > remove the duplicates. But for edges, a mergeEdges function seems > more reasonable. > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: > (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604633#comment-14604633 ] Juan Rodríguez Hortalá edited comment on SPARK-8337 at 6/28/15 6:16 PM: Hi, I have worked a bit on the OffsetRange way, you can access the code at https://github.com/juanrh/spark/commit/56fbd5c38bd30b825a7818f1c56abb1f8b2beaff. I have added the following method to pyspark KafkaUtils
{code}
@staticmethod
def getOffsetRanges(rdd):
    scalaRdd = rdd._jrdd.rdd()
    offsetRangesArray = scalaRdd.offsetRanges()
    return [OffsetRange(topic = offsetRange.topic(),
                        partition = offsetRange.partition(),
                        fromOffset = offsetRange.fromOffset(),
                        untilOffset = offsetRange.untilOffset())
            for offsetRange in offsetRangesArray]
{code}
This method is used in KafkaUtils.createDirectStreamJB, which is based on the original KafkaUtilsPythonHelper.createDirectStream. The main problem I have is that I don't know where to store the OffsetRange objects. The naive trick of adding them to the __dict__ of each python RDD object doesn't work, the new field is lost in the pyspark wrappers. So the new method createDirectStreamJB takes two additional options, one for performing an action on the OffsetRange list, and another for adding it to each record of the DStream
{code}
def createDirectStreamJB(ssc, topics, kafkaParams, fromOffsets={},
                         keyDecoder=utf8_decoder, valueDecoder=utf8_decoder,
                         offsetRangeForeach=None, addOffsetRange=False):
    """
    FIXME: temporary working placeholder

    :param offsetRangeForeach: if different to None, this function should be
        a function from a list of OffsetRange to None, and is applied to the
        OffsetRange list of each rdd
    :param addOffsetRange: if False (default) output records are of the shape
        (kafkaKey, kafkaValue); if True output records are of the shape
        (offsetRange, (kafkaKey, kafkaValue)) for offsetRange the OffsetRange
        value for the Spark partition for the record
    """
{code}
This is an example of using createDirectStreamJB:
{code}
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
topics = ["test"]
kafkaParams = {"metadata.broker.list" : "localhost:9092"}

def offsetRangeForeach(offsetRangeList):
    print
    print
    for offsetRange in offsetRangeList:
        print offsetRange
    print
    print

kafkaStream = KafkaUtils.createDirectStreamJB(ssc, topics, kafkaParams,
                                              offsetRangeForeach=offsetRangeForeach,
                                              addOffsetRange=True)
# OffsetRange printed as , I guess due to some kind of pyspark proxy
kafkaStrStream = kafkaStream.map(lambda (offRan, (k, v)) :
    str(offRan._fromOffset) + " " + str(offRan._untilOffset)
    + " " + str(k) + " " + str(v))
# kafkaStream.pprint()
kafkaStrStream.pprint()
ssc.start()
ssc.awaitTermination(timeout=5)
{code}
which gets the following output
{code}
15/06/28 12:36:03 INFO InputInfoTracker: remove old batch metadata: 1435487761000 ms

OffsetRange(topic=test, partition=0, fromOffset=178, untilOffset=179)

15/06/28 12:36:04 INFO JobScheduler: Added jobs for time 1435487764000 ms
...
15/06/28 12:36:04 INFO DAGScheduler: Job 4 finished: runJob at PythonRDD.scala:366, took 0,075387 s
---
Time: 2015-06-28 12:36:04
---
178 179 None hola
()
15/06/28 12:36:04 INFO JobScheduler: Finished job streaming job 1435487764000 ms.0 from job set of time 1435487764000 ms
...
15/06/28 12:36:05 INFO BlockManager: Removing RDD 12

OffsetRange(topic=test, partition=0, fromOffset=179, untilOffset=180)

15/06/28 12:36:06 INFO JobScheduler: Starting job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
15/06/28 12:36:06 INFO DAGScheduler: Job 6 finished: start at NativeMethodAccessorImpl.java:-2, took 0,077993 s
---
Time: 2015-06-28 12:36:06
---
179 180 None caracola
()
15/06/28 12:36:06 INFO JobScheduler: Finished job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
{code}
Any thoughts on this will be appreciated, in particular about a suitable place to store the list of OffsetRange objects Greetings, Juan was (Author: juanrh): Hi, I have worked a bit on the OffsetRange way, you can access the code at https://github.com/juanrh/spark/commit/56fbd5c38bd30b825a7818f1c56abb1f8b2beaff. I have added the following method to pyspark KafkaUtils @staticmethod def getOffsetRanges(rdd): scalaRdd = rdd._jrdd.rdd() offsetRangesArray = scalaRdd.offsetRanges() return [
[jira] [Assigned] (SPARK-8692) improve date and timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8692: --- Assignee: (was: Apache Spark) > improve date and timestamp handling > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8692) improve date and timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604773#comment-14604773 ] Apache Spark commented on SPARK-8692: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7073 > improve date and timestamp handling > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8692) improve date and timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8692: --- Assignee: Apache Spark > improve date and timestamp handling > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8692) improve date and timestamp handling
Wenchen Fan created SPARK-8692: -- Summary: improve date and timestamp handling Key: SPARK-8692 URL: https://issues.apache.org/jira/browse/SPARK-8692 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8645) Incorrect expression analysis with Hive
[ https://issues.apache.org/jira/browse/SPARK-8645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Zhulenev closed SPARK-8645. -- Resolution: Won't Fix > Incorrect expression analysis with Hive > --- > > Key: SPARK-8645 > URL: https://issues.apache.org/jira/browse/SPARK-8645 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: CDH 5.4.2 1.3.0 >Reporter: Eugene Zhulenev > Labels: dataframe > > When using a DataFrame backed by a Hive table, groupBy with agg can't resolve > columns if I pass them as Strings and not Columns. > This fails with: org.apache.spark.sql.AnalysisException: expression 'dt' is > neither present in the group by, nor is it an aggregate function. > {code} > val grouped = eventLogHLL > .groupBy(dt, ad_id, site_id).agg( > dt, > ad_id, > col(site_id) as site_id, > sum(imp_count) as imp_count, > sum(click_count) as click_count > ) > {code} > This works fine: > {code} > val grouped = eventLogHLL > .groupBy(col(dt), col(ad_id), col(site_id)).agg( > col(dt) as dt, > col(ad_id) as ad_id, > col(site_id) as site_id, > sum(imp_count) as imp_count, > sum(click_count) as click_count > ) > {code} > Integration tests running with "embedded" Spark and DataFrames generated from > RDDs work fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8691: --- Assignee: Apache Spark > Enable GZip for Web UI > -- > > Key: SPARK-8691 > URL: https://issues.apache.org/jira/browse/SPARK-8691 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Shixiong Zhu >Assignee: Apache Spark > > When the stage page contains a massive number of tasks (for example, when > running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. > Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
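The usual way to do this with an embedded Jetty server is to wrap the existing handler in a gzip wrapper, so responses are compressed whenever the client sends Accept-Encoding: gzip. This is a sketch only, not the actual pull request; GzipHandler's package differs across Jetty versions, and the import below assumes Jetty 9.3+:
{code}
import org.eclipse.jetty.server.Handler
import org.eclipse.jetty.server.handler.gzip.GzipHandler

// Wrap any handler so its responses are gzip-compressed on demand.
def withGzip(inner: Handler): Handler = {
  val gzip = new GzipHandler()
  gzip.setHandler(inner)
  gzip
}
{code}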
[jira] [Commented] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604733#comment-14604733 ] Apache Spark commented on SPARK-8691: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7072 > Enable GZip for Web UI > -- > > Key: SPARK-8691 > URL: https://issues.apache.org/jira/browse/SPARK-8691 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Shixiong Zhu > > When the stage page contains a massive number of tasks (for example, when > running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. > Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8691: --- Assignee: (was: Apache Spark) > Enable GZip for Web UI > -- > > Key: SPARK-8691 > URL: https://issues.apache.org/jira/browse/SPARK-8691 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Shixiong Zhu > > When the stage page contains a massive number of tasks (for example, when > running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. > Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8686) DataFrame should support `where` with expression represented by String
[ https://issues.apache.org/jira/browse/SPARK-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8686. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7063 [https://github.com/apache/spark/pull/7063] > DataFrame should support `where` with expression represented by String > -- > > Key: SPARK-8686 > URL: https://issues.apache.org/jira/browse/SPARK-8686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta >Priority: Minor > Fix For: 1.5.0 > > > DataFrame supports the `filter` function with two types of argument, `Column` > and `String`, but `where` doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
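With this change, the two calls below behave identically; previously only the first form accepted a String (illustrative, assuming a DataFrame {{people}} with an {{age}} column):
{code}
val adults1 = people.filter("age > 21")  // already supported
val adults2 = people.where("age > 21")   // supported after this change
{code}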
[jira] [Created] (SPARK-8691) Enable GZip for Web UI
Shixiong Zhu created SPARK-8691: --- Summary: Enable GZip for Web UI Key: SPARK-8691 URL: https://issues.apache.org/jira/browse/SPARK-8691 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu When the stage page contains a massive number of tasks (for example, when running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604724#comment-14604724 ] Davies Liu commented on SPARK-8636: --- [~animeshbaranawal] What happens if there is a null in the grouping key? Is a row with a null key equal to another row with a null key? > CaseKeyWhen has incorrect NULL handling > --- > > Key: SPARK-8636 > URL: https://issues.apache.org/jira/browse/SPARK-8636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Santiago M. Mola > Labels: starter > > The CaseKeyWhen implementation in Spark uses the following equality check: > {code} > private def equalNullSafe(l: Any, r: Any) = { > if (l == null && r == null) { > true > } else if (l == null || r == null) { > false > } else { > l == r > } > } > {code} > This is not correct, since in SQL, NULL is never equal to NULL (though it is > not unequal either). Consequently, a NULL value in a CASE WHEN expression > should never match. > For example, you can execute this in MySQL: > {code} > SELECT CASE NULL WHEN NULL THEN "NULL MATCHES" ELSE "NULL DOES NOT MATCH" END > FROM DUAL; > {code} > And the result will be "NULL DOES NOT MATCH". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
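Under SQL's three-valued logic, a comparison involving NULL never matches a WHEN branch, so the helper should reduce to something like the following sketch (illustrative, not the committed patch):
{code}
// A NULL key or a NULL WHEN value never matches any branch.
private def equalSqlSemantics(l: Any, r: Any): Boolean = {
  if (l == null || r == null) false else l == r
}
{code}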
[jira] [Assigned] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2017: --- Assignee: (was: Apache Spark) > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
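Alternative 1 above amounts to rendering one slice of the task table per request instead of the whole table. A toy sketch of the slicing (TaskRow is a stand-in type, not a Spark class):
{code}
case class TaskRow(id: Long, status: String)

// Return only the rows for the requested page.
def taskPage(tasks: Seq[TaskRow], page: Int, pageSize: Int = 100): Seq[TaskRow] =
  tasks.slice(page * pageSize, (page + 1) * pageSize)
{code}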
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604723#comment-14604723 ] Apache Spark commented on SPARK-2017: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7071 > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2017: --- Assignee: Apache Spark > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8610) Separate Row and InternalRow (part 2)
[ https://issues.apache.org/jira/browse/SPARK-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8610. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7003 [https://github.com/apache/spark/pull/7003] > Separate Row and InternalRow (part 2) > - > > Key: SPARK-8610 > URL: https://issues.apache.org/jira/browse/SPARK-8610 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.0 > > > Currently, we use GenericRow for both Row and InternalRow, which is confusing > because it can contain Scala types as well as Catalyst types. > We should have different implementations for them, to avoid some potential > bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
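Schematically, the separation means external rows carry Java/Scala values while internal rows carry Catalyst's own representations. The traits below illustrate the idea only; they are not the actual class hierarchy:
{code}
// External API row: holds e.g. java.sql.Timestamp, String.
trait Row { def get(i: Int): Any }

// Catalyst-internal row: holds e.g. Long (microseconds), UTF8String.
trait InternalRow { def get(i: Int): Any }
{code}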
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604642#comment-14604642 ] Vincent Warmerdam commented on SPARK-8684: -- I have an initial commit that I would like to try out that does the yum approach: https://github.com/koaning/spark-ec2/blob/rstudio-install/create_image.sh#L22-23 [~shivaram] what's the easiest way to test this? > Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1 -- however, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604600#comment-14604600 ] Vincent Warmerdam edited comment on SPARK-8684 at 6/28/15 11:26 AM: I'll double check. I've done this before but I recall it was always a bit messy. Do we prefer using yum or is installing from source also a possibility? What centos version will we use? I just worked up a quick server from digital ocean with centos 7 and this seems to work just fine:
```
[root@servy-server ~]# yum install -y epel-release
[root@servy-server ~]# yum update -y
[root@servy-server ~]# yum install -y R
[root@servy-server ~]# R

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
>
```
The downside of YUM is that it is not always up to date (the latest version is 3.2.1, not 3.2.0 which is what yum gives us). The yum version of R should allow almost all Rstudio packages to just go and work out of the box though, so it might not be the biggest issue. Depending on which version of CentOS we use, getting epel might be a problem. In the past this made it more practical to just go and install R from source. This is not too terrible, it can be done this way:
```
[root@servy-server ~]# wget http://cran.rstudio.com/src/base/R-3/R-3.2.1.tar.gz
[root@servy-server ~]# tar xvf R-3.2.1.tar.gz
[root@servy-server ~]# cd R-3.2.1
[root@servy-server ~]# ./configure --prefix=$HOME/R-3.2 --with-readline=no --with-x=no
[root@servy-server ~]# make && make install
```
The main downside is that this will take a fair amount of time. Another thing we might need to keep in mind is that R has many C++ dependencies so we may also need to install up to date compilers for it. was (Author: cantdutchthis): I'll double check. I've done this before but I recall it was always a bit messy. Do we prefer using yum or is installing from source also a possbility? What centos version will we use? I just worked up a quick server from digital ocean with centos 7 and this seems to work just fine: ``` [root@servy-server ~]# yum install -y epel-release [root@servy-server ~]# yum update -y [root@servy-server ~]# yum install -y R [root@servy-server ~]# R R version 3.2.0 (2015-04-16) -- "Full of Ingredients" Copyright (C) 2015 The R Foundation for Statistical Computing Platform: x86_64-redhat-linux-gnu (64-bit) > ``` The downside of YUM is that it is not always up to date (the latest version is 3.2.1). This version of R should allow almost all Rstudio packages to just go and work out of the box. Depending of which version of CentOS we use, getting epel might be a problem. In the past this made it more practical to just go and install R from source. This is not too terrible, it can be done this way: ``` [root@servy-server ~]# wget http://cran.rstudio.com/src/base/R-3/R-3.2.1.tar.gz [root@servy-server ~]# tar xvf R-3.2.1.tar.gz [root@servy-server ~]# cd R-3.2.1 [root@servy-server ~]# ./configure --prefix=$HOME/R-3.2 --with-readline=no --with-x=no [root@servy-server ~]# make && make install ``` The main downside is that this will take a fair amount of time. 
> Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1 -- however, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604633#comment-14604633 ] Juan Rodríguez Hortalá commented on SPARK-8337: ---

Hi, I have worked a bit on the OffsetRange approach; you can access the code at https://github.com/juanrh/spark/commit/56fbd5c38bd30b825a7818f1c56abb1f8b2beaff. I have added the following method to pyspark's KafkaUtils:

{code}
@staticmethod
def getOffsetRanges(rdd):
    scalaRdd = rdd._jrdd.rdd()
    offsetRangesArray = scalaRdd.offsetRanges()
    return [OffsetRange(topic=offsetRange.topic(),
                        partition=offsetRange.partition(),
                        fromOffset=offsetRange.fromOffset(),
                        untilOffset=offsetRange.untilOffset())
            for offsetRange in offsetRangesArray]
{code}

This method is used in KafkaUtils.createDirectStreamJB, which is based on the original KafkaUtilsPythonHelper.createDirectStream. The main problem I have is that I don't know where to store the OffsetRange objects: the naive trick of adding them to the __dict__ of each Python RDD object doesn't work, because the new field is lost in the pyspark wrappers. So the new method createDirectStreamJB takes two additional options, one for performing an action on the OffsetRange list, and another for adding it to each record of the DStream:

{code}
def createDirectStreamJB(ssc, topics, kafkaParams, fromOffsets={},
                         keyDecoder=utf8_decoder, valueDecoder=utf8_decoder,
                         offsetRangeForeach=None, addOffsetRange=False):
    """
    FIXME: temporary working placeholder

    :param offsetRangeForeach: if not None, this should be a function from a
        list of OffsetRange to None; it is applied to the OffsetRange list of
        each rdd
    :param addOffsetRange: if False (default), output records have the shape
        (kafkaKey, kafkaValue); if True, output records have the shape
        (offsetRange, (kafkaKey, kafkaValue)), where offsetRange is the
        OffsetRange value for the Spark partition of the record
    """
{code}

This is an example of using createDirectStreamJB:

{code}
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
topics = ["test"]
kafkaParams = {"metadata.broker.list": "localhost:9092"}

def offsetRangeForeach(offsetRangeList):
    print
    print
    for offsetRange in offsetRangeList:
        print offsetRange
    print
    print

kafkaStream = KafkaUtils.createDirectStreamJB(ssc, topics, kafkaParams,
                                              offsetRangeForeach=offsetRangeForeach,
                                              addOffsetRange=True)
# OffsetRange printed as , I guess due to some kind of pyspark proxy
kafkaStrStream = kafkaStream.map(lambda (offRan, (k, v)):
    str(offRan._fromOffset) + " " + str(offRan._untilOffset) + " "
    + str(k) + " " + str(v))
# kafkaStream.pprint()
kafkaStrStream.pprint()
ssc.start()
ssc.awaitTermination(timeout=5)
{code}

which gets the following output:

{code}
15/06/28 12:36:03 INFO InputInfoTracker: remove old batch metadata: 1435487761000 ms

OffsetRange(topic=test, partition=0, fromOffset=178, untilOffset=179)

15/06/28 12:36:04 INFO JobScheduler: Added jobs for time 1435487764000 ms
...
15/06/28 12:36:04 INFO DAGScheduler: Job 4 finished: runJob at PythonRDD.scala:366, took 0,075387 s
---
Time: 2015-06-28 12:36:04
---
178 179 None hola
()

15/06/28 12:36:04 INFO JobScheduler: Finished job streaming job 1435487764000 ms.0 from job set of time 1435487764000 ms
...
15/06/28 12:36:05 INFO BlockManager: Removing RDD 12

OffsetRange(topic=test, partition=0, fromOffset=179, untilOffset=180)

15/06/28 12:36:06 INFO JobScheduler: Starting job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
15/06/28 12:36:06 INFO DAGScheduler: Job 6 finished: start at NativeMethodAccessorImpl.java:-2, took 0,077993 s
---
Time: 2015-06-28 12:36:06
---
179 180 None caracola
()

15/06/28 12:36:06 INFO JobScheduler: Finished job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
{code}

Any thoughts on this will be appreciated, in particular about a suitable place to store the list of OffsetRange objects. Greetings, Juan

> KafkaUtils.createDirectStream for python is lacking API/feature parity with > the Scala/Java version > -- > > Key: SPARK-8337 > URL: https://issues.apache.org/jira/browse/SPARK-8337 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Amit Ramesh >Priority: Critical > > See the following thread for context. > htt
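As a hedged sketch of consuming the (offsetRange, (key, value)) record shape that the proposal produces with addOffsetRange=True: createDirectStreamJB and the OffsetRange fields come from Juan's patch linked above and are not released pyspark API, and ssc is the StreamingContext from the example. A named function also avoids the Python-2-only tuple-unpacking lambda used in the example.

{code}
from pyspark.streaming.kafka import KafkaUtils

def formatRecord(record):
    # Record shape per the proposed addOffsetRange=True: (offsetRange, (key, value)).
    offsetRange, (key, value) = record
    return "%s %s %s %s" % (offsetRange._fromOffset, offsetRange._untilOffset,
                            key, value)

# ssc is the StreamingContext built in the example above.
kafkaStream = KafkaUtils.createDirectStreamJB(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"},
    addOffsetRange=True)
kafkaStream.map(formatRecord).pprint()
{code}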
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604632#comment-14604632 ] Littlestar commented on SPARK-5281: --- Has this patch been merged to the Spark 1.3 branch? Thanks. > Registering table on RDD is giving MissingRequirementError > -- > > Key: SPARK-5281 > URL: https://issues.apache.org/jira/browse/SPARK-5281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.1 >Reporter: sarsol >Assignee: Iulian Dragos >Priority: Critical > Fix For: 1.4.0 > > > Application crashes on this line {{rdd.registerTempTable("temp")}} in version 1.2 > when using sbt or Eclipse SCALA IDE
> Stacktrace:
> {code}
> Exception in thread "main" scala.reflect.internal.MissingRequirementError:
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with
> primordial classloader with boot classpath
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
> at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
> at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
> at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
> at scala.reflect.api.Universe.typeOf(Universe.scala:59)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
> at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
> at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
> at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
> at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.App$class.main(App.scala:71)
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604631#comment-14604631 ] Apache Spark commented on SPARK-8690: - User 'thegiive' has created a pull request for this issue: https://github.com/apache/spark/pull/7070 > Add a setting to disable SparkSQL parquet schema merge by using datasource > API > --- > > Key: SPARK-8690 > URL: https://issues.apache.org/jira/browse/SPARK-8690 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 > Environment: all >Reporter: thegiive >Priority: Minor > > We need a general config to disable the parquet schema merge feature. > Our sparkSQL application requirements are: > # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't > want to increase parquet read time too much. Across around 2000 parquet files, the > schema is the same, so we don't need the schema merge feature. > # We need to use the datasource API's features like partition discovery, so we > cannot use Spark 1.2 or a previous version. > # We have a lot of SparkSQL products. We use > *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to > change the application code. One setting to disable this feature is what we > want. > In 1.4, we have several methods, but none of them perfectly matches our use > case: > # Setting spark.sql.parquet.useDataSourceApi to false matches requirements > 1 and 3, but it uses the old parquet API and fails requirement 2. > # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> > "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we > use for parquet loading. > # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default > parquet code path increases the load time from 1~5 sec to 100 sec, which > fails requirement 1. > # The PR 5231 config does not help, because it cannot disable schema merge. > I think it is better to use a config to disable the datasource API's schema merge > feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8690: --- Assignee: (was: Apache Spark) > Add a setting to disable SparkSQL parquet schema merge by using datasource > API > --- > > Key: SPARK-8690 > URL: https://issues.apache.org/jira/browse/SPARK-8690 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 > Environment: all >Reporter: thegiive >Priority: Minor > > We need a general config to disable the parquet schema merge feature. > Our sparkSQL application requirements are: > # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't > want to increase parquet read time too much. Across around 2000 parquet files, the > schema is the same, so we don't need the schema merge feature. > # We need to use the datasource API's features like partition discovery, so we > cannot use Spark 1.2 or a previous version. > # We have a lot of SparkSQL products. We use > *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to > change the application code. One setting to disable this feature is what we > want. > In 1.4, we have several methods, but none of them perfectly matches our use > case: > # Setting spark.sql.parquet.useDataSourceApi to false matches requirements > 1 and 3, but it uses the old parquet API and fails requirement 2. > # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> > "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we > use for parquet loading. > # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default > parquet code path increases the load time from 1~5 sec to 100 sec, which > fails requirement 1. > # The PR 5231 config does not help, because it cannot disable schema merge. > I think it is better to use a config to disable the datasource API's schema merge > feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8690: --- Assignee: Apache Spark > Add a setting to disable SparkSQL parquet schema merge by using datasource > API > --- > > Key: SPARK-8690 > URL: https://issues.apache.org/jira/browse/SPARK-8690 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 > Environment: all >Reporter: thegiive >Assignee: Apache Spark >Priority: Minor > > We need a general config to disable the parquet schema merge feature. > Our sparkSQL application requirements are: > # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't > want to increase parquet read time too much. Across around 2000 parquet files, the > schema is the same, so we don't need the schema merge feature. > # We need to use the datasource API's features like partition discovery, so we > cannot use Spark 1.2 or a previous version. > # We have a lot of SparkSQL products. We use > *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to > change the application code. One setting to disable this feature is what we > want. > In 1.4, we have several methods, but none of them perfectly matches our use > case: > # Setting spark.sql.parquet.useDataSourceApi to false matches requirements > 1 and 3, but it uses the old parquet API and fails requirement 2. > # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> > "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we > use for parquet loading. > # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default > parquet code path increases the load time from 1~5 sec to 100 sec, which > fails requirement 1. > # The PR 5231 config does not help, because it cannot disable schema merge. > I think it is better to use a config to disable the datasource API's schema merge > feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
thegiive created SPARK-8690: --- Summary: Add a setting to disable SparkSQL parquet schema merge by using datasource API Key: SPARK-8690 URL: https://issues.apache.org/jira/browse/SPARK-8690 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Environment: all Reporter: thegiive Priority: Minor We need a general config to disable the parquet schema merge feature. Our sparkSQL application requirements are: # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't want to increase parquet read time too much. Across around 2000 parquet files, the schema is the same, so we don't need the schema merge feature. # We need to use the datasource API's features like partition discovery, so we cannot use Spark 1.2 or a previous version. # We have a lot of SparkSQL products. We use *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to change the application code. One setting to disable this feature is what we want. In 1.4, we have several methods, but none of them perfectly matches our use case: # Setting spark.sql.parquet.useDataSourceApi to false matches requirements 1 and 3, but it uses the old parquet API and fails requirement 2. # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we use for parquet loading. # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default parquet code path increases the load time from 1~5 sec to 100 sec, which fails requirement 1. # The PR 5231 config does not help, because it cannot disable schema merge. I think it is better to use a config to disable the datasource API's schema merge feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
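For readers following the discussion, the second workaround listed in the description can also be written against Spark 1.4's Python API. This is a minimal sketch, assuming the parquet source honors the mergeSchema option as the reporter states; hdfs:///data/events is a hypothetical path.

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-no-merge")
sqlContext = SQLContext(sc)

# Python counterpart of sqlContext.load("parquet", Map("path" -> ..., "mergeSchema" -> "false"));
# extra keyword arguments are passed through to the data source as options.
df = sqlContext.load(path="hdfs:///data/events",
                     source="parquet",
                     mergeSchema="false")
df.printSchema()
{code}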
[jira] [Updated] (SPARK-8679) remove dummy java class BaseRow and BaseMutableRow
[ https://issues.apache.org/jira/browse/SPARK-8679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8679: - Priority: Minor (was: Major) Component/s: SQL Issue Type: Improvement (was: Bug) [~cloud_fan] Let's assign component, priority and type please > remove dummy java class BaseRow and BaseMutableRow > -- > > Key: SPARK-8679 > URL: https://issues.apache.org/jira/browse/SPARK-8679 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8680: - Component/s: SQL > PropagateTypes is very slow when there are lots of columns > -- > > Key: SPARK-8680 > URL: https://issues.apache.org/jira/browse/SPARK-8680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.0 >Reporter: Davies Liu > > The time for PropagateTypes is O(N*N), where N is the number of columns, which is > very slow if there are many columns (>1000). > The easiest optimization would be to move `q.inputSet` outside of > transformExpressions, which gives about a 4x improvement for N=3000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
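The fix proposed in the description is the classic hoisting of a loop-invariant computation out of a per-expression pass. A minimal, hedged Python sketch of that idea follows; all names here are hypothetical stand-ins, not Catalyst code.

{code}
# Stand-in for q.inputSet: O(N) in the number of columns.
def compute_input_set(columns):
    return set(columns)

def propagate_types_slow(columns):
    out = []
    for c in columns:                            # N iterations
        input_set = compute_input_set(columns)   # O(N) each time -> O(N^2) total
        out.append(c in input_set)
    return out

def propagate_types_fast(columns):
    input_set = compute_input_set(columns)       # hoisted: computed once, O(N)
    return [c in input_set for c in columns]

cols = ["c%d" % i for i in range(3000)]
assert propagate_types_slow(cols) == propagate_types_fast(cols)
{code}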
[jira] [Updated] (SPARK-8645) Incorrect expression analysis with Hive
[ https://issues.apache.org/jira/browse/SPARK-8645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8645: - Component/s: SQL Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Set component, for example > Incorrect expression analysis with Hive > --- > > Key: SPARK-8645 > URL: https://issues.apache.org/jira/browse/SPARK-8645 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: CDH 5.4.2 1.3.0 >Reporter: Eugene Zhulenev > Labels: dataframe > > When using a DataFrame backed by a Hive table, groupBy with agg can't resolve > columns if I pass them as Strings and not Columns. > This fails with: org.apache.spark.sql.AnalysisException: expression 'dt' is > neither present in the group by, nor is it an aggregate function.
> {code}
> val grouped = eventLogHLL
>   .groupBy(dt, ad_id, site_id).agg(
>     dt,
>     ad_id,
>     col(site_id) as site_id,
>     sum(imp_count) as imp_count,
>     sum(click_count) as click_count
>   )
> {code}
> This works fine:
> {code}
> val grouped = eventLogHLL
>   .groupBy(col(dt), col(ad_id), col(site_id)).agg(
>     col(dt) as dt,
>     col(ad_id) as ad_id,
>     col(site_id) as site_id,
>     sum(imp_count) as imp_count,
>     sum(click_count) as click_count
>   )
> {code}
> Integration tests running with "embedded" Spark and DataFrames generated from > an RDD work fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8635) improve performance of CatalystTypeConverters
[ https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8635: - Assignee: Wenchen Fan > improve performance of CatalystTypeConverters > - > > Key: SPARK-8635 > URL: https://issues.apache.org/jira/browse/SPARK-8635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8237) misc function: sha2
[ https://issues.apache.org/jira/browse/SPARK-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8237: - Assignee: Liang-Chi Hsieh > misc function: sha2 > --- > > Key: SPARK-8237 > URL: https://issues.apache.org/jira/browse/SPARK-8237 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Liang-Chi Hsieh > Fix For: 1.5.0 > > > sha2(string/binary, int): string > Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and > SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be > hashed. The second argument indicates the desired bit length of the result, > which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to > 256). SHA-224 is supported starting from Java 8. If either argument is NULL > or the hash length is not one of the permitted values, the return value is > NULL. Example: sha2('ABC', 256) = > 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
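Since the description above fixes exact semantics, here is a hedged reference model in plain Python using hashlib. It mirrors the spec (including 0 meaning 256 and NULL for unsupported lengths) and reproduces the quoted example, but it is of course not Spark's implementation.

{code}
import hashlib

# Reference model of the sha2 spec above; returns None where SQL returns NULL.
_DIGESTS = {224: hashlib.sha224, 256: hashlib.sha256,
            384: hashlib.sha384, 512: hashlib.sha512}

def sha2(data, bit_length):
    if data is None or bit_length is None:
        return None
    fn = _DIGESTS.get(256 if bit_length == 0 else bit_length)
    return fn(data).hexdigest() if fn else None

print(sha2(b'ABC', 256))
# b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
{code}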
[jira] [Updated] (SPARK-7088) [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans
[ https://issues.apache.org/jira/browse/SPARK-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7088: - Assignee: Santiago M. Mola > [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans > - > > Key: SPARK-7088 > URL: https://issues.apache.org/jira/browse/SPARK-7088 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Santiago M. Mola >Assignee: Santiago M. Mola >Priority: Critical > Labels: regression > Fix For: 1.5.0 > > > We're using some custom logical plans. We are now migrating from Spark 1.3.0 > to 1.3.1 and found a few incompatible API changes. All of them seem to be in > internal code, so we understand that. But now the ResolveReferences rule, > which used to work with third-party logical plans, just does not work, with > no workaround that I'm aware of other than copying the ResolveReferences > rule and using it with our own fix. > The change in question is this section of code:
> {code}
> }.headOption.getOrElse { // Only handle first case, others will be fixed on the next pass.
>   sys.error(
>     s"""
>       |Failure when resolving conflicting references in Join:
>       |$plan
>       |
>       |Conflicting attributes: ${conflictingAttributes.mkString(",")}
>     """.stripMargin)
> }
> {code}
> Which causes the following error on analysis:
> {code}
> Failure when resolving conflicting references in Join:
> 'Project ['l.name,'r.name,'FUNC1('l.node,'r.node) AS c2#37,'FUNC2('l.node,'r.node) AS c3#38,'FUNC3('r.node,'l.node) AS c4#39]
> 'Join Inner, None
>  Subquery l
>   Subquery h
>    Project [name#12,node#36]
>     CustomPlan H, u, (p#13L = s#14L), [ord#15 ASC], IS NULL p#13L, node#36
>      Subquery v
>       Subquery h_src
>        LogicalRDD [name#12,p#13L,s#14L,ord#15], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37
>  Subquery r
>   Subquery h
>    Project [name#40,node#36]
>     CustomPlan H, u, (p#41L = s#42L), [ord#43 ASC], IS NULL pred#41L, node#36
>      Subquery v
>       Subquery h_src
>        LogicalRDD [name#40,p#41L,s#42L,ord#43], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6777) Implement backwards-compatibility rules in Parquet schema converters
[ https://issues.apache.org/jira/browse/SPARK-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6777: - Assignee: Cheng Lian > Implement backwards-compatibility rules in Parquet schema converters > > > Key: SPARK-6777 > URL: https://issues.apache.org/jira/browse/SPARK-6777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.5.0 > > > When converting Parquet schemas to/from Spark SQL schemas, we should > recognize commonly used legacy non-standard representation of complex types. > We can follow the pattern used in Parquet's {{AvroSchemaConverter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6777) Implement backwards-compatibility rules in Parquet schema converters
[ https://issues.apache.org/jira/browse/SPARK-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604619#comment-14604619 ] Sean Owen commented on SPARK-6777: -- [~lian cheng] can you please assign JIRAs when you resolve them, even if it's to yourself? That's a necessary final step, and a lot of yours miss that > Implement backwards-compatibility rules in Parquet schema converters > > > Key: SPARK-6777 > URL: https://issues.apache.org/jira/browse/SPARK-6777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0 >Reporter: Cheng Lian >Priority: Critical > Fix For: 1.5.0 > > > When converting Parquet schemas to/from Spark SQL schemas, we should > recognize commonly used legacy non-standard representation of complex types. > We can follow the pattern used in Parquet's {{AvroSchemaConverter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5482) Allow individual test suites in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5482: - Assignee: Josh Rosen > Allow individual test suites in python/run-tests > > > Key: SPARK-5482 > URL: https://issues.apache.org/jira/browse/SPARK-5482 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Katsunori Kanda >Assignee: Josh Rosen >Priority: Minor > Fix For: 1.5.0 > > > Add options to run individual test suites in python/run-tests. The usage is > as follows: > ./python/run-tests \[core|sql|mllib|ml|streaming\] > When no suite is selected, all test suites are run for backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8603) In Windows,Not able to create a Spark context from R studio
[ https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8603: - Target Version/s: (was: 1.4.0) Fix Version/s: (was: 1.4.0) [~Prakashpc] Before opening a JIRA, you need to read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You do not set Fix or Target version, for example > In Windows, not able to create a Spark context from RStudio > > > Key: SPARK-8603 > URL: https://issues.apache.org/jira/browse/SPARK-8603 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 1.4.0 > Environment: Windows, RStudio >Reporter: Prakash Ponshankaarchinnusamy > Original Estimate: 0.5m > Remaining Estimate: 0.5m > > On Windows, creation of a Spark context fails using the code below from RStudio:
> Sys.setenv(SPARK_HOME="C:\\spark\\spark-1.4.0")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init(master="spark://localhost:7077", appName="SparkR")
> Error: JVM is not ready after 10 seconds
> Reason: Wrong file path computed in client.R. The file separator for Windows ["\"] > is not respected by the "file.path" function by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8688: --- Assignee: (was: Apache Spark) > Hadoop Configuration has to disable client cache when writing or reading > delegation tokens. > --- > > Key: SPARK-8688 > URL: https://issues.apache.org/jira/browse/SPARK-8688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: SaintBacchus > > In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, > Spark will write and read the credentials. > But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a > cached FileSystem (which will use the old token) to upload or download files. > Then, when the old token is expired, it can't get authorization to read from or > write to HDFS. > (I only tested over a very short time with the configuration: > dfs.namenode.delegation.token.renew-interval=3min > dfs.namenode.delegation.token.max-lifetime=10min > I'm not sure whether it matters. > ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604613#comment-14604613 ] Apache Spark commented on SPARK-8688: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/7069 > Hadoop Configuration has to disable client cache when writing or reading > delegation tokens. > --- > > Key: SPARK-8688 > URL: https://issues.apache.org/jira/browse/SPARK-8688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: SaintBacchus > > In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, > Spark will write and read the credentials. > But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a > cached FileSystem (which will use the old token) to upload or download files. > Then, when the old token is expired, it can't get authorization to read from or > write to HDFS. > (I only tested over a very short time with the configuration: > dfs.namenode.delegation.token.renew-interval=3min > dfs.namenode.delegation.token.max-lifetime=10min > I'm not sure whether it matters. > ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8688: --- Assignee: Apache Spark > Hadoop Configuration has to disable client cache when writing or reading > delegation tokens. > --- > > Key: SPARK-8688 > URL: https://issues.apache.org/jira/browse/SPARK-8688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: SaintBacchus >Assignee: Apache Spark > > In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, > Spark will write and read the credentials. > But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a > cached FileSystem (which will use the old token) to upload or download files. > Then, when the old token is expired, it can't get authorization to read from or > write to HDFS. > (I only tested over a very short time with the configuration: > dfs.namenode.delegation.token.renew-interval=3min > dfs.namenode.delegation.token.max-lifetime=10min > I'm not sure whether it matters. > ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
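For users hitting this before the fix lands, here is a hedged sketch of the configuration-level workaround implied by the description: Spark forwards any spark.hadoop.* property into the Hadoop Configuration, so the HDFS client cache can be disabled per application. Whether this alone resolves the token-expiry behavior is exactly what this issue is investigating; the app name below is hypothetical.

{code}
from pyspark import SparkConf, SparkContext

# spark.hadoop.* properties are copied into the Hadoop Configuration,
# so this sets fs.hdfs.impl.disable.cache=true for the application.
conf = (SparkConf()
        .setAppName("token-renewal-test")
        .set("spark.hadoop.fs.hdfs.impl.disable.cache", "true"))
sc = SparkContext(conf=conf)
{code}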
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604600#comment-14604600 ] Vincent Warmerdam edited comment on SPARK-8684 at 6/28/15 9:52 AM: ---

I'll double check. I've done this before, but I recall it was always a bit messy. Do we prefer using yum, or is installing from source also a possibility? Which CentOS version will we use? I just spun up a quick server on DigitalOcean with CentOS 7, and this seems to work just fine:

```
[root@servy-server ~]# yum install -y epel-release
[root@servy-server ~]# yum update -y
[root@servy-server ~]# yum install -y R
[root@servy-server ~]# R

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
>
```

The downside of yum is that it is not always up to date (the latest version is 3.2.1). This version of R should allow almost all RStudio packages to just work out of the box. Depending on which version of CentOS we use, getting EPEL might be a problem. In the past this made it more practical to just install R from source. This is not too terrible; it can be done this way:

```
[root@servy-server ~]# wget http://cran.rstudio.com/src/base/R-3/R-3.2.1.tar.gz
[root@servy-server ~]# tar xvf R-3.2.1.tar.gz
[root@servy-server ~]# cd R-3.2.1
[root@servy-server ~]# ./configure --prefix=$HOME/R-3.2 --with-readline=no --with-x=no
[root@servy-server ~]# make && make install
```

The main downside is that this will take a fair amount of time.

was (Author: cantdutchthis): I'll double check. I've done this before, but I recall it was always a bit messy. Do we prefer using yum, or is installing from source also a possibility?

> Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1. However, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604600#comment-14604600 ] Vincent Warmerdam commented on SPARK-8684: -- I'll double check. I've done this before, but I recall it was always a bit messy. Do we prefer using yum, or is installing from source also a possibility? > Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1. However, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8596: --- Assignee: (was: Apache Spark) > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8596: --- Assignee: Apache Spark > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Assignee: Apache Spark > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604588#comment-14604588 ] Apache Spark commented on SPARK-8596: - User 'koaning' has created a pull request for this issue: https://github.com/apache/spark/pull/7068 > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604587#comment-14604587 ] Vincent Warmerdam edited comment on SPARK-8596 at 6/28/15 8:49 AM: --- 1. on it. just created [SPARK-8596][EC2] Added port for Rstudio 2. on it was (Author: cantdutchthis): 1. on it. > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604587#comment-14604587 ] Vincent Warmerdam commented on SPARK-8596: -- 1. on it. > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8689) Add get* methods with fieldName to Row
[ https://issues.apache.org/jira/browse/SPARK-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8689: --- Assignee: (was: Apache Spark) > Add get* methods with fieldName to Row > -- > > Key: SPARK-8689 > URL: https://issues.apache.org/jira/browse/SPARK-8689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta > > `Row` has methods to get a field value by index, but no methods to get a > field value by field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8689) Add get* methods with fieldName to Row
[ https://issues.apache.org/jira/browse/SPARK-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604586#comment-14604586 ] Apache Spark commented on SPARK-8689: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/7067 > Add get* methods with fieldName to Row > -- > > Key: SPARK-8689 > URL: https://issues.apache.org/jira/browse/SPARK-8689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta > > `Row` has methods to get a field value by index, but no methods to get a > field value by field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8689) Add get* methods with fieldName to Row
[ https://issues.apache.org/jira/browse/SPARK-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8689: --- Assignee: Apache Spark > Add get* methods with fieldName to Row > -- > > Key: SPARK-8689 > URL: https://issues.apache.org/jira/browse/SPARK-8689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > `Row` has methods to get a field value by index, but no methods to get a > field value by field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
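For comparison, pyspark's Row already offers the kind of name-based access this issue proposes for the Scala/Java Row. A minimal sketch (the field names and values are hypothetical):

{code}
from pyspark.sql import Row

# Rows built with keyword arguments support attribute access by field name,
# which is the convenience SPARK-8689 asks for on the JVM side.
r = Row(name="Alice", age=11)
print(r.name)  # Alice
print(r.age)   # 11
{code}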