[jira] [Assigned] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
[ https://issues.apache.org/jira/browse/SPARK-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8698: --- Assignee: Apache Spark (was: Reynold Xin) > partitionBy in Python DataFrame reader/writer interface should not default to > empty tuple > - > > Key: SPARK-8698 > URL: https://issues.apache.org/jira/browse/SPARK-8698 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
[ https://issues.apache.org/jira/browse/SPARK-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8698: --- Assignee: Reynold Xin (was: Apache Spark) > partitionBy in Python DataFrame reader/writer interface should not default to > empty tuple > - > > Key: SPARK-8698 > URL: https://issues.apache.org/jira/browse/SPARK-8698 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
[ https://issues.apache.org/jira/browse/SPARK-8698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605203#comment-14605203 ] Apache Spark commented on SPARK-8698: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7079 > partitionBy in Python DataFrame reader/writer interface should not default to > empty tuple > - > > Key: SPARK-8698 > URL: https://issues.apache.org/jira/browse/SPARK-8698 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8698) partitionBy in Python DataFrame reader/writer interface should not default to empty tuple
Reynold Xin created SPARK-8698: -- Summary: partitionBy in Python DataFrame reader/writer interface should not default to empty tuple Key: SPARK-8698 URL: https://issues.apache.org/jira/browse/SPARK-8698 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605200#comment-14605200 ] Alok Singh commented on SPARK-5571: --- Just wanted to get more clarification on this. Does this JIRA expect all the components, i.e. tokenizer -> stemmer -> stopword -> runWithPrunedBagOfWords? Or do we assume that the input is already tokenized, stemmed, and stopword-removed? > LDA should handle text as well > -- > > Key: SPARK-5571 > URL: https://issues.apache.org/jira/browse/SPARK-5571 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Latent Dirichlet Allocation (LDA) currently operates only on vectors of word > counts. It should also support training and prediction using text > (Strings). > This plan is sketched in the [original LDA design > doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. > There should be: > * a runWithText() method which takes an RDD with a collection of Strings (bags > of words). This will also index terms and compute a dictionary. > * a dictionary parameter for when LDA is run with word-count vectors > * prediction/feedback methods returning Strings (such as > describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
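A hedged sketch of the proposed runWithText() entry point described above; the method name comes from the issue, while the term indexing and count-vector construction are illustrative assumptions about how the dictionary might be computed:

{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Illustrative sketch: index terms, build a dictionary, and convert each
// document (a bag of words) into the word-count vector LDA already accepts.
def runWithText(docs: RDD[Seq[String]], k: Int) = {
  val dictionary = docs.flatMap(identity).distinct().collect().zipWithIndex.toMap
  val corpus = docs.zipWithIndex().map { case (words, docId) =>
    val counts = words.groupBy(dictionary).mapValues(_.size.toDouble).toSeq
    (docId, Vectors.sparse(dictionary.size, counts))
  }
  new LDA().setK(k).run(corpus)
}
{code}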
[jira] [Assigned] (SPARK-8355) Python DataFrameReader/Writer should mirror Scala
[ https://issues.apache.org/jira/browse/SPARK-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8355: --- Assignee: (was: Apache Spark) > Python DataFrameReader/Writer should mirror Scala > - > > Key: SPARK-8355 > URL: https://issues.apache.org/jira/browse/SPARK-8355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > All the functions that I can run in Scala should also work in Python. At > least {{ctx.read.option}} is missing, but we should also audit to make sure > there aren't others. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8355) Python DataFrameReader/Writer should mirror Scala
[ https://issues.apache.org/jira/browse/SPARK-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605185#comment-14605185 ] Apache Spark commented on SPARK-8355: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/7078 > Python DataFrameReader/Writer should mirror Scala > - > > Key: SPARK-8355 > URL: https://issues.apache.org/jira/browse/SPARK-8355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > All the functions that I can run in Scala should also work in Python. At > least {{ctx.read.option}} is missing, but we should also audit to make sure > there aren't others. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8355) Python DataFrameReader/Writer should mirror Scala
[ https://issues.apache.org/jira/browse/SPARK-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8355: --- Assignee: Apache Spark > Python DataFrameReader/Writer should mirror Scala > - > > Key: SPARK-8355 > URL: https://issues.apache.org/jira/browse/SPARK-8355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > > All the functions that I can run in Scala should also work in Python. At > least {{ctx.read.option}} is missing, but we should also audit to make sure > there aren't others. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
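For reference, a minimal sketch of the Scala call being mirrored, assuming a spark-shell sqlContext; the format, option key, and path are illustrative values. The option call is the one the issue flags as missing from Python:

{code}
// Scala DataFrameReader chain that the Python API should mirror.
val df = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.5")   // illustrative option
  .load("/tmp/people.json")         // illustrative path
{code}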
[jira] [Updated] (SPARK-5562) LDA should handle empty documents
[ https://issues.apache.org/jira/browse/SPARK-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5562: - Shepherd: Joseph K. Bradley > LDA should handle empty documents > - > > Key: SPARK-5562 > URL: https://issues.apache.org/jira/browse/SPARK-5562 > Project: Spark > Issue Type: Test > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > Labels: starter > Original Estimate: 96h > Remaining Estimate: 96h > > Latent Dirichlet Allocation (LDA) could easily be given empty documents when > people select a small vocabulary. We should check to make sure it is robust > to empty documents. > This will hopefully take the form of a unit test, but may require modifying > the LDA implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
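A hedged sketch of what such a unit test might look like (the vocabulary size and k are illustrative, and sc is assumed to be a test SparkContext); the point is simply that an all-zero count vector should not break training:

{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// One normal document and one empty document over a 4-term vocabulary.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 1.0)),
  (1L, Vectors.sparse(4, Seq.empty[(Int, Double)]))))  // empty document

// The test would assert this runs without error and yields k valid topics.
val model = new LDA().setK(2).run(corpus)
{code}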
[jira] [Commented] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605177#comment-14605177 ] Apache Spark commented on SPARK-8592: - User 'xuchenCN' has created a pull request for this issue: https://github.com/apache/spark/pull/7077 > CoarseGrainedExecutorBackend: Cannot register with driver => NPE > > > Key: SPARK-8592 > URL: https://issues.apache.org/jira/browse/SPARK-8592 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Scala 2.11, Java 8, >Reporter: Sjoerd Mulder >Priority: Minor > > I cannot reproduce this consistently but when submitting jobs just after > another finished it will not come up: > {code} > 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with > driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler > java.lang.NullPointerException > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) > at java.lang.String.valueOf(String.java:2982) > at > scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at org.apache.spark.Logging$class.logInfo(Logging.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at 
akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8592: --- Assignee: (was: Apache Spark) > CoarseGrainedExecutorBackend: Cannot register with driver => NPE > > > Key: SPARK-8592 > URL: https://issues.apache.org/jira/browse/SPARK-8592 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Scala 2.11, Java 8, >Reporter: Sjoerd Mulder >Priority: Minor > > I cannot reproduce this consistently but when submitting jobs just after > another finished it will not come up: > {code} > 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with > driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler > java.lang.NullPointerException > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) > at java.lang.String.valueOf(String.java:2982) > at > scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at org.apache.spark.Logging$class.logInfo(Logging.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at 
akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8592: --- Assignee: Apache Spark > CoarseGrainedExecutorBackend: Cannot register with driver => NPE > > > Key: SPARK-8592 > URL: https://issues.apache.org/jira/browse/SPARK-8592 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Scala 2.11, Java 8, >Reporter: Sjoerd Mulder >Assignee: Apache Spark >Priority: Minor > > I cannot reproduce this consistently but when submitting jobs just after > another finished it will not come up: > {code} > 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to > akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker > 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with > driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler > java.lang.NullPointerException > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) > at > org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) > at java.lang.String.valueOf(String.java:2982) > at > scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) > at org.apache.spark.Logging$class.logInfo(Logging.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) > at > org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) > at > org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > at > org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at 
akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8697) MatchIterator not serializable exception in RegexTokenizer
Xiangrui Meng created SPARK-8697: Summary: MatchIterator not serializable exception in RegexTokenizer Key: SPARK-8697 URL: https://issues.apache.org/jira/browse/SPARK-8697 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor I'm not sure whether this is a real bug or not. In the REPL, I saw a MatchIterator not-serializable exception in RegexTokenizer during some ad-hoc testing. However, I couldn't reproduce the issue. Maybe it is a REPL bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
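For context, a minimal sketch of one plausible failure mode (an assumption, not a confirmed reproduction): scala.util.matching.Regex#findAllIn returns a MatchIterator, which is not serializable, so a live iterator captured by a task closure fails at closure serialization, while materializing it first is safe:

{code}
val pattern = "\\w+".r

// MatchIterator is NOT serializable; shipping it inside a task closure
// would throw java.io.NotSerializableException at serialization time.
val risky = pattern.findAllIn("some ad-hoc text")

// Materializing the matches before they escape avoids the problem.
val tokens: Seq[String] = pattern.findAllIn("some ad-hoc text").toSeq
{code}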
[jira] [Updated] (SPARK-7739) Improve ChiSqSelector example code in the user guide
[ https://issues.apache.org/jira/browse/SPARK-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7739: - Target Version/s: 1.5.0 > Improve ChiSqSelector example code in the user guide > > > Key: SPARK-7739 > URL: https://issues.apache.org/jira/browse/SPARK-7739 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 1.3.1, 1.4.0 >Reporter: Xiangrui Meng >Assignee: Seth Hendrickson >Priority: Minor > Labels: starter > > As discussed in > http://apache-spark-user-list.1001560.n3.nabble.com/Discretization-td22811.html > We should mention the values are gray levels (0-255) and change the division > to integer division. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8696) StreamingLDA
yuhao yang created SPARK-8696: - Summary: StreamingLDA Key: SPARK-8696 URL: https://issues.apache.org/jira/browse/SPARK-8696 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Streaming LDA would be a natural extension of online LDA. For now, though, we need to settle on an implementation of LDA prediction in order to support the predictOn method in the streaming version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
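A hypothetical shape for such an API, modeled on MLlib's existing StreamingKMeans; everything below is an assumption sketched for discussion, with predictOn left unimplemented because it depends on the per-document prediction method the issue says must be settled first:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Hypothetical streaming wrapper over online LDA (by analogy with StreamingKMeans).
class StreamingLDA(val k: Int) {
  def trainOn(data: DStream[(Long, Vector)]): Unit = {
    // each mini-batch would feed the existing online (mini-batch) LDA optimizer
  }
  def predictOn(data: DStream[(Long, Vector)]): DStream[Vector] = {
    // needs a settled per-document topic-inference method on the LDA model
    ???
  }
}
{code}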
[jira] [Updated] (SPARK-7739) Improve ChiSqSelector example code in the user guide
[ https://issues.apache.org/jira/browse/SPARK-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7739: - Assignee: Seth Hendrickson > Improve ChiSqSelector example code in the user guide > > > Key: SPARK-7739 > URL: https://issues.apache.org/jira/browse/SPARK-7739 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 1.3.1, 1.4.0 >Reporter: Xiangrui Meng >Assignee: Seth Hendrickson >Priority: Minor > Labels: starter > > As discussed in > http://apache-spark-user-list.1001560.n3.nabble.com/Discretization-td22811.html > We should mention the values are gray levels (0-255) and change the division > to integer division. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
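A hedged sketch of the discretization the fix calls for, assuming MNIST-style gray-level features in [0, 255]; dividing by 16 and flooring (integer division for non-negative values) buckets each pixel into one of 16 levels:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Illustrative: gray levels 0-255 reduced to 16 bins via integer division.
def discretize(lp: LabeledPoint): LabeledPoint =
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map(v => (v / 16).floor)))
{code}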
[jira] [Resolved] (SPARK-8575) Deprecate callUDF in favor of udf
[ https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8575. -- Resolution: Fixed Issue resolved by pull request 6993 [https://github.com/apache/spark/pull/6993] > Deprecate callUDF in favor of udf > - > > Key: SPARK-8575 > URL: https://issues.apache.org/jira/browse/SPARK-8575 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Benjamin Fradet >Assignee: Benjamin Fradet >Priority: Minor > Fix For: 1.5.0 > > > Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to > use {{udf}} in favor of {{callUDF}} wherever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
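For context, a minimal sketch of the preferred pattern after this change; df and the column name are illustrative, and callUDF is the call being deprecated:

{code}
import org.apache.spark.sql.functions.udf

// Preferred: define a typed UDF once and apply it as a Column expression.
val plusOne = udf((x: Int) => x + 1)
val result = df.select(plusOne(df("value")))   // df is an illustrative DataFrame
{code}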
[jira] [Resolved] (SPARK-5962) [MLLIB] Python support for Power Iteration Clustering
[ https://issues.apache.org/jira/browse/SPARK-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5962. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6992 [https://github.com/apache/spark/pull/6992] > [MLLIB] Python support for Power Iteration Clustering > - > > Key: SPARK-5962 > URL: https://issues.apache.org/jira/browse/SPARK-5962 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Stephen Boesch >Assignee: Yanbo Liang > Labels: python > Fix For: 1.5.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Add Python support for the Power Iteration Clustering feature. Here is a > fragment of the Python API as we plan to implement it: > /** >* Java stub for Python mllib PowerIterationClustering.run() >*/ > def trainPowerIterationClusteringModel( > data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)], > k: Int, > maxIterations: Int, > runs: Int, > initializationMode: String, > seed: java.lang.Long): PowerIterationClusteringModel = { > val picAlg = new PowerIterationClustering() > .setK(k) > .setMaxIterations(maxIterations) > try { > picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK)) > } finally { > data.rdd.unpersist(blocking = false) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7212) Frequent pattern mining for sequential item sets
[ https://issues.apache.org/jira/browse/SPARK-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7212. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6997 [https://github.com/apache/spark/pull/6997] > Frequent pattern mining for sequential item sets > > > Key: SPARK-7212 > URL: https://issues.apache.org/jira/browse/SPARK-7212 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Feynman Liang > Fix For: 1.5.0 > > > Currently, FPGrowth handles unordered item sets. It would be great to be > able to handle sequences of items, in which the order matters. This JIRA is > for discussing modifications to FPGrowth and/or new algorithms for handling > sequences. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8695) TreeAggregation shouldn't be triggered for 5 partitions
[ https://issues.apache.org/jira/browse/SPARK-8695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8695: - Priority: Minor (was: Major) > TreeAggregation shouldn't be triggered for 5 partitions > --- > > Key: SPARK-8695 > URL: https://issues.apache.org/jira/browse/SPARK-8695 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > If an RDD has 5 partitions, tree aggregation doesn't reduce the wall-clock > time. Instead, it introduces scheduling and shuffling overhead. We should > update the condition to use tree aggregation (code attached): > {code} > while (numPartitions > scale + numPartitions / scale) { > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8695) TreeAggregation shouldn't be triggered for 5 partitions
Xiangrui Meng created SPARK-8695: Summary: TreeAggregation shouldn't be triggered for 5 partitions Key: SPARK-8695 URL: https://issues.apache.org/jira/browse/SPARK-8695 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng If an RDD has 5 partitions, tree aggregation doesn't reduce the wall-clock time. Instead, it introduces scheduling and shuffling overhead. We should update the condition to use tree aggregation (code attached): {code} while (numPartitions > scale + numPartitions / scale) { {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
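A worked illustration of why 5 partitions trigger an extra level, assuming treeAggregate's default depth of 2, for which the scale factor is ceil(sqrt(5)) = 3: the integer division in the current condition undercounts the post-combine partition count. The ceil-based variant below is one possible tightening, shown as an assumption rather than the merged fix:

{code}
val numPartitions = 5
val scale = math.max(math.ceil(math.sqrt(numPartitions)).toInt, 2)  // 3

// Current condition: 5 > 3 + 5/3 = 3 + 1 -> true, so a tree level is added.
val current = numPartitions > scale + numPartitions / scale

// Ceil variant: 5 > 3 + ceil(5/3) = 3 + 2 -> false, no extra level for 5 partitions.
val tightened = numPartitions > scale + math.ceil(numPartitions.toDouble / scale).toInt
{code}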
[jira] [Commented] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled
[ https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605126#comment-14605126 ] Carson Wang commented on SPARK-8405: Thank you very much, [~hshreedharan]. After I configured yarn.log.server.url in yarn-site.xml, the log URL did redirect to the MR job history server's UI and was able to show the logs. But this seems a little strange, because we need to view Spark's logs on the MR history server. Would it make more sense for the Spark history server to read and show the aggregated logs itself? > Show executor logs on Web UI when Yarn log aggregation is enabled > - > > Key: SPARK-8405 > URL: https://issues.apache.org/jira/browse/SPARK-8405 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Carson Wang > Attachments: SparkLogError.png > > > When running a Spark application in Yarn mode with Yarn log aggregation > enabled, the customer is not able to view executor logs on the history server Web > UI. The only way for the customer to view the logs is through the Yarn command > "yarn logs -applicationId ". > A screenshot of the error is attached. When you click an executor's log link > on the Spark history server, you'll see the error if Yarn log aggregation is > enabled. The log URL redirects the user to the node manager's UI. This works if > the logs are located on that node. But since log aggregation is enabled, the > local logs are deleted once log aggregation is completed. > The logs should be available through the web UIs just like for other Hadoop > components such as MapReduce. For security reasons, end users may not be able to > log into the nodes and run the yarn logs -applicationId command. The web UIs > can be viewable and exposed through the firewall if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
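For readers hitting the same redirect issue, the workaround mentioned above uses the standard Hadoop property shown below (the hostname is illustrative); it tells node managers where to send log requests once aggregation has removed the local files:

{code}
<!-- yarn-site.xml: redirect log links to the MR JobHistory server's log UI -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://historyserver.example.com:19888/jobhistory/logs</value>
</property>
{code}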
[jira] [Assigned] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8694: --- Assignee: (was: Apache Spark) > Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid > freezing the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta > > When there are massive numbers of tasks in the stage page (such as running > sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since the Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so > that the user can view the page while the Event Timeline renders in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605123#comment-14605123 ] Apache Spark commented on SPARK-8694: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7071 > Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid > freezing the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta > > When there are massive numbers of tasks in the stage page (such as running > sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since the Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so > that the user can view the page while the Event Timeline renders in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
[ https://issues.apache.org/jira/browse/SPARK-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8694: --- Assignee: Apache Spark > Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid > freezing the page > > > Key: SPARK-8694 > URL: https://issues.apache.org/jira/browse/SPARK-8694 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 1.4.0, 1.5.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > When there are massive numbers of tasks in the stage page (such as running > sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds > to render the graph (drawTaskAssignmentTimeline) in my environment. The page > is unresponsive until the graph is ready. > However, since the Event Timeline is hidden by default, we can defer > drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so > that the user can view the page while the Event Timeline renders in the > background. > This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid > blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605117#comment-14605117 ] Shay Rojansky commented on SPARK-8374: -- Any chance someone can look at this bug, at least to confirm it? This is a pretty serious issue preventing use of Spark 1.4 on YARN where preemption may happen... > Job frequently hangs after YARN preemption > -- > > Key: SPARK-8374 > URL: https://issues.apache.org/jira/browse/SPARK-8374 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.4.0 > Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 >Reporter: Shay Rojansky >Priority: Critical > > After upgrading to Spark 1.4.0, jobs that get preempted very frequently will > not reacquire executors and will therefore hang. To reproduce: > 1. I run Spark job A that acquires all grid resources > 2. I run Spark job B in a higher-priority queue that acquires all grid > resources. Job A is fully preempted. > 3. Kill job B, releasing all resources > 4. Job A should at this point reacquire all grid resources, but occasionally > doesn't. Repeating the preemption scenario makes the bad behavior occur > within a few attempts. > (see logs at bottom). > Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption > issues; the work there may be related to the new issues. > The 1.4.0 preemption situation is considerably worse than in 1.3.1 (we've > downgraded to 1.3.1 just because of this issue). > Logs > -- > When job B (the preemptor) first acquires an application master, the following > is logged by job A (the preemptee): > {noformat} > ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc > client disassociated > INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 > WARN ReliableDeliverySupervisor: Association with remote system > [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address > is now gated for [5000] ms. Reason is: [Disassociated]. > WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, > g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) > INFO DAGScheduler: Executor lost: 447 (epoch 0) > INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from > BlockManagerMaster. > INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, > g023.grid.eaglerd.local, 41406) > INFO BlockManagerMaster: Removed 447 successfully in removeExecutor > {noformat} > (It's strange for errors/warnings to be logged for preemption) > Later, when job B's AM starts requesting its resources, I get lots of the > following in job A: > {noformat} > ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc > client disassociated > INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 > WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, > g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) > WARN ReliableDeliverySupervisor: Association with remote system > [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address > is now gated for [5000] ms. Reason is: [Disassociated]. > {noformat} > Finally, when I kill job B, job A emits lots of the following: > {noformat} > INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 > WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! 
> {noformat} > And finally after some time: > {noformat} > WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: > 165964 ms exceeds timeout 12 ms > ERROR YarnScheduler: Lost an executor 466 (already removed): Executor > heartbeat timed out after 165964 ms > {noformat} > At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8694) Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page
Kousuke Saruta created SPARK-8694: - Summary: Defer executing drawTaskAssignmentTimeline until the page has loaded to avoid freezing the page Key: SPARK-8694 URL: https://issues.apache.org/jira/browse/SPARK-8694 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0, 1.5.0 Reporter: Kousuke Saruta When there are massive numbers of tasks in the stage page (such as running sc.parallelize(1 to 10, 1).count()), the Event Timeline needs 15+ seconds to render the graph (drawTaskAssignmentTimeline) in my environment. The page is unresponsive until the graph is ready. However, since the Event Timeline is hidden by default, we can defer drawTaskAssignmentTimeline until the page has loaded to avoid freezing it, so that the user can view the page while the Event Timeline renders in the background. This PR puts drawTaskAssignmentTimeline into $(function(){}) to avoid blocking page loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605112#comment-14605112 ] Animesh Baranawal commented on SPARK-8636: -- No, a row with null will not be equal to another row with null if we follow the modification. What do you say, [~smolav]? > CaseKeyWhen has incorrect NULL handling > --- > > Key: SPARK-8636 > URL: https://issues.apache.org/jira/browse/SPARK-8636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Santiago M. Mola > Labels: starter > > The CaseKeyWhen implementation in Spark uses the following equals implementation: > {code} > private def equalNullSafe(l: Any, r: Any) = { > if (l == null && r == null) { > true > } else if (l == null || r == null) { > false > } else { > l == r > } > } > {code} > This is not correct, since in SQL, NULL is never equal to NULL (actually, it > is not unequal either). In this case, a NULL value in a CASE WHEN expression > should never match. > For example, you can execute this in MySQL: > {code} > SELECT CASE NULL WHEN NULL THEN "NULL MATCHES" ELSE "NULL DOES NOT MATCH" END > FROM DUAL; > {code} > And the result will be "NULL DOES NOT MATCH". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
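A minimal sketch of the SQL-conformant comparison being discussed (an illustration of the semantics, not the committed patch): a NULL on either side should simply never match.

{code}
// SQL semantics: NULL = NULL is not true, so a CASE key of NULL matches no branch.
def matchesBranch(key: Any, branchKey: Any): Boolean =
  if (key == null || branchKey == null) false else key == branchKey
{code}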
[jira] [Comment Edited] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605078#comment-14605078 ] Gergely Svigruha edited comment on SPARK-8616 at 6/29/15 3:43 AM: -- Thanks for working on this! FYI, MySQL may allow double quotes if the ANSI_QUOTES mode is enabled: https://dev.mysql.com/doc/refman/5.0/en/identifiers.html was (Author: gsvigruha): FYI, MySQL may allow double quotes if the ANSI_QUOTES mode is enabled: https://dev.mysql.com/doc/refman/5.0/en/identifiers.html > SQLContext doesn't handle tricky column names when loading from JDBC > > > Key: SPARK-8616 > URL: https://issues.apache.org/jira/browse/SPARK-8616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 >Reporter: Gergely Svigruha > > Reproduce: > - create a table in a relational database (in my case SQLite) with a column > name containing a space: > CREATE TABLE my_table (id INTEGER, "tricky column" TEXT); > - try to create a DataFrame using that table: > sqlContext.read.format("jdbc").options(Map( > "url" -> "jdbc:sqlite:...", > "dbtable" -> "my_table")).load() > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > column: tricky) > According to the SQL spec this should be valid: > http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605078#comment-14605078 ] Gergely Svigruha commented on SPARK-8616: - FYI, MySQL may allow double quotes if the ANSI_QUOTES mode is enabled: https://dev.mysql.com/doc/refman/5.0/en/identifiers.html > SQLContext doesn't handle tricky column names when loading from JDBC > > > Key: SPARK-8616 > URL: https://issues.apache.org/jira/browse/SPARK-8616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 >Reporter: Gergely Svigruha > > Reproduce: > - create a table in a relational database (in my case SQLite) with a column > name containing a space: > CREATE TABLE my_table (id INTEGER, "tricky column" TEXT); > - try to create a DataFrame using that table: > sqlContext.read.format("jdbc").options(Map( > "url" -> "jdbc:sqlite:...", > "dbtable" -> "my_table")).load() > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > column: tricky) > According to the SQL spec this should be valid: > http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
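The usual remedy is to emit ANSI delimited identifiers when building the generated SELECT; a minimal sketch of that quoting (illustrative, not the actual JDBC data source code):

{code}
// Quote a column name as an ANSI delimited identifier, escaping embedded quotes.
def quoteIdentifier(name: String): String =
  "\"" + name.replace("\"", "\"\"") + "\""

// e.g. SELECT "id", "tricky column" FROM my_table
val columns = Seq("id", "tricky column").map(quoteIdentifier).mkString(", ")
val sql = s"SELECT $columns FROM my_table"
{code}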
[jira] [Commented] (SPARK-8651) Lasso with SGD not Converging properly
[ https://issues.apache.org/jira/browse/SPARK-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605056#comment-14605056 ] yuhao yang commented on SPARK-8651: --- [~aazout] I know someone who met a similar issue. Usually it's because the weights have not started converging. Try more iterations (2000 or more) or a larger step size (4 or 5). Let me know if that helps. (If not, please share part of your code or data.) > Lasso with SGD not Converging properly > -- > > Key: SPARK-8651 > URL: https://issues.apache.org/jira/browse/SPARK-8651 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Albert Azout > > We are having issues getting Lasso with SGD to converge properly. The weights > output are extremely large values. We have tried multiple miniBatchRatios > and still see the same issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
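A hedged sketch of the suggestion above against the MLlib 1.4 API; the data path and regParam are illustrative, sc is assumed to be an existing SparkContext, and numIterations and stepSize are the values suggested in the comment:

{code}
import org.apache.spark.mllib.regression.LassoWithSGD
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/lasso_points.txt")  // illustrative path

// More iterations and a larger step size, per the comment above.
val model = LassoWithSGD.train(
  data,
  2000,  // numIterations: try 2000 or more
  4.0,   // stepSize: try 4 or 5
  0.01,  // regParam (illustrative)
  1.0)   // miniBatchFraction
{code}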
[jira] [Updated] (SPARK-7845) Bump "Hadoop 1" tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7845: Fix Version/s: 1.5.0 > Bump "Hadoop 1" tests to version 1.2.1 > -- > > Key: SPARK-7845 > URL: https://issues.apache.org/jira/browse/SPARK-7845 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Patrick Wendell >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.4.0, 1.5.0 > > > A small number of APIs were added to Hadoop between 1.0.4 and 1.2.1. It > appears this is one cause of SPARK-7843, since some Hive code relies on newer > Hadoop APIs. My feeling is we should just bump our tested version up to > 1.2.1 (both versions are extremely old). If users are still on < 1.2.1 and > run into some of these corner cases, we can consider doing some engineering > and supporting the older versions. I'd like to bump our test version though > and let this be driven by users, if they exist. > https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7845) Bump "Hadoop 1" tests to version 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7845. - Resolution: Fixed Assignee: Cheng Lian (was: Yin Huai) Target Version/s: 1.4.0, 1.5.0 (was: 1.4.0) https://github.com/apache/spark/pull/7062 has fixed the master branch (1.5.0). > Bump "Hadoop 1" tests to version 1.2.1 > -- > > Key: SPARK-7845 > URL: https://issues.apache.org/jira/browse/SPARK-7845 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Patrick Wendell >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.4.0 > > > A small number of APIs were added to Hadoop between 1.0.4 and 1.2.1. It > appears this is one cause of SPARK-7843, since some Hive code relies on newer > Hadoop APIs. My feeling is we should just bump our tested version up to > 1.2.1 (both versions are extremely old). If users are still on < 1.2.1 and > run into some of these corner cases, we can consider doing some engineering > and supporting the older versions. I'd like to bump our test version though > and let this be driven by users, if they exist. > https://github.com/apache/spark/blob/master/dev/run-tests#L43 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605044#comment-14605044 ] Yin Huai commented on SPARK-8693: - [~boyork] Can you take a look at it when you get time? > profiles and goals are not printed in a nice way > > > Key: SPARK-8693 > URL: https://issues.apache.org/jira/browse/SPARK-8693 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Yin Huai >Priority: Minor > > In our master build, I see > {code} > -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these > arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using > SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) > using SBT with these arguments: -Phive-thriftserver[info] Building Spark > (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark > (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark > (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] > Building Spark (w/Hive 0.13.1) using SBT with these arguments: > streaming-kafka-assembly/assembly > {code} > It seems we format the string in the wrong way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8693) profiles and goals are not printed in a nice way
Yin Huai created SPARK-8693: --- Summary: profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Reporter: Yin Huai Priority: Minor In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} It seems we format the string in the wrong way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
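The output above looks like one banner rendered per argument and concatenated with no separator, rather than one banner over the joined argument list. Schematically (the real code lives in dev/run-tests, so this Scala sketch illustrates the bug shape, not the script itself):

{code}
val sbtArgs = Seq("-Phadoop-1", "-Dhadoop.version=1.0.4", "-Pkinesis-asl", "package")
val banner = "[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: "

// Buggy shape: one banner per argument, concatenated with no separator.
val wrong = sbtArgs.map(arg => banner + arg).mkString

// Intended shape: a single banner followed by the space-joined arguments.
val right = banner + sbtArgs.mkString(" ")
{code}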
[jira] [Assigned] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8443: --- Assignee: Apache Spark > GenerateMutableProjection Exceeds JVM Code Size Limits > -- > > Key: SPARK-8443 > URL: https://issues.apache.org/jira/browse/SPARK-8443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Sen Fang >Assignee: Apache Spark > > GenerateMutableProjection puts all expression columns into a single apply > function. When there are a lot of columns, the apply function's code size > exceeds the 64 KB limit, which is a hard JVM limit that cannot be changed. > This came up when we were aggregating about 100 columns using the codegen and > unsafe features. > I wrote a unit test that reproduces this issue. > https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala > This test currently fails at 2048 expressions. It seems the master is more > tolerant than branch-1.4 about this because the generated code is more concise. > While the code on master has changed since branch-1.4, I am able to reproduce > the problem in master. For now I hacked my way in branch-1.4 to work around > this problem by wrapping each expression in a separate function and then calling > those functions sequentially in apply. The proper way is probably to check the > length of the projectCode and break it up as necessary. (This actually seems > easier in master since we build code from strings rather than > quasiquotes.) > Let me know if anyone has additional thoughts on this; I'm happy to > contribute a pull request. > Attaching the stack trace produced by the unit test: > {code} > [info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds) > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class "SC$SpecificProjection" > grows beyond 64 KB > [info] at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263) > [info] at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > [info] at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > [info] at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at scala.collection.immutable.Range.foreach(Range.scala:141) > [info] at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) > [info] at > scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.sc
[jira] [Assigned] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8443: --- Assignee: (was: Apache Spark) > GenerateMutableProjection Exceeds JVM Code Size Limits > -- > > Key: SPARK-8443 > URL: https://issues.apache.org/jira/browse/SPARK-8443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Sen Fang > > GenerateMutableProjection puts all expression columns into a single apply > function. When there are a lot of columns, the apply function's code size > exceeds the 64KB limit, which is a hard JVM limit that cannot be changed. > This came up when we were aggregating about 100 columns using the codegen and > unsafe features. > I wrote a unit test that reproduces this issue: > https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala > This test currently fails at 2048 expressions. Master seems more tolerant than > branch-1.4 here because its generated code is more concise. > While the code on master has changed since branch-1.4, I am able to reproduce > the problem on master. For now I hacked around this in branch-1.4 by wrapping > each expression in a separate function and then calling those functions > sequentially in apply. The proper fix is probably to check the length of the > projectCode and break it up as necessary. (This seems easier on master, > actually, since we build the code from strings rather than quasiquotes.) > Let me know if anyone has additional thoughts on this; I'm happy to > contribute a pull request. > Attaching stack trace produced by unit test > {code} > [info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds) > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class "SC$SpecificProjection" > grows beyond 64 KB > [info] at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263) > [info] at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > [info] at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > [info] at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at scala.collection.immutable.Range.foreach(Range.scala:141) > [info] at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) > [info] at > scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$ano
[jira] [Commented] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605028#comment-14605028 ] Apache Spark commented on SPARK-8443: - User 'saurfang' has created a pull request for this issue: https://github.com/apache/spark/pull/7076 > GenerateMutableProjection Exceeds JVM Code Size Limits > -- > > Key: SPARK-8443 > URL: https://issues.apache.org/jira/browse/SPARK-8443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Sen Fang > > GenerateMutableProjection puts all expression columns into a single apply > function. When there are a lot of columns, the apply function's code size > exceeds the 64KB limit, which is a hard JVM limit that cannot be changed. > This came up when we were aggregating about 100 columns using the codegen and > unsafe features. > I wrote a unit test that reproduces this issue: > https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala > This test currently fails at 2048 expressions. Master seems more tolerant than > branch-1.4 here because its generated code is more concise. > While the code on master has changed since branch-1.4, I am able to reproduce > the problem on master. For now I hacked around this in branch-1.4 by wrapping > each expression in a separate function and then calling those functions > sequentially in apply. The proper fix is probably to check the length of the > projectCode and break it up as necessary. (This seems easier on master, > actually, since we build the code from strings rather than quasiquotes.) > Let me know if anyone has additional thoughts on this; I'm happy to > contribute a pull request. 
> Attaching stack trace produced by unit test > {code} > [info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds) > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(Ljava/lang/Object;)Ljava/lang/Object;" of class "SC$SpecificProjection" > grows beyond 64 KB > [info] at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263) > [info] at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > [info] at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > [info] at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) > [info] at scala.collection.immutable.Range.foreach(Range.scala:141) > [info] at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) > [info] at > scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalate
[jira] [Updated] (SPARK-8692) re-order the case statements that handle catalyst data types
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8692: -- Shepherd: Cheng Lian > re-order the case statements that handle catalyst data types > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8424) Add blacklist mechanism for task scheduler and Yarn container allocation
[ https://issues.apache.org/jira/browse/SPARK-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao closed SPARK-8424. -- Resolution: Duplicate > Add blacklist mechanism for task scheduler and Yarn container allocation > > > Key: SPARK-8424 > URL: https://issues.apache.org/jira/browse/SPARK-8424 > Project: Spark > Issue Type: New Feature > Components: Scheduler, YARN >Affects Versions: 1.4.0 >Reporter: Saisai Shao > > MapReduce has long had a blacklist and a graylist to exclude constantly > failing TaskTrackers/nodes; this is important for a large cluster, where the > chance of hardware and software failure keeps growing. > Unfortunately the current version of Spark lacks such a mechanism to blacklist > constantly failing executors/nodes. The only blacklist mechanism in Spark is > to avoid relaunching a task on the same executor where that task previously > failed within a specified time. So here we propose a > new feature to add a blacklist mechanism to Spark; the proposal is divided > into two sub-tasks: > 1. Add a heuristic blacklist algorithm that tracks the status of executors via > the status of finished tasks, and enable the blacklist mechanism in task > scheduling. > 2. Enable the blacklist mechanism in YARN container allocation (avoid allocating > containers on blacklisted hosts). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
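As a rough illustration of sub-task 1, such a heuristic could count recent task failures per executor and exclude executors that cross a threshold within a sliding window. The sketch below is hypothetical; the class, threshold, and window are not from the proposal:
{code}
import scala.collection.mutable

// Blacklist an executor once it accumulates maxFailures failed tasks
// within the last windowMs milliseconds.
class ExecutorBlacklist(maxFailures: Int, windowMs: Long) {
  private val failures = mutable.Map.empty[String, List[Long]]

  def recordFailure(executorId: String, now: Long): Unit = {
    val recent = failures.getOrElse(executorId, Nil).filter(now - _ < windowMs)
    failures(executorId) = now :: recent
  }

  def isBlacklisted(executorId: String, now: Long): Boolean =
    failures.getOrElse(executorId, Nil).count(now - _ < windowMs) >= maxFailures
}
{code}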
[jira] [Commented] (SPARK-8424) Add blacklist mechanism for task scheduler and Yarn container allocation
[ https://issues.apache.org/jira/browse/SPARK-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605001#comment-14605001 ] Saisai Shao commented on SPARK-8424: OK, I will close this JIRA. > Add blacklist mechanism for task scheduler and Yarn container allocation > > > Key: SPARK-8424 > URL: https://issues.apache.org/jira/browse/SPARK-8424 > Project: Spark > Issue Type: New Feature > Components: Scheduler, YARN >Affects Versions: 1.4.0 >Reporter: Saisai Shao > > MapReduce has long had a blacklist and a graylist to exclude constantly > failing TaskTrackers/nodes; this is important for a large cluster, where the > chance of hardware and software failure keeps growing. > Unfortunately the current version of Spark lacks such a mechanism to blacklist > constantly failing executors/nodes. The only blacklist mechanism in Spark is > to avoid relaunching a task on the same executor where that task previously > failed within a specified time. So here we propose a > new feature to add a blacklist mechanism to Spark; the proposal is divided > into two sub-tasks: > 1. Add a heuristic blacklist algorithm that tracks the status of executors via > the status of finished tasks, and enable the blacklist mechanism in task > scheduling. > 2. Enable the blacklist mechanism in YARN container allocation (avoid allocating > containers on blacklisted hosts). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8692) re-order the case statements that handle catalyst data types
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-8692: --- Summary: re-order the case statements that handle catalyst data types (was: refactor the handling of date and timestamp) > re-order the case statements that handle catalyst data types > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8692) refactor the handling of date and timestamp
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-8692: --- Summary: refactor the handling of date and timestamp (was: improve date and timestamp handling) > refactor the handling of date and timestamp > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8577) ScalaReflectionLock.synchronized can cause deadlock
[ https://issues.apache.org/jira/browse/SPARK-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604935#comment-14604935 ] koert kuipers commented on SPARK-8577: -- I do not have a way to reproduce this in non-test code. I do have a thread dump but cannot share all of it; see below. Both threads are blocked by ScalaReflectionLock.synchronized.
Thread 2942: (state = BLOCKED)
- org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(org.apache.spark.sql.catalyst.ScalaReflection, scala.reflect.api.Types$TypeApi) @bci=2611, line=140 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(scala.reflect.api.Types$TypeApi) @bci=2, line=28 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(scala.reflect.api.Symbols$SymbolApi) @bci=21, line=123 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(java.lang.Object) @bci=5, line=121 (Interpreted frame)
- scala.collection.immutable.List.map(scala.Function1, scala.collection.generic.CanBuildFrom) @bci=74, line=277 (Compiled frame)
- org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(org.apache.spark.sql.catalyst.ScalaReflection, scala.reflect.api.Types$TypeApi) @bci=1699, line=121 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(scala.reflect.api.Types$TypeApi) @bci=2, line=28 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(org.apache.spark.sql.catalyst.ScalaReflection, scala.reflect.api.TypeTags$TypeTag) @bci=12, line=59 (Interpreted frame)
- org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(scala.reflect.api.TypeTags$TypeTag) @bci=2, line=28 (Interpreted frame)
Thread 2941: (state = BLOCKED)
- org.apache.spark.sql.types.AtomicType.<init>() @bci=11, line=95 (Interpreted frame)
- org.apache.spark.sql.types.NumericType.<init>() @bci=1, line=107 (Interpreted frame)
- org.apache.spark.sql.types.IntegralType.<init>() @bci=1, line=141 (Interpreted frame)
- org.apache.spark.sql.types.IntegerType.<init>() @bci=1, line=34 (Interpreted frame)
- org.apache.spark.sql.types.IntegerType$.<init>() @bci=1, line=54 (Interpreted frame)
- org.apache.spark.sql.types.IntegerType$.<clinit>() @bci=3 (Interpreted frame)
> ScalaReflectionLock.synchronized can cause deadlock > --- > > Key: SPARK-8577 > URL: https://issues.apache.org/jira/browse/SPARK-8577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: koert kuipers >Priority: Minor > > Just a heads up: I was doing some basic coding using DataFrame, Row, > StructType, etc. in my own project and I ended up with deadlocks in my sbt > tests due to the usage of ScalaReflectionLock.synchronized in the Spark SQL > code. > The issue went away when I changed my build to have: > parallelExecution in Test := false > so that the tests run consecutively... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
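The dump above matches a classic initialization-order deadlock: one thread holds a global lock while forcing a class's static initializer, while another thread is already inside that initializer and waiting for the same lock. A contrived reproduction of the pattern (not Spark code; Scala 2.12 syntax) looks like this:
{code}
object GlobalLock

object Holder {
  // Evaluated inside Holder's <clinit>, which acquires GlobalLock.
  val value: Int = GlobalLock.synchronized { 42 }
}

object DeadlockDemo extends App {
  // If thread b enters Holder's static initializer first and thread a
  // takes GlobalLock first, each waits on the other forever.
  val a = new Thread(() => GlobalLock.synchronized { println(Holder.value) })
  val b = new Thread(() => println(Holder.value))
  a.start(); b.start()
}
{code}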
[jira] [Updated] (SPARK-8677) Decimal divide operation throws ArithmeticException
[ https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8677: -- Assignee: Liang-Chi Hsieh > Decimal divide operation throws ArithmeticException > --- > > Key: SPARK-8677 > URL: https://issues.apache.org/jira/browse/SPARK-8677 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.5.0 > > > Please refer to the [BigDecimal > doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]: > {quote} > ... the rounding mode setting of a MathContext object with a precision > setting of 0 is not used and thus irrelevant. In the case of divide, the > exact quotient could have an infinitely long decimal expansion; for example, > 1 divided by 3. > {quote} > Because we use MathContext.UNLIMITED in toBigDecimal, the Decimal divide > operation throws the following exception: > {code} > val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3) > [info] java.lang.ArithmeticException: Non-terminating decimal expansion; no > exact representable decimal result. > [info] at java.math.BigDecimal.divide(BigDecimal.java:1690) > [info] at java.math.BigDecimal.divide(BigDecimal.java:1723) > [info] at scala.math.BigDecimal.$div(BigDecimal.scala:256) > [info] at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
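For reference, the failure is inherent to java.math.BigDecimal rather than to Spark: dividing with unlimited precision throws exactly this exception for quotients like 1/3, while any bounded MathContext terminates. A standalone illustration (DECIMAL128 is just an example context, not necessarily what the fix uses):
{code}
import java.math.{BigDecimal => JBigDecimal, MathContext}

val one   = new JBigDecimal("1.000")
val three = new JBigDecimal("3.000")

// Unlimited precision: 1/3 has no finite decimal expansion, so this throws
// java.lang.ArithmeticException: Non-terminating decimal expansion.
// one.divide(three, MathContext.UNLIMITED)

// Bounded precision rounds the quotient and terminates normally.
val q = one.divide(three, MathContext.DECIMAL128)  // 0.3333333333333333...
{code}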
[jira] [Resolved] (SPARK-8677) Decimal divide operation throws ArithmeticException
[ https://issues.apache.org/jira/browse/SPARK-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8677. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7056 [https://github.com/apache/spark/pull/7056] > Decimal divide operation throws ArithmeticException > --- > > Key: SPARK-8677 > URL: https://issues.apache.org/jira/browse/SPARK-8677 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 1.5.0 > > > Please refer to the [BigDecimal > doc|http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html]: > {quote} > ... the rounding mode setting of a MathContext object with a precision > setting of 0 is not used and thus irrelevant. In the case of divide, the > exact quotient could have an infinitely long decimal expansion; for example, > 1 divided by 3. > {quote} > Because we use MathContext.UNLIMITED in toBigDecimal, the Decimal divide > operation throws the following exception: > {code} > val decimal = Decimal(1.0, 10, 3) / Decimal(3.0, 10, 3) > [info] java.lang.ArithmeticException: Non-terminating decimal expansion; no > exact representable decimal result. > [info] at java.math.BigDecimal.divide(BigDecimal.java:1690) > [info] at java.math.BigDecimal.divide(BigDecimal.java:1723) > [info] at scala.math.BigDecimal.$div(BigDecimal.scala:256) > [info] at org.apache.spark.sql.types.Decimal.$div(Decimal.scala:272) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8674: --- Assignee: Apache Spark > 2-sample, 2-sided Kolmogorov Smirnov Test Implementation > > > Key: SPARK-8674 > URL: https://issues.apache.org/jira/browse/SPARK-8674 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Assignee: Apache Spark >Priority: Minor > > We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov > test for two RDD[Double]s. The calculation provides a test of the null > hypothesis that both samples come from the same probability distribution. The > implementation seeks to minimize the number of shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
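For context, the 2-sample, 2-sided statistic is the largest absolute gap between the two empirical CDFs. A minimal local (non-distributed) sketch of the statistic, which is not the pull-request implementation and handles ties only crudely:
{code}
// D = sup_x |F1(x) - F2(x)|, computed by sweeping the two sorted samples.
def ksStatistic(xs: Array[Double], ys: Array[Double]): Double = {
  val sx = xs.sorted; val sy = ys.sorted
  var i = 0; var j = 0; var d = 0.0
  while (i < sx.length && j < sy.length) {
    if (sx(i) <= sy(j)) i += 1 else j += 1
    val gap = math.abs(i.toDouble / sx.length - j.toDouble / sy.length)
    if (gap > d) d = gap
  }
  d
}
{code}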
[jira] [Assigned] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8674: --- Assignee: (was: Apache Spark) > 2-sample, 2-sided Kolmogorov Smirnov Test Implementation > > > Key: SPARK-8674 > URL: https://issues.apache.org/jira/browse/SPARK-8674 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Priority: Minor > > We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov > test for two RDD[Double]s. The calculation provides a test of the null > hypothesis that both samples come from the same probability distribution. The > implementation seeks to minimize the number of shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8674) 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604898#comment-14604898 ] Apache Spark commented on SPARK-8674: - User 'josepablocam' has created a pull request for this issue: https://github.com/apache/spark/pull/7075 > 2-sample, 2-sided Kolmogorov Smirnov Test Implementation > > > Key: SPARK-8674 > URL: https://issues.apache.org/jira/browse/SPARK-8674 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero >Priority: Minor > > We added functionality to calculate a 2-sample, 2-sided Kolmogorov Smirnov > test for two RDD[Double]s. The calculation provides a test of the null > hypothesis that both samples come from the same probability distribution. The > implementation seeks to minimize the number of shuffles necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604897#comment-14604897 ] Shivaram Venkataraman commented on SPARK-8596: -- Merged https://github.com/apache/spark/pull/7068 to open the RStudio port. I'm keeping the JIRA open till we fix the second part of this issue too. > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7894: --- Assignee: Apache Spark > Graph Union Operator > > > Key: SPARK-7894 > URL: https://issues.apache.org/jira/browse/SPARK-7894 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Andy Huang >Assignee: Apache Spark > Labels: graph, union > Attachments: union_operator.png > > > This operator aims to union two graphs and generate a new graph directly. The > union of two graphs is the union of their vertex sets and their edge > families. Vertices and edges that are included in either graph will be part > of the new graph. > bq. G ∪ H = (VG ∪ VH, EG ∪ EH). > The image below shows a union of graph G and graph H > !union_operator.png|width=600px,align=center! > A simple interface would be: > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] > However, overlapping vertices and edges between the borders of the two graphs > are inevitable. For vertices, it's quite natural to just take the union and > remove the duplicates. But for edges, a mergeEdges function seems > more reasonable. > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: > (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
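Under the stated semantics, a sketch built from plain GraphX and RDD primitives might look like the following; this is illustrative only and not the pull-request implementation (duplicate vertices arbitrarily keep the first attribute seen):
{code}
import org.apache.spark.graphx._
import scala.reflect.ClassTag

def union[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  // Union of vertex sets; on duplicates keep one attribute.
  val vertices = g.vertices.union(h.vertices).reduceByKey((a, _) => a)
  // Union of edge families; merge attributes of edges present in both graphs.
  val edges = g.edges.union(h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}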
[jira] [Assigned] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7894: --- Assignee: (was: Apache Spark) > Graph Union Operator > > > Key: SPARK-7894 > URL: https://issues.apache.org/jira/browse/SPARK-7894 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Andy Huang > Labels: graph, union > Attachments: union_operator.png > > > This operator aims to union two graphs and generate a new graph directly. The > union of two graphs is the union of their vertex sets and their edge > families. Vertices and edges that are included in either graph will be part > of the new graph. > bq. G ∪ H = (VG ∪ VH, EG ∪ EH). > The image below shows a union of graph G and graph H > !union_operator.png|width=600px,align=center! > A simple interface would be: > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] > However, overlapping vertices and edges between the borders of the two graphs > are inevitable. For vertices, it's quite natural to just take the union and > remove the duplicates. But for edges, a mergeEdges function seems > more reasonable. > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: > (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604830#comment-14604830 ] Apache Spark commented on SPARK-7894: - User 'arnabguin' has created a pull request for this issue: https://github.com/apache/spark/pull/7074 > Graph Union Operator > > > Key: SPARK-7894 > URL: https://issues.apache.org/jira/browse/SPARK-7894 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Reporter: Andy Huang > Labels: graph, union > Attachments: union_operator.png > > > This operator aims to union two graphs and generate a new graph directly. The > union of two graphs is the union of their vertex sets and their edge > families. Vertices and edges that are included in either graph will be part > of the new graph. > bq. G ∪ H = (VG ∪ VH, EG ∪ EH). > The image below shows a union of graph G and graph H > !union_operator.png|width=600px,align=center! > A simple interface would be: > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] > However, overlapping vertices and edges between the borders of the two graphs > are inevitable. For vertices, it's quite natural to just take the union and > remove the duplicates. But for edges, a mergeEdges function seems > more reasonable. > bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: > (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604633#comment-14604633 ] Juan Rodríguez Hortalá edited comment on SPARK-8337 at 6/28/15 6:16 PM: Hi, I have worked a bit on the OffsetRange way, you can access the code at https://github.com/juanrh/spark/commit/56fbd5c38bd30b825a7818f1c56abb1f8b2beaff. I have added the following method to pyspark KafkaUtils
{code}
@staticmethod
def getOffsetRanges(rdd):
    scalaRdd = rdd._jrdd.rdd()
    offsetRangesArray = scalaRdd.offsetRanges()
    return [OffsetRange(topic = offsetRange.topic(),
                        partition = offsetRange.partition(),
                        fromOffset = offsetRange.fromOffset(),
                        untilOffset = offsetRange.untilOffset())
            for offsetRange in offsetRangesArray]
{code}
This method is used in KafkaUtils.createDirectStreamJB, which is based on the original KafkaUtilsPythonHelper.createDirectStream. The main problem I have is that I don't know where to store the OffsetRange objects. The naive trick of adding them to the __dict__ of each python RDD object doesn't work, the new field is lost in the pyspark wrappers. So the new method createDirectStreamJB takes two additional options, one for performing an action on the OffsetRange list, and another for adding it to each record of the DStream
{code}
def createDirectStreamJB(ssc, topics, kafkaParams, fromOffsets={},
                         keyDecoder=utf8_decoder, valueDecoder=utf8_decoder,
                         offsetRangeForeach=None, addOffsetRange=False):
    """
    FIXME: temporary working placeholder

    :param offsetRangeForeach: if different to None, this function should be
        a function from a list of OffsetRange to None, and is applied to the
        OffsetRange list of each rdd
    :param addOffsetRange: if False (default) output records are of the shape
        (kafkaKey, kafkaValue); if True output records are of the shape
        (offsetRange, (kafkaKey, kafkaValue)) for offsetRange the OffsetRange
        value for the Spark partition for the record
    """
{code}
This is an example of using createDirectStreamJB:
{code}
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
topics = ["test"]
kafkaParams = {"metadata.broker.list" : "localhost:9092"}

def offsetRangeForeach(offsetRangeList):
    print
    print
    for offsetRange in offsetRangeList:
        print offsetRange
    print
    print

kafkaStream = KafkaUtils.createDirectStreamJB(ssc, topics, kafkaParams,
                                              offsetRangeForeach=offsetRangeForeach,
                                              addOffsetRange=True)
# OffsetRange printed as , I guess due to some kind of pyspark proxy
kafkaStrStream = kafkaStream.map(lambda (offRan, (k, v)) :
    str(offRan._fromOffset) + " " + str(offRan._untilOffset)
    + " " + str(k) + " " + str(v))
# kafkaStream.pprint()
kafkaStrStream.pprint()
ssc.start()
ssc.awaitTermination(timeout=5)
{code}
which gets the following output
{code}
15/06/28 12:36:03 INFO InputInfoTracker: remove old batch metadata: 1435487761000 ms

OffsetRange(topic=test, partition=0, fromOffset=178, untilOffset=179)

15/06/28 12:36:04 INFO JobScheduler: Added jobs for time 1435487764000 ms
...
15/06/28 12:36:04 INFO DAGScheduler: Job 4 finished: runJob at PythonRDD.scala:366, took 0,075387 s
---
Time: 2015-06-28 12:36:04
---
178 179 None hola
()
15/06/28 12:36:04 INFO JobScheduler: Finished job streaming job 1435487764000 ms.0 from job set of time 1435487764000 ms
...
15/06/28 12:36:05 INFO BlockManager: Removing RDD 12

OffsetRange(topic=test, partition=0, fromOffset=179, untilOffset=180)

15/06/28 12:36:06 INFO JobScheduler: Starting job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
15/06/28 12:36:06 INFO DAGScheduler: Job 6 finished: start at NativeMethodAccessorImpl.java:-2, took 0,077993 s
---
Time: 2015-06-28 12:36:06
---
179 180 None caracola
()
15/06/28 12:36:06 INFO JobScheduler: Finished job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
{code}
Any thoughts on this will be appreciated, in particular about a suitable place to store the list of OffsetRange objects Greetings, Juan was (Author: juanrh): Hi, I have worked a bit on the OffsetRange way, you can access the code at https://github.com/juanrh/spark/commit/56fbd5c38bd30b825a7818f1c56abb1f8b2beaff. I have added the following method to pyspark KafkaUtils @staticmethod def getOffsetRanges(rdd): scalaRdd = rdd._jrdd.rdd() offsetRangesArray = scalaRdd.offsetRanges() return [
[jira] [Assigned] (SPARK-8692) improve date and timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8692: --- Assignee: (was: Apache Spark) > improve date and timestamp handling > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8692) improve date and timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604773#comment-14604773 ] Apache Spark commented on SPARK-8692: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7073 > improve date and timestamp handling > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8692) improve date and timestamp handling
[ https://issues.apache.org/jira/browse/SPARK-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8692: --- Assignee: Apache Spark > improve date and timestamp handling > --- > > Key: SPARK-8692 > URL: https://issues.apache.org/jira/browse/SPARK-8692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8692) improve date and timestamp handling
Wenchen Fan created SPARK-8692: -- Summary: improve date and timestamp handling Key: SPARK-8692 URL: https://issues.apache.org/jira/browse/SPARK-8692 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8645) Incorrect expression analysis with Hive
[ https://issues.apache.org/jira/browse/SPARK-8645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Zhulenev closed SPARK-8645. -- Resolution: Won't Fix > Incorrect expression analysis with Hive > --- > > Key: SPARK-8645 > URL: https://issues.apache.org/jira/browse/SPARK-8645 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: CDH 5.4.2 1.3.0 >Reporter: Eugene Zhulenev > Labels: dataframe > > When using a DataFrame backed by a Hive table, groupBy with agg can't resolve > columns if I pass them as Strings and not Columns. > This fails with: org.apache.spark.sql.AnalysisException: expression 'dt' is > neither present in the group by, nor is it an aggregate function. > {code} > val grouped = eventLogHLL > .groupBy(dt, ad_id, site_id).agg( > dt, > ad_id, > col(site_id) as site_id, > sum(imp_count) as imp_count, > sum(click_count) as click_count > ) > {code} > This works fine: > {code} > val grouped = eventLogHLL > .groupBy(col(dt), col(ad_id), col(site_id)).agg( > col(dt) as dt, > col(ad_id) as ad_id, > col(site_id) as site_id, > sum(imp_count) as imp_count, > sum(click_count) as click_count > ) > {code} > Integration tests running with "embedded" Spark and DataFrames generated from > RDDs work fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8691: --- Assignee: Apache Spark > Enable GZip for Web UI > -- > > Key: SPARK-8691 > URL: https://issues.apache.org/jira/browse/SPARK-8691 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Shixiong Zhu >Assignee: Apache Spark > > When the stage page contains a massive number of tasks (for example, when > running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. > Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
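The usual way to do this with an embedded Jetty server is to wrap the existing handler in a gzip wrapper, so responses are compressed whenever the client sends Accept-Encoding: gzip. This is a sketch only, not the actual pull request; GzipHandler's package differs across Jetty versions, and the import below assumes Jetty 9.3+:
{code}
import org.eclipse.jetty.server.Handler
import org.eclipse.jetty.server.handler.gzip.GzipHandler

// Wrap any handler so its responses are gzip-compressed on demand.
def withGzip(inner: Handler): Handler = {
  val gzip = new GzipHandler()
  gzip.setHandler(inner)
  gzip
}
{code}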
[jira] [Commented] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604733#comment-14604733 ] Apache Spark commented on SPARK-8691: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7072 > Enable GZip for Web UI > -- > > Key: SPARK-8691 > URL: https://issues.apache.org/jira/browse/SPARK-8691 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Shixiong Zhu > > When the stage page contains a massive number of tasks (for example, when > running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. > Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8691: --- Assignee: (was: Apache Spark) > Enable GZip for Web UI > -- > > Key: SPARK-8691 > URL: https://issues.apache.org/jira/browse/SPARK-8691 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Shixiong Zhu > > When the stage page contains a massive number of tasks (for example, when > running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. > Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8686) DataFrame should support `where` with expression represented by String
[ https://issues.apache.org/jira/browse/SPARK-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8686. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7063 [https://github.com/apache/spark/pull/7063] > DataFrame should support `where` with expression represented by String > -- > > Key: SPARK-8686 > URL: https://issues.apache.org/jira/browse/SPARK-8686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta >Priority: Minor > Fix For: 1.5.0 > > > DataFrame supports the `filter` function with two types of argument, `Column` > and `String`, but `where` doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
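With this change, the two calls below behave identically; previously only the first form accepted a String (illustrative, assuming a DataFrame {{people}} with an {{age}} column):
{code}
val adults1 = people.filter("age > 21")  // already supported
val adults2 = people.where("age > 21")   // supported after this change
{code}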
[jira] [Created] (SPARK-8691) Enable GZip for Web UI
Shixiong Zhu created SPARK-8691: --- Summary: Enable GZip for Web UI Key: SPARK-8691 URL: https://issues.apache.org/jira/browse/SPARK-8691 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu When the stage page contains a massive number of tasks (for example, when running {{sc.parallelize(1 to 10, 1).count()}}), the page becomes very large. Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604724#comment-14604724 ] Davies Liu commented on SPARK-8636: --- [~animeshbaranawal] What happens if there is a null in the grouping key? Is a row with a null key equal to another row with a null key? > CaseKeyWhen has incorrect NULL handling > --- > > Key: SPARK-8636 > URL: https://issues.apache.org/jira/browse/SPARK-8636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Santiago M. Mola > Labels: starter > > The CaseKeyWhen implementation in Spark uses the following equality check: > {code} > private def equalNullSafe(l: Any, r: Any) = { > if (l == null && r == null) { > true > } else if (l == null || r == null) { > false > } else { > l == r > } > } > {code} > This is not correct, since in SQL, NULL is never equal to NULL (though it is > not unequal either). Consequently, a NULL value in a CASE WHEN expression > should never match. > For example, you can execute this in MySQL: > {code} > SELECT CASE NULL WHEN NULL THEN "NULL MATCHES" ELSE "NULL DOES NOT MATCH" END > FROM DUAL; > {code} > And the result will be "NULL DOES NOT MATCH". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
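Under SQL's three-valued logic, a comparison involving NULL never matches a WHEN branch, so the helper should reduce to something like the following sketch (illustrative, not the committed patch):
{code}
// A NULL key or a NULL WHEN value never matches any branch.
private def equalSqlSemantics(l: Any, r: Any): Boolean = {
  if (l == null || r == null) false else l == r
}
{code}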
[jira] [Assigned] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2017: --- Assignee: (was: Apache Spark) > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
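Alternative 1 above amounts to rendering one slice of the task table per request instead of the whole table. A toy sketch of the slicing (TaskRow is a stand-in type, not a Spark class):
{code}
case class TaskRow(id: Long, status: String)

// Return only the rows for the requested page.
def taskPage(tasks: Seq[TaskRow], page: Int, pageSize: Int = 100): Seq[TaskRow] =
  tasks.slice(page * pageSize, (page + 1) * pageSize)
{code}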
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604723#comment-14604723 ] Apache Spark commented on SPARK-2017: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7071 > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2017: --- Assignee: Apache Spark > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8610) Separate Row and InternalRow (part 2)
[ https://issues.apache.org/jira/browse/SPARK-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8610. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7003 [https://github.com/apache/spark/pull/7003] > Separate Row and InternalRow (part 2) > - > > Key: SPARK-8610 > URL: https://issues.apache.org/jira/browse/SPARK-8610 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.0 > > > Currently, we use GenericRow for both Row and InternalRow, which is confusing > because it can contain Scala types as well as Catalyst types. > We should have different implementations for them, to avoid some potential > bugs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
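Schematically, the separation means external rows carry Java/Scala values while internal rows carry Catalyst's own representations. The traits below illustrate the idea only; they are not the actual class hierarchy:
{code}
// External API row: holds e.g. java.sql.Timestamp, String.
trait Row { def get(i: Int): Any }

// Catalyst-internal row: holds e.g. Long (microseconds), UTF8String.
trait InternalRow { def get(i: Int): Any }
{code}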
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604642#comment-14604642 ] Vincent Warmerdam commented on SPARK-8684: -- I have an initial commit that I would like to try out that does the yum approach: https://github.com/koaning/spark-ec2/blob/rstudio-install/create_image.sh#L22-23 [~shivaram] what's the easiest way to test this? > Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1 -- however, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604600#comment-14604600 ] Vincent Warmerdam edited comment on SPARK-8684 at 6/28/15 11:26 AM: I'll double check. I've done this before but I recall it was always a bit messy. Do we prefer using yum or is installing from source also a possibility? What centos version will we use? I just worked up a quick server from digital ocean with centos 7 and this seems to work just fine:
```
[root@servy-server ~]# yum install -y epel-release
[root@servy-server ~]# yum update -y
[root@servy-server ~]# yum install -y R
[root@servy-server ~]# R

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
>
```
The downside of YUM is that it is not always up to date (the latest version is 3.2.1, not 3.2.0 which is what yum gives us). The yum version of R should allow almost all Rstudio packages to just go and work out of the box though, so it might not be the biggest issue. Depending on which version of CentOS we use, getting epel might be a problem. In the past this made it more practical to just go and install R from source. This is not too terrible, it can be done this way:
```
[root@servy-server ~]# wget http://cran.rstudio.com/src/base/R-3/R-3.2.1.tar.gz
[root@servy-server ~]# tar xvf R-3.2.1.tar.gz
[root@servy-server ~]# cd R-3.2.1
[root@servy-server ~]# ./configure --prefix=$HOME/R-3.2 --with-readline=no --with-x=no
[root@servy-server ~]# make && make install
```
The main downside is that this will take a fair amount of time. Another thing we might need to keep in mind is that R has many C++ dependencies so we may also need to install up to date compilers for it. was (Author: cantdutchthis): I'll double check. I've done this before but I recall it was always a bit messy. Do we prefer using yum or is installing from source also a possbility? What centos version will we use? I just worked up a quick server from digital ocean with centos 7 and this seems to work just fine: ``` [root@servy-server ~]# yum install -y epel-release [root@servy-server ~]# yum update -y [root@servy-server ~]# yum install -y R [root@servy-server ~]# R R version 3.2.0 (2015-04-16) -- "Full of Ingredients" Copyright (C) 2015 The R Foundation for Statistical Computing Platform: x86_64-redhat-linux-gnu (64-bit) > ``` The downside of YUM is that it is not always up to date (the latest version is 3.2.1). This version of R should allow almost all Rstudio packages to just go and work out of the box. Depending of which version of CentOS we use, getting epel might be a problem. In the past this made it more practical to just go and install R from source. This is not too terrible, it can be done this way: ``` [root@servy-server ~]# wget http://cran.rstudio.com/src/base/R-3/R-3.2.1.tar.gz [root@servy-server ~]# tar xvf R-3.2.1.tar.gz [root@servy-server ~]# cd R-3.2.1 [root@servy-server ~]# ./configure --prefix=$HOME/R-3.2 --with-readline=no --with-x=no [root@servy-server ~]# make && make install ``` The main downside is that this will take a fair amount of time. 
> Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1 -- however, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604633#comment-14604633 ] Juan Rodríguez Hortalá commented on SPARK-8337: ---

Hi, I have worked a bit on the OffsetRange approach; you can access the code at https://github.com/juanrh/spark/commit/56fbd5c38bd30b825a7818f1c56abb1f8b2beaff. I have added the following method to pyspark's KafkaUtils:

{code}
@staticmethod
def getOffsetRanges(rdd):
    scalaRdd = rdd._jrdd.rdd()
    offsetRangesArray = scalaRdd.offsetRanges()
    return [OffsetRange(topic=offsetRange.topic(),
                        partition=offsetRange.partition(),
                        fromOffset=offsetRange.fromOffset(),
                        untilOffset=offsetRange.untilOffset())
            for offsetRange in offsetRangesArray]
{code}

This method is used in KafkaUtils.createDirectStreamJB, which is based on the original KafkaUtilsPythonHelper.createDirectStream. The main problem I have is that I don't know where to store the OffsetRange objects: the naive trick of adding them to the __dict__ of each Python RDD object doesn't work, because the new field is lost in the pyspark wrappers. So the new method createDirectStreamJB takes two additional options, one for performing an action on the OffsetRange list, and another for adding it to each record of the DStream:

{code}
def createDirectStreamJB(ssc, topics, kafkaParams, fromOffsets={},
                         keyDecoder=utf8_decoder, valueDecoder=utf8_decoder,
                         offsetRangeForeach=None, addOffsetRange=False):
    """
    FIXME: temporary working placeholder

    :param offsetRangeForeach: if not None, this should be a function from a
        list of OffsetRange to None; it is applied to the OffsetRange list of
        each rdd
    :param addOffsetRange: if False (default), output records have the shape
        (kafkaKey, kafkaValue); if True, output records have the shape
        (offsetRange, (kafkaKey, kafkaValue)), where offsetRange is the
        OffsetRange value for the Spark partition of the record
    """
{code}

This is an example of using createDirectStreamJB:

{code}
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 1)
topics = ["test"]
kafkaParams = {"metadata.broker.list": "localhost:9092"}

def offsetRangeForeach(offsetRangeList):
    print
    print
    for offsetRange in offsetRangeList:
        print offsetRange
    print
    print

kafkaStream = KafkaUtils.createDirectStreamJB(ssc, topics, kafkaParams,
                                              offsetRangeForeach=offsetRangeForeach,
                                              addOffsetRange=True)
# OffsetRange printed as , I guess due to some kind of pyspark proxy
kafkaStrStream = kafkaStream.map(lambda (offRan, (k, v)):
    str(offRan._fromOffset) + " " + str(offRan._untilOffset) + " "
    + str(k) + " " + str(v))
# kafkaStream.pprint()
kafkaStrStream.pprint()
ssc.start()
ssc.awaitTermination(timeout=5)
{code}

which gets the following output:

{code}
15/06/28 12:36:03 INFO InputInfoTracker: remove old batch metadata: 1435487761000 ms

OffsetRange(topic=test, partition=0, fromOffset=178, untilOffset=179)

15/06/28 12:36:04 INFO JobScheduler: Added jobs for time 1435487764000 ms
...
15/06/28 12:36:04 INFO DAGScheduler: Job 4 finished: runJob at PythonRDD.scala:366, took 0,075387 s
---
Time: 2015-06-28 12:36:04
---
178 179 None hola
()

15/06/28 12:36:04 INFO JobScheduler: Finished job streaming job 1435487764000 ms.0 from job set of time 1435487764000 ms
...
15/06/28 12:36:05 INFO BlockManager: Removing RDD 12

OffsetRange(topic=test, partition=0, fromOffset=179, untilOffset=180)

15/06/28 12:36:06 INFO JobScheduler: Starting job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
15/06/28 12:36:06 INFO DAGScheduler: Job 6 finished: start at NativeMethodAccessorImpl.java:-2, took 0,077993 s
---
Time: 2015-06-28 12:36:06
---
179 180 None caracola
()

15/06/28 12:36:06 INFO JobScheduler: Finished job streaming job 1435487766000 ms.0 from job set of time 1435487766000 ms
{code}

Any thoughts on this will be appreciated, in particular about a suitable place to store the list of OffsetRange objects. Greetings, Juan

> KafkaUtils.createDirectStream for python is lacking API/feature parity with > the Scala/Java version > -- > > Key: SPARK-8337 > URL: https://issues.apache.org/jira/browse/SPARK-8337 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Amit Ramesh >Priority: Critical > > See the following thread for context. > htt
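As a hedged sketch of consuming the (offsetRange, (key, value)) record shape that the proposal produces with addOffsetRange=True: createDirectStreamJB and the OffsetRange fields come from Juan's patch linked above and are not released pyspark API, and ssc is the StreamingContext from the example. A named function also avoids the Python-2-only tuple-unpacking lambda used in the example.

{code}
from pyspark.streaming.kafka import KafkaUtils

def formatRecord(record):
    # Record shape per the proposed addOffsetRange=True: (offsetRange, (key, value)).
    offsetRange, (key, value) = record
    return "%s %s %s %s" % (offsetRange._fromOffset, offsetRange._untilOffset,
                            key, value)

# ssc is the StreamingContext built in the example above.
kafkaStream = KafkaUtils.createDirectStreamJB(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"},
    addOffsetRange=True)
kafkaStream.map(formatRecord).pprint()
{code}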
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604632#comment-14604632 ] Littlestar commented on SPARK-5281: --- Has this patch been merged to the Spark 1.3 branch? Thanks. > Registering table on RDD is giving MissingRequirementError > -- > > Key: SPARK-5281 > URL: https://issues.apache.org/jira/browse/SPARK-5281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.1 >Reporter: sarsol >Assignee: Iulian Dragos >Priority: Critical > Fix For: 1.4.0 > > > Application crashes on this line {{rdd.registerTempTable("temp")}} in version 1.2 > when using sbt or Eclipse SCALA IDE
> Stacktrace:
> {code}
> Exception in thread "main" scala.reflect.internal.MissingRequirementError:
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with
> primordial classloader with boot classpath
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
> at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
> at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
> at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
> at scala.reflect.api.Universe.typeOf(Universe.scala:59)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
> at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
> at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
> at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
> at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
> at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.App$$anonfun$main$1.apply(App.scala:71)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.App$class.main(App.scala:71)
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604631#comment-14604631 ] Apache Spark commented on SPARK-8690: - User 'thegiive' has created a pull request for this issue: https://github.com/apache/spark/pull/7070 > Add a setting to disable SparkSQL parquet schema merge by using datasource > API > --- > > Key: SPARK-8690 > URL: https://issues.apache.org/jira/browse/SPARK-8690 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 > Environment: all >Reporter: thegiive >Priority: Minor > > We need a general config to disable the parquet schema merge feature. > Our sparkSQL application requirements are: > # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't > want to increase parquet read time too much. Across around 2000 parquet files, the > schema is the same, so we don't need the schema merge feature. > # We need to use the datasource API's features like partition discovery, so we > cannot use Spark 1.2 or a previous version. > # We have a lot of SparkSQL products. We use > *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to > change the application code. One setting to disable this feature is what we > want. > In 1.4, we have several methods, but none of them perfectly matches our use > case: > # Setting spark.sql.parquet.useDataSourceApi to false matches requirements > 1 and 3, but it uses the old parquet API and fails requirement 2. > # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> > "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we > use for parquet loading. > # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default > parquet code path increases the load time from 1~5 sec to 100 sec, which > fails requirement 1. > # The PR 5231 config does not help, because it cannot disable schema merge. > I think it is better to use a config to disable the datasource API's schema merge > feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8690: --- Assignee: (was: Apache Spark) > Add a setting to disable SparkSQL parquet schema merge by using datasource > API > --- > > Key: SPARK-8690 > URL: https://issues.apache.org/jira/browse/SPARK-8690 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 > Environment: all >Reporter: thegiive >Priority: Minor > > We need a general config to disable the parquet schema merge feature. > Our sparkSQL application requirements are: > # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't > want to increase parquet read time too much. Across around 2000 parquet files, the > schema is the same, so we don't need the schema merge feature. > # We need to use the datasource API's features like partition discovery, so we > cannot use Spark 1.2 or a previous version. > # We have a lot of SparkSQL products. We use > *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to > change the application code. One setting to disable this feature is what we > want. > In 1.4, we have several methods, but none of them perfectly matches our use > case: > # Setting spark.sql.parquet.useDataSourceApi to false matches requirements > 1 and 3, but it uses the old parquet API and fails requirement 2. > # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> > "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we > use for parquet loading. > # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default > parquet code path increases the load time from 1~5 sec to 100 sec, which > fails requirement 1. > # The PR 5231 config does not help, because it cannot disable schema merge. > I think it is better to use a config to disable the datasource API's schema merge > feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8690: --- Assignee: Apache Spark > Add a setting to disable SparkSQL parquet schema merge by using datasource > API > --- > > Key: SPARK-8690 > URL: https://issues.apache.org/jira/browse/SPARK-8690 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 > Environment: all >Reporter: thegiive >Assignee: Apache Spark >Priority: Minor > > We need a general config to disable the parquet schema merge feature. > Our sparkSQL application requirements are: > # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't > want to increase parquet read time too much. Across around 2000 parquet files, the > schema is the same, so we don't need the schema merge feature. > # We need to use the datasource API's features like partition discovery, so we > cannot use Spark 1.2 or a previous version. > # We have a lot of SparkSQL products. We use > *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to > change the application code. One setting to disable this feature is what we > want. > In 1.4, we have several methods, but none of them perfectly matches our use > case: > # Setting spark.sql.parquet.useDataSourceApi to false matches requirements > 1 and 3, but it uses the old parquet API and fails requirement 2. > # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> > "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we > use for parquet loading. > # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default > parquet code path increases the load time from 1~5 sec to 100 sec, which > fails requirement 1. > # The PR 5231 config does not help, because it cannot disable schema merge. > I think it is better to use a config to disable the datasource API's schema merge > feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
thegiive created SPARK-8690: --- Summary: Add a setting to disable SparkSQL parquet schema merge by using datasource API Key: SPARK-8690 URL: https://issues.apache.org/jira/browse/SPARK-8690 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Environment: all Reporter: thegiive Priority: Minor We need a general config to disable the parquet schema merge feature. Our sparkSQL application requirements are: # In Spark 1.1 and 1.2, sparkSQL parquet read time is around 1~5 sec. We don't want to increase parquet read time too much. Across around 2000 parquet files, the schema is the same, so we don't need the schema merge feature. # We need to use the datasource API's features like partition discovery, so we cannot use Spark 1.2 or a previous version. # We have a lot of SparkSQL products. We use *sqlContext.parquetFile(filename)* to read the parquet files. We don't want to change the application code. One setting to disable this feature is what we want. In 1.4, we have several methods, but none of them perfectly matches our use case: # Setting spark.sql.parquet.useDataSourceApi to false matches requirements 1 and 3, but it uses the old parquet API and fails requirement 2. # Using sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> "false" )) meets requirements 1 and 2, but it requires changing a lot of the code we use for parquet loading. # Spark 1.4 improves schema merge a lot over 1.3, but directly using the default parquet code path increases the load time from 1~5 sec to 100 sec, which fails requirement 1. # The PR 5231 config does not help, because it cannot disable schema merge. I think it is better to use a config to disable the datasource API's schema merge feature. A PR will be provided later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
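For readers following the discussion, the second workaround listed in the description can also be written against Spark 1.4's Python API. This is a minimal sketch, assuming the parquet source honors the mergeSchema option as the reporter states; hdfs:///data/events is a hypothetical path.

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-no-merge")
sqlContext = SQLContext(sc)

# Python counterpart of sqlContext.load("parquet", Map("path" -> ..., "mergeSchema" -> "false"));
# extra keyword arguments are passed through to the data source as options.
df = sqlContext.load(path="hdfs:///data/events",
                     source="parquet",
                     mergeSchema="false")
df.printSchema()
{code}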
[jira] [Updated] (SPARK-8679) remove dummy java class BaseRow and BaseMutableRow
[ https://issues.apache.org/jira/browse/SPARK-8679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8679: - Priority: Minor (was: Major) Component/s: SQL Issue Type: Improvement (was: Bug) [~cloud_fan] Let's assign component, priority and type please > remove dummy java class BaseRow and BaseMutableRow > -- > > Key: SPARK-8679 > URL: https://issues.apache.org/jira/browse/SPARK-8679 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8680: - Component/s: SQL > PropagateTypes is very slow when there are lots of columns > -- > > Key: SPARK-8680 > URL: https://issues.apache.org/jira/browse/SPARK-8680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.0 >Reporter: Davies Liu > > The time for PropagateTypes is O(N*N), where N is the number of columns, which is > very slow if there are many columns (>1000). > The easiest optimization would be to move `q.inputSet` outside of > transformExpressions, which gives about a 4x improvement for N=3000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
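The fix proposed in the description is the classic hoisting of a loop-invariant computation out of a per-expression pass. A minimal, hedged Python sketch of that idea follows; all names here are hypothetical stand-ins, not Catalyst code.

{code}
# Stand-in for q.inputSet: O(N) in the number of columns.
def compute_input_set(columns):
    return set(columns)

def propagate_types_slow(columns):
    out = []
    for c in columns:                            # N iterations
        input_set = compute_input_set(columns)   # O(N) each time -> O(N^2) total
        out.append(c in input_set)
    return out

def propagate_types_fast(columns):
    input_set = compute_input_set(columns)       # hoisted: computed once, O(N)
    return [c in input_set for c in columns]

cols = ["c%d" % i for i in range(3000)]
assert propagate_types_slow(cols) == propagate_types_fast(cols)
{code}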
[jira] [Updated] (SPARK-8645) Incorrect expression analysis with Hive
[ https://issues.apache.org/jira/browse/SPARK-8645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8645: - Component/s: SQL Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Set component, for example > Incorrect expression analysis with Hive > --- > > Key: SPARK-8645 > URL: https://issues.apache.org/jira/browse/SPARK-8645 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: CDH 5.4.2 1.3.0 >Reporter: Eugene Zhulenev > Labels: dataframe > > When using a DataFrame backed by a Hive table, groupBy with agg can't resolve > columns if I pass them as Strings and not Columns. > This fails with: org.apache.spark.sql.AnalysisException: expression 'dt' is > neither present in the group by, nor is it an aggregate function.
> {code}
> val grouped = eventLogHLL
>   .groupBy(dt, ad_id, site_id).agg(
>     dt,
>     ad_id,
>     col(site_id) as site_id,
>     sum(imp_count) as imp_count,
>     sum(click_count) as click_count
>   )
> {code}
> This works fine:
> {code}
> val grouped = eventLogHLL
>   .groupBy(col(dt), col(ad_id), col(site_id)).agg(
>     col(dt) as dt,
>     col(ad_id) as ad_id,
>     col(site_id) as site_id,
>     sum(imp_count) as imp_count,
>     sum(click_count) as click_count
>   )
> {code}
> Integration tests running with "embedded" Spark and DataFrames generated from > an RDD work fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8635) improve performance of CatalystTypeConverters
[ https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8635: - Assignee: Wenchen Fan > improve performance of CatalystTypeConverters > - > > Key: SPARK-8635 > URL: https://issues.apache.org/jira/browse/SPARK-8635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8237) misc function: sha2
[ https://issues.apache.org/jira/browse/SPARK-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8237: - Assignee: Liang-Chi Hsieh > misc function: sha2 > --- > > Key: SPARK-8237 > URL: https://issues.apache.org/jira/browse/SPARK-8237 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Liang-Chi Hsieh > Fix For: 1.5.0 > > > sha2(string/binary, int): string > Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and > SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be > hashed. The second argument indicates the desired bit length of the result, > which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to > 256). SHA-224 is supported starting from Java 8. If either argument is NULL > or the hash length is not one of the permitted values, the return value is > NULL. Example: sha2('ABC', 256) = > 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
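Since the description above fixes exact semantics, here is a hedged reference model in plain Python using hashlib. It mirrors the spec (including 0 meaning 256 and NULL for unsupported lengths) and reproduces the quoted example, but it is of course not Spark's implementation.

{code}
import hashlib

# Reference model of the sha2 spec above; returns None where SQL returns NULL.
_DIGESTS = {224: hashlib.sha224, 256: hashlib.sha256,
            384: hashlib.sha384, 512: hashlib.sha512}

def sha2(data, bit_length):
    if data is None or bit_length is None:
        return None
    fn = _DIGESTS.get(256 if bit_length == 0 else bit_length)
    return fn(data).hexdigest() if fn else None

print(sha2(b'ABC', 256))
# b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
{code}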
[jira] [Updated] (SPARK-7088) [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans
[ https://issues.apache.org/jira/browse/SPARK-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7088: - Assignee: Santiago M. Mola > [REGRESSION] Spark 1.3.1 breaks analysis of third-party logical plans > - > > Key: SPARK-7088 > URL: https://issues.apache.org/jira/browse/SPARK-7088 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Santiago M. Mola >Assignee: Santiago M. Mola >Priority: Critical > Labels: regression > Fix For: 1.5.0 > > > We're using some custom logical plans. We are now migrating from Spark 1.3.0 > to 1.3.1 and found a few incompatible API changes. All of them seem to be in > internal code, so we understand that. But now the ResolveReferences rule, > which used to work with third-party logical plans, just does not work, with > no workaround that I'm aware of other than copying the ResolveReferences > rule and using it with our own fix. > The change in question is this section of code:
> {code}
> }.headOption.getOrElse { // Only handle first case, others will be fixed on the next pass.
>   sys.error(
>     s"""
>       |Failure when resolving conflicting references in Join:
>       |$plan
>       |
>       |Conflicting attributes: ${conflictingAttributes.mkString(",")}
>     """.stripMargin)
> }
> {code}
> Which causes the following error on analysis:
> {code}
> Failure when resolving conflicting references in Join:
> 'Project ['l.name,'r.name,'FUNC1('l.node,'r.node) AS c2#37,'FUNC2('l.node,'r.node) AS c3#38,'FUNC3('r.node,'l.node) AS c4#39]
> 'Join Inner, None
>  Subquery l
>   Subquery h
>    Project [name#12,node#36]
>     CustomPlan H, u, (p#13L = s#14L), [ord#15 ASC], IS NULL p#13L, node#36
>      Subquery v
>       Subquery h_src
>        LogicalRDD [name#12,p#13L,s#14L,ord#15], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37
>  Subquery r
>   Subquery h
>    Project [name#40,node#36]
>     CustomPlan H, u, (p#41L = s#42L), [ord#43 ASC], IS NULL pred#41L, node#36
>      Subquery v
>       Subquery h_src
>        LogicalRDD [name#40,p#41L,s#42L,ord#43], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37
> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6777) Implement backwards-compatibility rules in Parquet schema converters
[ https://issues.apache.org/jira/browse/SPARK-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6777: - Assignee: Cheng Lian > Implement backwards-compatibility rules in Parquet schema converters > > > Key: SPARK-6777 > URL: https://issues.apache.org/jira/browse/SPARK-6777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.5.0 > > > When converting Parquet schemas to/from Spark SQL schemas, we should > recognize commonly used legacy non-standard representation of complex types. > We can follow the pattern used in Parquet's {{AvroSchemaConverter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6777) Implement backwards-compatibility rules in Parquet schema converters
[ https://issues.apache.org/jira/browse/SPARK-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604619#comment-14604619 ] Sean Owen commented on SPARK-6777: -- [~lian cheng] can you please assign JIRAs when you resolve them, even if it's to yourself? That's a necessary final step, and a lot of yours miss that > Implement backwards-compatibility rules in Parquet schema converters > > > Key: SPARK-6777 > URL: https://issues.apache.org/jira/browse/SPARK-6777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0 >Reporter: Cheng Lian >Priority: Critical > Fix For: 1.5.0 > > > When converting Parquet schemas to/from Spark SQL schemas, we should > recognize commonly used legacy non-standard representation of complex types. > We can follow the pattern used in Parquet's {{AvroSchemaConverter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5482) Allow individual test suites in python/run-tests
[ https://issues.apache.org/jira/browse/SPARK-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5482: - Assignee: Josh Rosen > Allow individual test suites in python/run-tests > > > Key: SPARK-5482 > URL: https://issues.apache.org/jira/browse/SPARK-5482 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Katsunori Kanda >Assignee: Josh Rosen >Priority: Minor > Fix For: 1.5.0 > > > Add options to run individual test suites in python/run-tests. The usage is > as follows: > ./python/run-tests \[core|sql|mllib|ml|streaming\] > When no suite is selected, all test suites are run for backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8603) In Windows,Not able to create a Spark context from R studio
[ https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8603: - Target Version/s: (was: 1.4.0) Fix Version/s: (was: 1.4.0) [~Prakashpc] Before opening a JIRA, you need to read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You do not set Fix or Target version, for example > In Windows, not able to create a Spark context from RStudio > > > Key: SPARK-8603 > URL: https://issues.apache.org/jira/browse/SPARK-8603 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 1.4.0 > Environment: Windows, RStudio >Reporter: Prakash Ponshankaarchinnusamy > Original Estimate: 0.5m > Remaining Estimate: 0.5m > > On Windows, creation of a Spark context fails using the code below from RStudio:
> Sys.setenv(SPARK_HOME="C:\\spark\\spark-1.4.0")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init(master="spark://localhost:7077", appName="SparkR")
> Error: JVM is not ready after 10 seconds
> Reason: Wrong file path computed in client.R. The file separator for Windows ["\"] > is not respected by the "file.path" function by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8688: --- Assignee: (was: Apache Spark) > Hadoop Configuration has to disable client cache when writing or reading > delegation tokens. > --- > > Key: SPARK-8688 > URL: https://issues.apache.org/jira/browse/SPARK-8688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: SaintBacchus > > In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, > Spark will write and read the credentials. > But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a > cached FileSystem (which will use the old token) to upload or download files. > Then, when the old token is expired, it can't get authorization to read from or > write to HDFS. > (I only tested over a very short time with the configuration: > dfs.namenode.delegation.token.renew-interval=3min > dfs.namenode.delegation.token.max-lifetime=10min > I'm not sure whether it matters. > ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604613#comment-14604613 ] Apache Spark commented on SPARK-8688: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/7069 > Hadoop Configuration has to disable client cache when writing or reading > delegation tokens. > --- > > Key: SPARK-8688 > URL: https://issues.apache.org/jira/browse/SPARK-8688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: SaintBacchus > > In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, > Spark will write and read the credentials. > But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a > cached FileSystem (which will use the old token) to upload or download files. > Then, when the old token is expired, it can't get authorization to read from or > write to HDFS. > (I only tested over a very short time with the configuration: > dfs.namenode.delegation.token.renew-interval=3min > dfs.namenode.delegation.token.max-lifetime=10min > I'm not sure whether it matters. > ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8688: --- Assignee: Apache Spark > Hadoop Configuration has to disable client cache when writing or reading > delegation tokens. > --- > > Key: SPARK-8688 > URL: https://issues.apache.org/jira/browse/SPARK-8688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: SaintBacchus >Assignee: Apache Spark > > In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, > Spark will write and read the credentials. > But if we don't set *fs.hdfs.impl.disable.cache*, Spark will use a > cached FileSystem (which will use the old token) to upload or download files. > Then, when the old token is expired, it can't get authorization to read from or > write to HDFS. > (I only tested over a very short time with the configuration: > dfs.namenode.delegation.token.renew-interval=3min > dfs.namenode.delegation.token.max-lifetime=10min > I'm not sure whether it matters. > ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
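For users hitting this before the fix lands, here is a hedged sketch of the configuration-level workaround implied by the description: Spark forwards any spark.hadoop.* property into the Hadoop Configuration, so the HDFS client cache can be disabled per application. Whether this alone resolves the token-expiry behavior is exactly what this issue is investigating; the app name below is hypothetical.

{code}
from pyspark import SparkConf, SparkContext

# spark.hadoop.* properties are copied into the Hadoop Configuration,
# so this sets fs.hdfs.impl.disable.cache=true for the application.
conf = (SparkConf()
        .setAppName("token-renewal-test")
        .set("spark.hadoop.fs.hdfs.impl.disable.cache", "true"))
sc = SparkContext(conf=conf)
{code}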
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604600#comment-14604600 ] Vincent Warmerdam edited comment on SPARK-8684 at 6/28/15 9:52 AM: ---

I'll double check. I've done this before, but I recall it was always a bit messy. Do we prefer using yum, or is installing from source also a possibility? Which CentOS version will we use? I just spun up a quick server on DigitalOcean with CentOS 7, and this seems to work just fine:

```
[root@servy-server ~]# yum install -y epel-release
[root@servy-server ~]# yum update -y
[root@servy-server ~]# yum install -y R
[root@servy-server ~]# R

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
>
```

The downside of yum is that it is not always up to date (the latest version is 3.2.1). This version of R should allow almost all RStudio packages to just work out of the box. Depending on which version of CentOS we use, getting EPEL might be a problem. In the past this made it more practical to just install R from source. This is not too terrible; it can be done this way:

```
[root@servy-server ~]# wget http://cran.rstudio.com/src/base/R-3/R-3.2.1.tar.gz
[root@servy-server ~]# tar xvf R-3.2.1.tar.gz
[root@servy-server ~]# cd R-3.2.1
[root@servy-server ~]# ./configure --prefix=$HOME/R-3.2 --with-readline=no --with-x=no
[root@servy-server ~]# make && make install
```

The main downside is that this will take a fair amount of time.

was (Author: cantdutchthis): I'll double check. I've done this before, but I recall it was always a bit messy. Do we prefer using yum, or is installing from source also a possibility?

> Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1. However, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604600#comment-14604600 ] Vincent Warmerdam commented on SPARK-8684: -- I'll double check. I've done this before, but I recall it was always a bit messy. Do we prefer using yum, or is installing from source also a possibility? > Update R version in Spark EC2 AMI > - > > Key: SPARK-8684 > URL: https://issues.apache.org/jira/browse/SPARK-8684 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Priority: Minor > > Right now the R version in the AMI is 3.1. However, a number of R libraries > need R version 3.2, and it would be good to update the R version on the AMI > when launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8596: --- Assignee: (was: Apache Spark) > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8596: --- Assignee: Apache Spark > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman >Assignee: Apache Spark > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604588#comment-14604588 ] Apache Spark commented on SPARK-8596: - User 'koaning' has created a pull request for this issue: https://github.com/apache/spark/pull/7068 > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604587#comment-14604587 ] Vincent Warmerdam edited comment on SPARK-8596 at 6/28/15 8:49 AM: --- 1. on it. just created [SPARK-8596][EC2] Added port for Rstudio 2. on it was (Author: cantdutchthis): 1. on it. > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604587#comment-14604587 ] Vincent Warmerdam commented on SPARK-8596: -- 1. on it. > Install and configure RStudio server on Spark EC2 > - > > Key: SPARK-8596 > URL: https://issues.apache.org/jira/browse/SPARK-8596 > Project: Spark > Issue Type: Improvement > Components: EC2, SparkR >Reporter: Shivaram Venkataraman > > This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8689) Add get* methods with fieldName to Row
[ https://issues.apache.org/jira/browse/SPARK-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8689: --- Assignee: (was: Apache Spark) > Add get* methods with fieldName to Row > -- > > Key: SPARK-8689 > URL: https://issues.apache.org/jira/browse/SPARK-8689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta > > `Row` has methods to get a field value by index, but no methods to get a > field value by field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8689) Add get* methods with fieldName to Row
[ https://issues.apache.org/jira/browse/SPARK-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604586#comment-14604586 ] Apache Spark commented on SPARK-8689: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/7067 > Add get* methods with fieldName to Row > -- > > Key: SPARK-8689 > URL: https://issues.apache.org/jira/browse/SPARK-8689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta > > `Row` has methods to get a field value by index, but no methods to get a > field value by field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8689) Add get* methods with fieldName to Row
[ https://issues.apache.org/jira/browse/SPARK-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8689: --- Assignee: Apache Spark > Add get* methods with fieldName to Row > -- > > Key: SPARK-8689 > URL: https://issues.apache.org/jira/browse/SPARK-8689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > `Row` has methods to get a field value by index, but no methods to get a > field value by field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
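For comparison, pyspark's Row already offers the kind of name-based access this issue proposes for the Scala/Java Row. A minimal sketch (the field names and values are hypothetical):

{code}
from pyspark.sql import Row

# Rows built with keyword arguments support attribute access by field name,
# which is the convenience SPARK-8689 asks for on the JVM side.
r = Row(name="Alice", age=11)
print(r.name)  # Alice
print(r.age)   # 11
{code}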