[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105086#comment-14105086 ] Saisai Shao commented on SPARK-3129: Hi Hari, I have some high-level questions about this: 1. In the design doc you mention that "once the RDD is generated, the RDD is checkpointed to HDFS - at which point it is fully recoverable". I'm not sure whether you checkpoint only the metadata of the RDD or also the data; I think RDD checkpointing is a little expensive for each batch if the batch duration is quite short. 2. If we keep executors alive when the driver dies, do we still need the receivers to keep receiving data from the external source? If so, I think there may be some problems: first, memory usage will keep accumulating since no data is consumed; second, when the driver comes back, how do we balance processing priority? The old data needs to be processed first, which delays processing of newly arriving data and leads to unwanted behavior if the latency becomes larger than the batch duration. 3. In some scenarios we need to combine a DStream with an RDD (e.g. join real-time data with a history log). Normally that RDD is cached in the BlockManager's memory, so I think we also need to recover that RDD's metadata, not only the streaming data, if we want to recover the processing. There are probably many other details to think about, because driver HA is quite complex. Please correct me if I have misunderstood something. Thanks a lot. Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
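For context on point 1: the existing DStream checkpointing path already persists the graph to HDFS. A minimal sketch of that mechanism, assuming only the standard StreamingContext API (the checkpoint directory, host/port and batch interval below are illustrative, not part of the design doc):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  // Illustrative checkpoint location; any HDFS path works.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointRecoverySketch")
    val ssc = new StreamingContext(conf, Seconds(2))
    // DStream graph metadata (and, for stateful DStreams, RDD data) is checkpointed here.
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart the graph is rebuilt from the checkpoint instead of calling createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}

The data-loss discussion above is about what this mechanism does not cover: received blocks that have not yet been turned into a checkpointed RDD.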
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105126#comment-14105126 ] Sandy Ryza commented on SPARK-2978: --- So I started looking into this a little more and wanted to bring up a semantics issue I came across. The proposed implementation would be to use a similar path to that used by sortByKey in each reduce task, and then wrap the Iterator over sorted records with an Iterator that groups them, i.e. wrap the Iterator[(K, V)] in an Iterator[(K, Iterator[V])]. The question is how to handle the validity of an inner V iterator with respect to the outer Iterator. The options as I see them are: 1. Calling next() or hasNext() on the outer iterator invalidates the current inner V iterator. 2. The inner V iterator must be exhausted before calling next() or hasNext() on the outer iterator. 3. On each next() call on the outer iterator, scan over all the values for that key and put them in a separate buffer. The MapReduce approach, where the outer iterator is replaced by a sequence of calls to the reduce function, is similar to (1). When the Iterators returned by groupByKey are eventually disk-backed, we'll face the same issue, so we probably want to make the semantics there consistent with whatever we decide here. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
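A rough sketch of what option (1) could look like over a key-sorted iterator; the class and member names are illustrative, not a proposed API:

{code:scala}
// Wraps a key-sorted Iterator[(K, V)] into (key, values) groups. Advancing the
// outer iterator (next() or hasNext()) skips any unconsumed values of the current
// key, i.e. it invalidates the inner iterator - option (1) above.
class GroupedIterator[K, V](sorted: Iterator[(K, V)]) extends Iterator[(K, Iterator[V])] {
  private val buffered = sorted.buffered
  private var currentKey: Option[K] = None

  private def skipCurrentGroup(): Unit = currentKey.foreach { k =>
    while (buffered.hasNext && buffered.head._1 == k) buffered.next()
    currentKey = None
  }

  override def hasNext: Boolean = { skipCurrentGroup(); buffered.hasNext }

  override def next(): (K, Iterator[V]) = {
    skipCurrentGroup()
    val key = buffered.head._1
    currentKey = Some(key)
    val values = new Iterator[V] {
      def hasNext: Boolean = buffered.hasNext && buffered.head._1 == key
      def next(): V = buffered.next()._2
    }
    (key, values)
  }
}
{code}

Options (2) and (3) would instead require the caller to drain the inner iterator, or eagerly copy each group into a buffer, which is roughly what groupByKey does today.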
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105128#comment-14105128 ] Sandy Ryza commented on SPARK-2978: --- [~jerryshao], if I understand correctly, ShuffleRDD already supports what's needed here, and satisfying that need is independent of whether we sort on the map side. That said, I think the changes you proposed on SPARK-2926 could definitely make this more performant, and we would likely see the same improvements you benchmarked for sortByKey. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105221#comment-14105221 ] Hanwei Jin commented on SPARK-2096: --- I think I have almost solved the issue. I have passed the (previously ignored) test case in JsonSuite, "Complex field and type inferring (Ignored)", with a small modification. Modified test part: checkAnswer( sql("select arrayOfStruct.field1, arrayOfStruct.field2 from jsonTable"), (Seq(true, false, null), Seq("str1", null, null)) :: Nil ) However, another open question is that repeated nested structures are still a problem, like arrayOfStruct.field1.arrayOfStruct.field1 or arrayOfStruct[0].field1.arrayOfStruct[0].field1. I plan to ignore that for now and try to add support for select arrayOfStruct.field1, arrayOfStruct.field2 from jsonTable where arrayOfStruct.field1==true. Besides, my friend anyweil (Wei Li) solved the arrayOfStruct.field1 case and its filter part (i.e. WHERE parsing). I am new here but will continue working on Spark :) Correctly parse dot notations for accessing an array of structs --- Key: SPARK-2096 URL: https://issues.apache.org/jira/browse/SPARK-2096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Priority: Minor Labels: starter For example, arrayOfStruct is an array of structs and every element of this array has a field called field1. arrayOfStruct[0].field1 means to access the value of field1 for the first element of arrayOfStruct, but the SQL parser (in sql-core) treats field1 as an alias. Also, arrayOfStruct.field1 means to access all values of field1 in this array of structs and then returns those values as an array. But, the SQL parser cannot resolve it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
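For readers without the test fixture handy, a self-contained sketch of the intended semantics; it assumes the Spark 1.1-era jsonRDD/registerTempTable API and makes up a one-record table shaped like the one in JsonSuite:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DotNotationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DotNotationSketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)

    // One record whose arrayOfStruct elements each have an optional field1/field2.
    val json = sc.parallelize(Seq(
      """{"arrayOfStruct": [{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}]}"""))
    sqlContext.jsonRDD(json).registerTempTable("jsonTable")

    // arrayOfStruct[0].field1 should yield the scalar true; arrayOfStruct.field1 should
    // yield all field1 values as an array: [true, false, null].
    sqlContext.sql("SELECT arrayOfStruct.field1, arrayOfStruct.field2 FROM jsonTable")
      .collect().foreach(println)
  }
}
{code}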
[jira] [Commented] (SPARK-2988) Port repl to scala 2.11.
[ https://issues.apache.org/jira/browse/SPARK-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105238#comment-14105238 ] Apache Spark commented on SPARK-2988: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/2079 Port repl to scala 2.11. Key: SPARK-2988 URL: https://issues.apache.org/jira/browse/SPARK-2988 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta reopened SPARK-2963: --- The description about building to use HiveServer and CLI is incomplete -- Key: SPARK-2963 URL: https://issues.apache.org/jira/browse/SPARK-2963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Fix For: 1.1.0 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use the -Phive-thriftserver option when building, but its description is incomplete. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2963) The description about how to build for using CLI and Thrift JDBC server is absent in proper document
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2963: -- Summary: The description about how to build for using CLI and Thrift JDBC server is absent in proper document (was: The description about building to use HiveServer and CLI is incomplete) The description about how to build for using CLI and Thrift JDBC server is absent in proper document - Key: SPARK-2963 URL: https://issues.apache.org/jira/browse/SPARK-2963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Fix For: 1.1.0 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use the -Phive-thriftserver option when building, but its description is incomplete. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2963) The description about how to build for using CLI and Thrift JDBC server is absent in proper document
[ https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105273#comment-14105273 ] Apache Spark commented on SPARK-2963: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2080 The description about how to build for using CLI and Thrift JDBC server is absent in proper document - Key: SPARK-2963 URL: https://issues.apache.org/jira/browse/SPARK-2963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Fix For: 1.1.0 Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use the -Phive-thriftserver option when building, but its description is incomplete. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3169) make-distribution.sh failed
[ https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105277#comment-14105277 ] Sean Owen commented on SPARK-3169: -- Same as https://issues.apache.org/jira/browse/SPARK-2798 ? it's resolving similar problems in the Flume build. make-distribution.sh failed --- Key: SPARK-3169 URL: https://issues.apache.org/jira/browse/SPARK-3169 Project: Spark Issue Type: Bug Components: Build Reporter: Guoqiang Li Priority: Blocker {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive -Dhadoop.version=2.3.0 {code} = {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in TestSuiteBase.class refers to term dstream in package org.apache.spark.streaming which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling TestSuiteBase.class. {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105303#comment-14105303 ] Sean Owen commented on SPARK-1449: -- [~pwendell] can you or someone else on the PMC zap this one? should be straightforward. Please delete old releases from mirroring system Key: SPARK-1449 URL: https://issues.apache.org/jira/browse/SPARK-1449 Project: Spark Issue Type: Task Affects Versions: 0.8.0, 0.8.1, 0.9.0, 0.9.1 Reporter: Sebb To reduce the load on the ASF mirrors, projects are required to delete old releases [1] Please can you remove all non-current releases? Thanks! [Note that older releases are always available from the ASF archive server] Any links to older releases on download pages should first be adjusted to point to the archive server. [1] http://www.apache.org/dev/release.html#when-to-archive -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105329#comment-14105329 ] Hanwei Jin commented on SPARK-2096: --- I looked into the case where arrayOfStruct.field1==true is used as a filter; supporting it would require modifying every kind of comparison expression, and I don't think it makes sense to add that, so I am dropping it. Correctly parse dot notations for accessing an array of structs --- Key: SPARK-2096 URL: https://issues.apache.org/jira/browse/SPARK-2096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Priority: Minor Labels: starter For example, arrayOfStruct is an array of structs and every element of this array has a field called field1. arrayOfStruct[0].field1 means to access the value of field1 for the first element of arrayOfStruct, but the SQL parser (in sql-core) treats field1 as an alias. Also, arrayOfStruct.field1 means to access all values of field1 in this array of structs and then returns those values as an array. But, the SQL parser cannot resolve it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2291) Update EC2 scripts to use instance storage on m3 instance types
[ https://issues.apache.org/jira/browse/SPARK-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105342#comment-14105342 ] Daniel Darabos commented on SPARK-2291: --- I don't know if something has changed on Amazon's end or if I'm missing something. (I'm pretty clueless.) But we still see missing SSDs. This change fixed it for us: https://github.com/apache/spark/pull/2081/files. The block device mapping entries are necessary according to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#InstanceStore_UsageScenarios. I guess you tested PR #1156. Actually it seemed to have worked for us too for a while. But now some of the machines come up without SSDs. (/dev/sdb and /dev/sdc do not exist.) So I read the docs and tried adding the block device mappings. Seems to work. With PR #2081 all machines have the SSDs. Hope this makes sense. Update EC2 scripts to use instance storage on m3 instance types --- Key: SPARK-2291 URL: https://issues.apache.org/jira/browse/SPARK-2291 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Alessandro Andrioni [On January 21|https://aws.amazon.com/about-aws/whats-new/2014/01/21/announcing-new-amazon-ec2-m3-instance-sizes-and-lower-prices-for-amazon-s3-and-amazon-ebs/], Amazon added SSD-backed instance storages for m3 instances, and also added two new types: m3.medium and m3.large. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2096) Correctly parse dot notations for accessing an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105353#comment-14105353 ] Apache Spark commented on SPARK-2096: - User 'chuxi' has created a pull request for this issue: https://github.com/apache/spark/pull/2082 Correctly parse dot notations for accessing an array of structs --- Key: SPARK-2096 URL: https://issues.apache.org/jira/browse/SPARK-2096 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yin Huai Priority: Minor Labels: starter For example, arrayOfStruct is an array of structs and every element of this array has a field called field1. arrayOfStruct[0].field1 means to access the value of field1 for the first element of arrayOfStruct, but the SQL parser (in sql-core) treats field1 as an alias. Also, arrayOfStruct.field1 means to access all values of field1 in this array of structs and then returns those values as an array. But, the SQL parser cannot resolve it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3150) NullPointerException in Spark recovery after simultaneous fall of master and driver
[ https://issues.apache.org/jira/browse/SPARK-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tatiana Borisova updated SPARK-3150: Description: The issue happens when Spark is run standalone on a cluster. When the master and the driver fail simultaneously on one node of the cluster, the master tries to recover its state and restart the Spark driver. While restarting the driver, it fails with an NPE (stack trace is below). After failing, it restarts and tries to recover its state and restart the Spark driver again, over and over in an infinite cycle. Namely, Spark tries to read the DriverInfo state from ZooKeeper, but after reading, DriverInfo.worker happens to be null. Stack trace (on version 1.0.0, but reproducible on version 1.0.2, too): [2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448) at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) How to reproduce: kill all Spark processes when running Spark standalone on a cluster, on the cluster node where the driver runs (kill driver, master and worker simultaneously). was: The issue happens when Spark is run standalone on a cluster. When the master and the driver fail simultaneously on one node of the cluster, the master tries to recover its state and restart the Spark driver. While restarting the driver, it fails with an NPE (stack trace is below). After failing, it restarts and tries to recover its state and restart the Spark driver again, over and over in an infinite cycle. Namely, Spark tries to read the DriverInfo state from ZooKeeper, but after reading, DriverInfo.worker happens to be null.
Stack trace (on version 1.0.0, but reproducible on version 1.0.2, too): [2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448) at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) How to reproduce: kill both the master and the driver process on some cluster node when running Spark standalone on a cluster. NullPointerException in Spark recovery after simultaneous fall of master and driver --- Key: SPARK-3150
[jira] [Created] (SPARK-3172) Distinguish between shuffle spill on the map and reduce side
Sandy Ryza created SPARK-3172: - Summary: Distinguish between shuffle spill on the map and reduce side Key: SPARK-3172 URL: https://issues.apache.org/jira/browse/SPARK-3172 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3070) Kryo deserialization without using the custom registrator
[ https://issues.apache.org/jira/browse/SPARK-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105536#comment-14105536 ] Daniel Darabos commented on SPARK-3070: --- I think this is almost certainly a duplicate of https://issues.apache.org/jira/browse/SPARK-2878. Which is FIXED, thanks to Graham Dennis! Can you please check the repro against the fixed code to see if this can be closed? Thanks :). Kryo deserialization without using the custom registrator - Key: SPARK-3070 URL: https://issues.apache.org/jira/browse/SPARK-3070 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Andras Nemeth If an RDD partition is cached on executor1 and used by a task on executor2, then the partition needs to be serialized and sent over. For this particular serialization/deserialization use case, when using Kryo, it appears that the custom registrator will not be used on the deserialization side. This of course results in some totally misleading Kryo deserialization errors. The cause of this behavior seems to be that the thread running this deserialization has a classloader which does not have the jars specified in the SparkConf on its classpath. So it fails to load the Registrator with a ClassNotFoundException, but it catches the exception and happily continues without a registrator. (A bug in its own right, in my opinion.) To reproduce, have two RDDs partitioned the same way (as in with the same partitioner) but with corresponding partitions cached on different machines, then join them. See below a somewhat convoluted way to achieve this. If you run the program below on a Spark cluster with two workers, each with one core, you will be able to trigger the bug. Basically it runs two counts in parallel, which ensures that the two RDDs will be computed in parallel, and as a consequence on different executors.

{code:java}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.KryoRegistrator
import scala.actors.Actor

case class MyClass(a: Int)

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass])
  }
}

class CountActor(rdd: RDD[_]) extends Actor {
  def act() {
    println("Start count")
    println(rdd.count)
    println("Stop count")
  }
}

object KryBugExample {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setMaster(args(0))
      .setAppName("KryBugExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyKryoRegistrator")
      .setJars(Seq("target/scala-2.10/krybugexample_2.10-0.1-SNAPSHOT.jar"))
    val sc = new SparkContext(sparkConf)

    val partitioner = new HashPartitioner(1)
    val rdd1 = sc
      .parallelize((0 until 10).map(i => (i, MyClass(i))), 1)
      .partitionBy(partitioner).cache
    val rdd2 = sc
      .parallelize((0 until 10).map(i => (i, MyClass(i * 2))), 1)
      .partitionBy(partitioner).cache
    new CountActor(rdd1).start
    new CountActor(rdd2).start
    println(rdd1.join(rdd2).count)
    while (true) {}
  }
}
{code}

-- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions
[ https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105537#comment-14105537 ] William Benton commented on SPARK-2863: --- I wrote up how Hive handles type coercions in a blog post: http://chapeau.freevariable.com/2014/08/existing-system-coercion.html The short version is that strings can be coerced to doubles or decimals and (in Hive 0.13) decimals can be coerced to doubles for numeric functions. As a first pass, I propose extending the numeric function helpers to handle strings. Emulate Hive type coercion in native reimplementations of Hive functions Key: SPARK-2863 URL: https://issues.apache.org/jira/browse/SPARK-2863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Native reimplementations of Hive functions no longer have the same type-coercion behavior as they would if executed via Hive. As [Michael Armbrust points out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if {{SQRT}} is implemented natively. Spark SQL should have Hive-compatible type coercions for arguments to natively-implemented functions. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
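To make the proposed first pass concrete, a rough sketch of the coercion a numeric-function helper could apply; this only illustrates the rule described in the blog post and is not the actual Catalyst code:

{code:scala}
object HiveLikeCoercion {
  // Coerces values the way Hive lets strings and decimals flow into numeric
  // functions; returns None where Hive would yield NULL.
  def coerceToDouble(value: Any): Option[Double] = value match {
    case null                => None
    case d: Double           => Some(d)
    case dec: BigDecimal     => Some(dec.toDouble)
    case n: java.lang.Number => Some(n.doubleValue())
    case s: String           =>
      try Some(s.toDouble) catch { case _: NumberFormatException => None }
    case _                   => None
  }

  // Example: a SQRT helper that accepts the coerced forms, so SELECT SQRT('2') behaves like Hive.
  def hiveLikeSqrt(arg: Any): Option[Double] = coerceToDouble(arg).map(math.sqrt)
}
{code}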
[jira] [Created] (SPARK-3173) Timestamp support in the parser
Zdenek Farana created SPARK-3173: Summary: Timestamp support in the parser Key: SPARK-3173 URL: https://issues.apache.org/jira/browse/SPARK-3173 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: Zdenek Farana If you have a table with a TIMESTAMP column, that column can't be used in a WHERE clause properly - it is not evaluated correctly. E.g., SELECT * FROM a WHERE timestamp='2014-08-21 00:00:00.0' would return nothing even if there were a row with such a timestamp, because the literal is not interpreted as a timestamp. The workaround SELECT * FROM a WHERE timestamp=CAST('2014-08-21 00:00:00.0' AS TIMESTAMP) fails, because the parser does not allow anything but STRING as the CAST dataType expression. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3065) Add Locale setting to HiveCompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105628#comment-14105628 ] Apache Spark commented on SPARK-3065: - User 'byF' has created a pull request for this issue: https://github.com/apache/spark/pull/2084 Add Locale setting to HiveCompatibilitySuite Key: SPARK-3065 URL: https://issues.apache.org/jira/browse/SPARK-3065 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Environment: CentOS release 6.3 (Final) Reporter: luogankun Fix For: 1.0.2 Run the udf_unix_timestamp of org.apache.spark.sql.hive.execution.HiveCompatibilitySuite testcase with not America/Los_Angeles TimeZone throws: [info] - udf_unix_timestamp *** FAILED *** [info] Results do not match for udf_unix_timestamp: [info] SELECT [info] '2009 Mar 20 11:30:01 am', [info] unix_timestamp('2009 Mar 20 11:30:01 am', ' MMM dd h:mm:ss a') [info] FROM oneline [info] == Logical Plan == [info] Project [2009 Mar 20 11:30:01 am AS c_0#25,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2009 Mar 20 11:30:01 am, MMM dd h:mm:ss a) AS c_1#26L] [info]MetastoreRelation default, oneline, None [info] [info] == Optimized Logical Plan == [info] Project [2009 Mar 20 11:30:01 am AS c_0#25,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2009 Mar 20 11:30:01 am, MMM dd h:mm:ss a) AS c_1#26L] [info]MetastoreRelation default, oneline, None [info] [info] == Physical Plan == [info] Project [2009 Mar 20 11:30:01 am AS c_0#25,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2009 Mar 20 11:30:01 am, MMM dd h:mm:ss a) AS c_1#26L] [info]HiveTableScan [], (MetastoreRelation default, oneline, None), None [info] [info] Code Generation: false [info] == RDD == [info] (2) MappedRDD[37] at map at HiveContext.scala:350 [info] MapPartitionsRDD[36] at mapPartitions at basicOperators.scala:42 [info] MapPartitionsRDD[35] at mapPartitions at TableReader.scala:112 [info] MappedRDD[34] at map at TableReader.scala:240 [info] HadoopRDD[33] at HadoopRDD at TableReader.scala:230 [info] c_0c_1 [info] !== HIVE - 1 row(s) ==== CATALYST - 1 row(s) == [info] !2009 Mar 20 11:30:01 am 1237573801 2009 Mar 20 11:30:01 am NULL (HiveComparisonTest.scala:367) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
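The usual shape of such a fix (the exact placement inside HiveCompatibilitySuite is an assumption) is to pin the default Locale and TimeZone around the suite so locale- and zone-sensitive tests like udf_unix_timestamp are deterministic, and to restore them afterwards:

{code:scala}
import java.util.{Locale, TimeZone}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch of a suite that pins Locale/TimeZone for its duration and restores them afterwards.
class LocalePinnedSuite extends FunSuite with BeforeAndAfterAll {
  private var originalLocale: Locale = _
  private var originalTimeZone: TimeZone = _

  override def beforeAll() {
    originalLocale = Locale.getDefault
    originalTimeZone = TimeZone.getDefault
    // udf_unix_timestamp parses "2009 Mar 20 11:30:01 am", which depends on both settings.
    Locale.setDefault(Locale.US)
    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
  }

  override def afterAll() {
    Locale.setDefault(originalLocale)
    TimeZone.setDefault(originalTimeZone)
  }

  test("locale-sensitive parsing is stable") {
    val fmt = new java.text.SimpleDateFormat("yyyy MMM dd h:mm:ss a")
    // 2009-03-20 11:30:01 PDT corresponds to the epoch second Hive expects (1237573801).
    assert(fmt.parse("2009 Mar 20 11:30:01 am").getTime / 1000 == 1237573801L)
  }
}
{code}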
[jira] [Created] (SPARK-3174) Under YARN, add and remove executors based on load
Sandy Ryza created SPARK-3174: - Summary: Under YARN, add and remove executors based on load Key: SPARK-3174 URL: https://issues.apache.org/jira/browse/SPARK-3174 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.2 Reporter: Sandy Ryza A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors. I think it would be useful to have some heuristics that * Request more executors when many pending tasks are building up * Request more executors when RDDs can't fit in memory * Discard executors when few tasks are running / pending and there's not much in memory Bonus points: migrate blocks from executors we're about to discard to executors with free space. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
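As a back-of-the-envelope sketch of the first heuristic only: all names and the idea of computing a target from the task backlog are made up for illustration, not the proposed design:

{code:scala}
object ExecutorScalingSketch {
  // Computes a target executor count from the task backlog, clamped to [min, max].
  // The pending/running task counts would come from the scheduler; tasksPerExecutor
  // is essentially cores per executor.
  def desiredExecutors(
      pendingTasks: Int,
      runningTasks: Int,
      tasksPerExecutor: Int,
      minExecutors: Int = 1,
      maxExecutors: Int = 100): Int = {
    val needed = math.ceil((pendingTasks + runningTasks).toDouble / tasksPerExecutor).toInt
    math.min(maxExecutors, math.max(minExecutors, needed))
  }

  def main(args: Array[String]): Unit = {
    // 200 pending + 40 running tasks on 8-core executors -> ask YARN for 30 executors.
    println(desiredExecutors(pendingTasks = 200, runningTasks = 40, tasksPerExecutor = 8))
  }
}
{code}

The memory-pressure and idleness heuristics, and block migration before discarding an executor, would need additional inputs from the BlockManager.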
[jira] [Created] (SPARK-3175) Branch-1.1 SBT build failed for Yarn-Alpha
Chester created SPARK-3175: -- Summary: Branch-1.1 SBT build failed for Yarn-Alpha Key: SPARK-3175 URL: https://issues.apache.org/jira/browse/SPARK-3175 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.1 Reporter: Chester Fix For: 1.1.1 When trying to build yarn-alpha on branch-1.1 ᚛ |branch-1.1|$ sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha projects [info] Loading project definition from /Users/chester/projects/spark/project org.apache.maven.model.building.ModelBuildingException: 1 problem was encountered while building the effective model for org.apache.spark:spark-yarn-alpha_2.10:1.1.0 [FATAL] Non-resolvable parent POM: Could not find artifact org.apache.spark:yarn-parent_2.10:pom:1.1.0 in central ( http://repo.maven.apache.org/maven2) and 'parent.relativePath' points at wrong local POM @ line 20, column 11 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105928#comment-14105928 ] Hari Shreedharan commented on SPARK-3129: - [~tgraves] - Thanks for the pointers. Yes, using HDFS also allows us to use the same file with some protection to store the keys. This is something that might need some design and discussion first. I will also update the PR with the reflection code. [~jerryshao]: 1. Today RDDs already get checkpointed at the end of every job when the runJob method gets called. Nothing is changing here. The entire graph does get checkpointed today already. 2. No, this is something that will need to be taken care of. When the driver dies, blocks can no longer be batched into RDDs - which means generating blocks without the driver makes no sense. Also, when the driver comes back online, new receivers get created, which would start receiving the data. The only reason the executors are being kept around is to get the data in their memory - any processing/receiving should be killed. 3. Since it is an RDD, there is nothing that stops it from being recovered, right? It is recovered by the usual method of regenerating it. Only DStream data that has not been converted into an RDD is really lost - so getting the RDD back should not be a concern at all (of course, the cache is gone, but it can get pulled back into cache once the driver comes back up). Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3176) Implement math function 'POWER' and 'ABS' for sql
Xinyun Huang created SPARK-3176: --- Summary: Implement math function 'POWER' and 'ABS' for sql Key: SPARK-3176 URL: https://issues.apache.org/jira/browse/SPARK-3176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Add support for the mathematical function POWER and ABS within spark sql. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3177) Yarn-alpha ClientBaseSuite Unit test failed
Chester created SPARK-3177: -- Summary: Yarn-alpha ClientBaseSuite Unit test failed Key: SPARK-3177 URL: https://issues.apache.org/jira/browse/SPARK-3177 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.1 Reporter: Chester Priority: Minor Fix For: 1.1.1 The yarn-alpha ClientBaseSuite unit test fails due to a difference in the MRJobConfig API between yarn-stable and yarn-alpha: the class field MRJobConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH is a String array in yarn-alpha but a String in yarn-stable, so the following method works for yarn-stable but fails for yarn-alpha when it tries to cast the String array to a String. val knownDefMRAppCP: Seq[String] = getFieldValue[String, Seq[String]](classOf[MRJobConfig], "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH", Seq[String]())(a => a.split(",")) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
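A sketch of one way to make the lookup tolerant of both signatures; the helper below is illustrative and only assumes the snippet quoted above, not the actual ClientBase.getFieldValue implementation:

{code:scala}
import scala.util.Try

object MRAppClasspathSketch {
  // Reads a public static field reflectively and normalizes it to Seq[String],
  // accepting either the yarn-stable String form ("a,b,c") or the yarn-alpha
  // Array[String] form, falling back to an empty Seq if the field is absent.
  def defaultMRApplicationClasspath(clazz: Class[_], fieldName: String): Seq[String] =
    Try(clazz.getField(fieldName).get(null)).toOption match {
      case Some(s: String)          => s.split(",").toSeq
      case Some(arr: Array[String]) => arr.toSeq
      case _                        => Seq.empty
    }
}
{code}

ClientBaseSuite could then compare against this normalized form under both YARN profiles.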
[jira] [Commented] (SPARK-3177) Yarn-alpha ClientBaseSuite Unit test failed
[ https://issues.apache.org/jira/browse/SPARK-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105984#comment-14105984 ] Chester commented on SPARK-3177: This issue should exist on the master branch as well; it has been there for a while. Yarn-alpha ClientBaseSuite Unit test failed --- Key: SPARK-3177 URL: https://issues.apache.org/jira/browse/SPARK-3177 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.1 Reporter: Chester Priority: Minor Labels: test Fix For: 1.1.1 Original Estimate: 1h Remaining Estimate: 1h The yarn-alpha ClientBaseSuite unit test fails due to a difference in the MRJobConfig API between yarn-stable and yarn-alpha: the class field MRJobConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH is a String array in yarn-alpha but a String in yarn-stable, so the following method works for yarn-stable but fails for yarn-alpha when it tries to cast the String array to a String. val knownDefMRAppCP: Seq[String] = getFieldValue[String, Seq[String]](classOf[MRJobConfig], "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH", Seq[String]())(a => a.split(",")) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
Jon Haddad created SPARK-3178: - Summary: setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero Key: SPARK-3178 URL: https://issues.apache.org/jira/browse/SPARK-3178 Project: Spark Issue Type: Bug Environment: osx Reporter: Jon Haddad This should either default to m or just completely fail. Starting a worker with zero memory isn't very helpful. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
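A sketch of the "default to megabytes or fail fast" behaviour being suggested; this is illustrative only, not the existing Utils.memoryStringToMb implementation:

{code:scala}
object WorkerMemorySketch {
  // Parses strings like "512m", "4g", or a bare "1024" (treated as MiB) and
  // rejects anything else, instead of silently starting a worker with 0 MB.
  def memoryStringToMb(str: String): Int = {
    val s = str.trim.toLowerCase
    if (s.endsWith("g")) s.dropRight(1).toInt * 1024
    else if (s.endsWith("m")) s.dropRight(1).toInt
    else if (s.nonEmpty && s.forall(_.isDigit)) s.toInt // assume MiB rather than bytes
    else throw new IllegalArgumentException(
      s"Invalid SPARK_WORKER_MEMORY value: '$str' (expected e.g. 512m or 4g)")
  }

  def main(args: Array[String]): Unit = {
    println(memoryStringToMb("4g"))   // 4096
    println(memoryStringToMb("1024")) // 1024 MiB, instead of truncating to zero
  }
}
{code}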
[jira] [Updated] (SPARK-3176) Implement 'POWER', 'ABS' and 'LAST' for sql
[ https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinyun Huang updated SPARK-3176: Description: Add support for the mathematical functions POWER and ABS and the analytic function LAST to return a subset of the rows satisfying a query within spark sql. (was: Add support for the mathematical functions POWER and ABS within spark sql.) Summary: Implement 'POWER', 'ABS' and 'LAST' for sql (was: Implement math function 'POWER' and 'ABS' for sql) Implement 'POWER', 'ABS' and 'LAST' for sql -- Key: SPARK-3176 URL: https://issues.apache.org/jira/browse/SPARK-3176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 3h Remaining Estimate: 3h Add support for the mathematical functions POWER and ABS and the analytic function LAST to return a subset of the rows satisfying a query within spark sql. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3111) Implement the LAST analytic function for sql
[ https://issues.apache.org/jira/browse/SPARK-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106023#comment-14106023 ] Xinyun Huang commented on SPARK-3111: - Combined with 3176 Implement the LAST analytic function for sql Key: SPARK-3111 URL: https://issues.apache.org/jira/browse/SPARK-3111 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Labels: sql Fix For: 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Add support for the analytic function last to return a subset of the rows satisfying a query within spark sql. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2621) Update task InputMetrics incrementally
[ https://issues.apache.org/jira/browse/SPARK-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106118#comment-14106118 ] Apache Spark commented on SPARK-2621: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2087 Update task InputMetrics incrementally -- Key: SPARK-2621 URL: https://issues.apache.org/jira/browse/SPARK-2621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3179) Add task OutputMetrics
Sandy Ryza created SPARK-3179: - Summary: Add task OutputMetrics Key: SPARK-3179 URL: https://issues.apache.org/jira/browse/SPARK-3179 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Track the bytes that tasks write to HDFS or other output destinations. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106164#comment-14106164 ] Marcelo Vanzin commented on SPARK-1537: --- No concrete timeline at the moment. I'm just starting to look at the 2.5.0 version of ATS so I can incorporate things into my patch. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3180) Better control of security groups
Allan Douglas R. de Oliveira created SPARK-3180: --- Summary: Better control of security groups Key: SPARK-3180 URL: https://issues.apache.org/jira/browse/SPARK-3180 Project: Spark Issue Type: Improvement Reporter: Allan Douglas R. de Oliveira Two features can be combined together to provide better control of security group policies: - The ability to specify the address authorized to access the default security group (instead of letting everyone: 0.0.0.0/0) - The possibility to place the created machines on a custom security group One can use the combinations of the two flags to restrict external access to the provided security group (e.g by setting the authorized address to 127.0.0.1/32) while maintaining compatibility with the current behavior. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3180) Better control of security groups
[ https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106171#comment-14106171 ] Allan Douglas R. de Oliveira commented on SPARK-3180: - PR: https://github.com/apache/spark/pull/2088 Better control of security groups - Key: SPARK-3180 URL: https://issues.apache.org/jira/browse/SPARK-3180 Project: Spark Issue Type: Improvement Reporter: Allan Douglas R. de Oliveira Two features can be combined together to provide better control of security group policies: - The ability to specify the address authorized to access the default security group (instead of letting everyone: 0.0.0.0/0) - The possibility to place the created machines on a custom security group One can use the combinations of the two flags to restrict external access to the provided security group (e.g by setting the authorized address to 127.0.0.1/32) while maintaining compatibility with the current behavior. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3180) Better control of security groups
[ https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106177#comment-14106177 ] Allan Douglas R. de Oliveira commented on SPARK-3180: - Perhaps it also solves SPARK-2528 Better control of security groups - Key: SPARK-3180 URL: https://issues.apache.org/jira/browse/SPARK-3180 Project: Spark Issue Type: Improvement Reporter: Allan Douglas R. de Oliveira Two features can be combined together to provide better control of security group policies: - The ability to specify the address authorized to access the default security group (instead of letting everyone: 0.0.0.0/0) - The possibility to place the created machines on a custom security group One can use the combinations of the two flags to restrict external access to the provided security group (e.g by setting the authorized address to 127.0.0.1/32) while maintaining compatibility with the current behavior. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3180) Better control of security groups
[ https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106176#comment-14106176 ] Apache Spark commented on SPARK-3180: - User 'douglaz' has created a pull request for this issue: https://github.com/apache/spark/pull/2088 Better control of security groups - Key: SPARK-3180 URL: https://issues.apache.org/jira/browse/SPARK-3180 Project: Spark Issue Type: Improvement Reporter: Allan Douglas R. de Oliveira Two features can be combined together to provide better control of security group policies: - The ability to specify the address authorized to access the default security group (instead of letting everyone: 0.0.0.0/0) - The possibility to place the created machines on a custom security group One can use the combinations of the two flags to restrict external access to the provided security group (e.g by setting the authorized address to 127.0.0.1/32) while maintaining compatibility with the current behavior. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106200#comment-14106200 ] Evan Chan commented on SPARK-2360: -- +1 for this feature. I just had to write something for importing tab-delimited CSVs and converting the types of each column. As for the API, it really needs to do type conversion into the built-in types; otherwise it really affects the caching/compression efficiency and query speed, as well as what functions can be run on it. I think this is crucial. Maybe one can pass in a Map[String, ColumnType] or something like that. If a type is not specified for a column, then it is assumed to be String. CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Hossein Falaki I think the first step is to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
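To make the Map[String, ColumnType] idea concrete, a rough sketch against the Spark 1.1 applySchema API; the option map, helper structure, and default-to-String behaviour are illustrative, not a proposed interface:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

object CsvImportSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CsvImportSketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)

    // Caller-provided column types; columns not listed default to String.
    val columnTypes = Map("age" -> IntegerType, "score" -> DoubleType)
    val header = Seq("name", "age", "score")
    val schema = StructType(header.map(c =>
      StructField(c, columnTypes.getOrElse(c, StringType), nullable = true)))

    val lines = sc.parallelize(Seq("alice,30,1.5", "bob,25,2.0"))
    val rows = lines.map(_.split(",")).map { cols =>
      val converted = header.zip(cols).map { case (name, value) =>
        columnTypes.getOrElse(name, StringType) match {
          case IntegerType => value.trim.toInt
          case DoubleType  => value.trim.toDouble
          case _           => value
        }
      }
      Row(converted: _*)
    }

    sqlContext.applySchema(rows, schema).registerTempTable("people")
    sqlContext.sql("SELECT name, age + 1 FROM people").collect().foreach(println)
  }
}
{code}

With something along these lines, the separator, header handling, and quoting rules from the issue description become importer options rather than user code.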
[jira] [Commented] (SPARK-2871) Missing API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106446#comment-14106446 ] Apache Spark commented on SPARK-2871: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2091 Missing API in PySpark -- Key: SPARK-2871 URL: https://issues.apache.org/jira/browse/SPARK-2871 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu There are several APIs missing in PySpark: RDD.collectPartitions() RDD.histogram() RDD.zipWithIndex() RDD.zipWithUniqueId() RDD.min(comp) RDD.max(comp) A bunch of API related to approximate jobs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2871) Missing API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106449#comment-14106449 ] Apache Spark commented on SPARK-2871: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2092 Missing API in PySpark -- Key: SPARK-2871 URL: https://issues.apache.org/jira/browse/SPARK-2871 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu There are several APIs missing in PySpark: RDD.collectPartitions() RDD.histogram() RDD.zipWithIndex() RDD.zipWithUniqueId() RDD.min(comp) RDD.max(comp) A bunch of API related to approximate jobs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2871) Missing API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106456#comment-14106456 ] Apache Spark commented on SPARK-2871: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2093 Missing API in PySpark -- Key: SPARK-2871 URL: https://issues.apache.org/jira/browse/SPARK-2871 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu There are several APIs missing in PySpark: RDD.collectPartitions() RDD.histogram() RDD.zipWithIndex() RDD.zipWithUniqueId() RDD.min(comp) RDD.max(comp) A bunch of API related to approximate jobs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2871) Missing API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106462#comment-14106462 ] Apache Spark commented on SPARK-2871: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2094 Missing API in PySpark -- Key: SPARK-2871 URL: https://issues.apache.org/jira/browse/SPARK-2871 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu There are several APIs missing in PySpark: RDD.collectPartitions() RDD.histogram() RDD.zipWithIndex() RDD.zipWithUniqueId() RDD.min(comp) RDD.max(comp) A bunch of API related to approximate jobs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2871) Missing API in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106481#comment-14106481 ] Apache Spark commented on SPARK-2871: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2095 Missing API in PySpark -- Key: SPARK-2871 URL: https://issues.apache.org/jira/browse/SPARK-2871 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu There are several APIs missing in PySpark: RDD.collectPartitions() RDD.histogram() RDD.zipWithIndex() RDD.zipWithUniqueId() RDD.min(comp) RDD.max(comp) A bunch of API related to approximate jobs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org