[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365274#comment-14365274 ] Youssef Hatem commented on SPARK-6320: -- Thank you, Michael, for your response. I am actually trying to extend the SQL syntax with custom constructs, which would require altering the workflow of {{SparkSQL}} all the way from the lexer to optimization and physical planning. I thought this should be possible without having to extend many classes. However, not being able to use {{planLater}} forces me into a seemingly unnecessary extension of {{SparkPlanner}}, especially since there already seems to be some logic to handle these scenarios (i.e. the {{extraStrategies}} sequence). Adding new query plan strategy to SQLContext Key: SPARK-6320 URL: https://issues.apache.org/jira/browse/SPARK-6320 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Youssef Hatem Priority: Minor Hi, I would like to add a new strategy to {{SQLContext}}. To do this I created a new class which extends {{Strategy}}. In my new class I need to call the {{planLater}} function. However, this method is defined in {{SparkPlanner}} (which itself inherits the method from {{QueryPlanner}}). To my knowledge, the only way to make the {{planLater}} function visible to my new strategy is to define my strategy inside another class that extends {{SparkPlanner}} and thus inherits {{planLater}}. By doing so I will also have to extend {{SQLContext}} so that I can override the {{planner}} field with the new planner class I created. It seems that this is a design problem, because adding a new strategy seems to require extending {{SQLContext}} (unless I am doing it wrong and there is a better way to do it). Thanks a lot, Youssef -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
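A rough sketch of the extension chain described above, using Spark 1.3-era APIs. {{MySQLContext}}, {{MyStrategy}}, {{MyLogicalNode}} and {{MyPhysicalNode}} are made-up placeholders, and {{protected[sql]}} visibility details are glossed over (a real version typically has to live in the {{org.apache.spark.sql}} package); this is an illustration of the problem, not a recommended design:
{code}
// Sketch only: the strategy must be nested inside a planner so that planLater
// (inherited from QueryPlanner through SparkPlanner) is visible to it, which in
// turn forces SQLContext itself to be extended so the planner can be overridden.
class MySQLContext(sc: SparkContext) extends SQLContext(sc) {
  @transient
  override protected[sql] val planner = new SparkPlanner {
    object MyStrategy extends Strategy {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        case MyLogicalNode(child) => MyPhysicalNode(planLater(child)) :: Nil  // hypothetical nodes
        case _ => Nil
      }
    }
    override def strategies: Seq[Strategy] = MyStrategy +: super.strategies
  }
}
{code}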
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365307#comment-14365307 ] Yin Huai commented on SPARK-6250: - I see. The problem is that the data type string returned by Hive does not have backticks. Will take care of it soon. Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler
Wenchen Fan created SPARK-6394: -- Summary: cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler Key: SPARK-6394 URL: https://issues.apache.org/jira/browse/SPARK-6394 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Wenchen Fan Priority: Minor The current implementation of getCacheLocs includes searching a HashMap many times; we can avoid this. Also, BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove it. Finally, we can combine BlockManager.blockIdsToHosts and blockIdsToBlockManagers into a single method in order to remove some unnecessary layers of indirection / collection creation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
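A rough illustration of the first point (not the patch in the linked PR; the actual field names and types in {{DAGScheduler}} differ slightly): a single {{getOrElseUpdate}} replaces the repeated contains/get/update round-trips on the cache HashMap.
{code}
// Simplified sketch: one HashMap lookup instead of a contains() check followed by
// separate get()/update() calls.
private def getCacheLocs(rdd: RDD[_]): Seq[Seq[TaskLocation]] = {
  cacheLocs.getOrElseUpdate(rdd.id, {
    val blockIds = rdd.partitions.indices.map(i => RDDBlockId(rdd.id, i)).toArray[BlockId]
    blockManagerMaster.getLocations(blockIds).map { bms =>
      bms.map(bm => TaskLocation(bm.host, bm.executorId))
    }
  })
}
{code}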
[jira] [Commented] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366575#comment-14366575 ] Michael Armbrust commented on SPARK-6200: - Thanks for taking the time to explain your design! I guess at a high level I don't see the value in having a configurable DialectManager. Perhaps I am missing some use case. That said, what do you think about the simpler implementation that I linked to below? Would that suit your use case? Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Create a new dialect manager, support a dialect command, and allow adding new dialects via SQL statements, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366622#comment-14366622 ] Apache Spark commented on SPARK-6394: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/5043 cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler - Key: SPARK-6394 URL: https://issues.apache.org/jira/browse/SPARK-6394 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Wenchen Fan Priority: Minor The current implementation of getCacheLocs includes searching a HashMap many times; we can avoid this. Also, BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove it. Finally, we can combine BlockManager.blockIdsToHosts and blockIdsToBlockManagers into a single method in order to remove some unnecessary layers of indirection / collection creation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6145) ORDER BY fails to resolve nested fields
[ https://issues.apache.org/jira/browse/SPARK-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6145: Target Version/s: 1.3.1 (was: 1.3.0) ORDER BY fails to resolve nested fields --- Key: SPARK-6145 URL: https://issues.apache.org/jira/browse/SPARK-6145 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.3.0 {code} sqlContext.jsonRDD(sc.parallelize("""{"a": {"b": 1}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder") // Works sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY c") // Fails now sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY a.b") // Fails now sqlContext.sql("SELECT a.b FROM nestedOrder ORDER BY a.b") {code} Relatedly the error message for bad get fields should also include the name of the field in question. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.
[ https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6299. Resolution: Fixed Fix Version/s: 1.3.1 1.4.0 Assignee: Kevin (Sangwoo) Kim ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL. - Key: SPARK-6299 URL: https://issues.apache.org/jira/browse/SPARK-6299 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0, 1.2.1 Reporter: Kevin (Sangwoo) Kim Assignee: Kevin (Sangwoo) Kim Fix For: 1.4.0, 1.3.1 Anyone can reproduce this issue by the code below (runs well in local mode, got exception with clusters) (it runs well in Spark 1.1.1) case class ClassA(value: String) val rdd = sc.parallelize(List((k1, ClassA(v1)), (k1, ClassA(v2)) )) rdd.groupByKey.collect org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
[jira] [Updated] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.
[ https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6299: --- Description: Anyone can reproduce this issue by the code below (runs well in local mode, got exception with clusters) (it runs well in Spark 1.1.1) {code} case class ClassA(value: String) val rdd = sc.parallelize(List((k1, ClassA(v1)), (k1, ClassA(v2)) )) rdd.groupByKey.collect {code} {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) at
[jira] [Resolved] (SPARK-6044) RDD.aggregate() should not use the closure serializer on the zero value
[ https://issues.apache.org/jira/browse/SPARK-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6044. Resolution: Fixed Assignee: Sean Owen RDD.aggregate() should not use the closure serializer on the zero value --- Key: SPARK-6044 URL: https://issues.apache.org/jira/browse/SPARK-6044 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Matt Cheah Assignee: Sean Owen Fix For: 1.4.0 PairRDDFunctions.aggregateByKey() correctly uses SparkEnv.get.serializer.newInstance() to serialize the zero value. It seems this logic is not mirrored in RDD.aggregate(), which computes the aggregation and returns the aggregation directly at the driver. We should change RDD.aggregate() to make this consistent; I ran into some serialization errors because I was expecting RDD.aggregate() to Kryo serialize the zero value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
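For context, the zero-value handling in {{PairRDDFunctions.aggregateByKey}} that the reporter wants mirrored in {{RDD.aggregate()}} looks roughly like the following (paraphrased, not the eventual patch; {{makeZeroFactory}} is a made-up helper name used only to frame the pattern):
{code}
import java.nio.ByteBuffer
import scala.reflect.ClassTag
import org.apache.spark.SparkEnv

// Serialize the zero value once with the configured serializer (possibly Kryo), then
// hand out a fresh deserialized copy whenever a new accumulator is needed, instead of
// relying on the closure (Java) serializer.
def makeZeroFactory[U: ClassTag](zeroValue: U): () => U = {
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
}
{code}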
[jira] [Commented] (SPARK-6085) Increase default value for memory overhead
[ https://issues.apache.org/jira/browse/SPARK-6085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364740#comment-14364740 ] Apache Spark commented on SPARK-6085: - User 'jongyoul' has created a pull request for this issue: https://github.com/apache/spark/pull/5065 Increase default value for memory overhead -- Key: SPARK-6085 URL: https://issues.apache.org/jira/browse/SPARK-6085 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Fix For: 1.4.0 Several users have communicated how current default memory overhead value resulted in failed computation in Spark on YARN. See this thread: http://search-hadoop.com/m/JW1q58FDel Increasing default value for memory overhead would improve out of box user experience. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS
[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364764#comment-14364764 ] Antony Mayi commented on SPARK-6334: users: 12.5 million, ratings: 3.3 billion, rank: 50, iters: 15 spark-local dir not getting cleared during ALS -- Key: SPARK-6334 URL: https://issues.apache.org/jira/browse/SPARK-6334 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Antony Mayi Attachments: als-diskusage.png When running a bigger ALS training job, Spark spills loads of temp data into the local dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing the disks of all nodes to run out of space (in my case I have 12TB of available disk capacity before kicking off the ALS, but it all gets used, and YARN kills the containers when reaching 90%). Even with all recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared. Here is my (pseudo)code (pyspark): {code} sc.setCheckpointDir('/tmp') training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK) model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40) sc._jvm.System.gc() {code} The training RDD has about 3.5 billion items (~60GB on disk). After about 6 hours the ALS consumes all 12TB of disk space in local-dir data and gets killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 37 executors with 4 cores/28+4GB RAM each. This is the graph of the disk consumption pattern, showing the space being eaten from 7% to 90% during the ALS run (90% is when YARN kills the container): !als-diskusage.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364700#comment-14364700 ] Marko Bonaci commented on SPARK-6370: - Isn't the {{fraction}} parameter the size of sample (percentage of elements you want to sample from the calling RDD)? Cuz' without replacement works as though it is. If not, could you please explain? RDD sampling with replacement intermittently yields incorrect number of samples --- Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Labels: PoissonSampler, sample, sampler Here's the repl output: {{code:java}} scala uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at console:27 scala swr.count res17: Long = 16 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at console:27 scala swr.count res18: Long = 8 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at console:27 scala swr.count res19: Long = 18 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at console:27 scala swr.count res20: Long = 15 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at console:27 scala swr.count res21: Long = 11 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at console:27 scala swr.count res22: Long = 10 {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5954) Add topByKey to pair RDDs
[ https://issues.apache.org/jira/browse/SPARK-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364703#comment-14364703 ] Reynold Xin commented on SPARK-5954: Shouldn't this just be an API in dataframe? Add topByKey to pair RDDs - Key: SPARK-5954 URL: https://issues.apache.org/jira/browse/SPARK-5954 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Xiangrui Meng Assignee: Shuo Xiang `topByKey(num: Int): RDD[(K, V)]` finds the top-k values for each key in a pair RDD. This is used, e.g., in computing top recommendations. We can use the Guava implementation of finding top-k from an iterator. See also https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Utils.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
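One possible shape of the proposed API, sketched with {{aggregateByKey}}. The ticket suggests the Guava-based iterator top-k (as used in {{org.apache.spark.util.collection.Utils}}) for efficiency; this sketch trades that for brevity and returns the per-key values as a Seq rather than the {{RDD[(K, V)]}} written in the description:
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Keep a small, sorted buffer of at most `num` largest values per key.
def topByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], num: Int)
    (implicit ord: Ordering[V]): RDD[(K, Seq[V])] = {
  rdd.aggregateByKey(Seq.empty[V])(
    (top, v) => (v +: top).sorted(ord.reverse).take(num),   // fold one value into the buffer
    (t1, t2) => (t1 ++ t2).sorted(ord.reverse).take(num)    // merge buffers across partitions
  )
}

// Usage, e.g. for top recommendations: topByKey(scoredItemsByUser, 10)
{code}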
[jira] [Comment Edited] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364700#comment-14364700 ] Marko Bonaci edited comment on SPARK-6370 at 3/17/15 7:17 AM: -- Isn't the {{fraction}} parameter the size of sample (percentage of elements you want to sample from the calling RDD)? Cuz' without replacement works as though it is. If not, could you please explain (and perhaps improve the scaladoc)? was (Author: mbonaci): Isn't the {{fraction}} parameter the size of sample (percentage of elements you want to sample from the calling RDD)? Cuz' without replacement works as though it is. If not, could you please explain? RDD sampling with replacement intermittently yields incorrect number of samples --- Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Labels: PoissonSampler, sample, sampler Here's the repl output: {{code:java}} scala uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at console:27 scala swr.count res17: Long = 16 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at console:27 scala swr.count res18: Long = 8 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at console:27 scala swr.count res19: Long = 18 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at console:27 scala swr.count res20: Long = 15 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at console:27 scala swr.count res21: Long = 11 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at console:27 scala swr.count res22: Long = 10 {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6380) Resolution of equi-join key in post-join projection
[ https://issues.apache.org/jira/browse/SPARK-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6380: --- Description: {code} df1.join(df2, df1("key") === df2("key")).select("key") {code} It would be great to just resolve "key" to df1("key") in the case of inner joins. was: df1.join(df2, df1("key") === df2("key")).select("key") It would be great to just resolve "key" to df1("key") in the case of inner joins. Resolution of equi-join key in post-join projection --- Key: SPARK-6380 URL: https://issues.apache.org/jira/browse/SPARK-6380 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin {code} df1.join(df2, df1("key") === df2("key")).select("key") {code} It would be great to just resolve "key" to df1("key") in the case of inner joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6281) Support incremental updates for Graph
[ https://issues.apache.org/jira/browse/SPARK-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364787#comment-14364787 ] Apache Spark commented on SPARK-6281: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/5067 Support incremental updates for Graph - Key: SPARK-6281 URL: https://issues.apache.org/jira/browse/SPARK-6281 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Add an API to efficiently append new vertices and edges to an existing Graph, e.g., Graph#append(newVerts: RDD[(VertexId, VD)], newEdges: RDD[Edge[ED]], defaultVertexAttr: VD) This is useful for time-evolving graphs; new vertices and edges are built from streaming data through Spark Streaming and then incrementally appended to an existing graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6380) Resolution of equi-join key in post-join projection
Reynold Xin created SPARK-6380: -- Summary: Resolution of equi-join key in post-join projection Key: SPARK-6380 URL: https://issues.apache.org/jira/browse/SPARK-6380 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin df1.join(df2, df1(key) === df2(key)).select(key) It would be great to just resolve key to df1(key) in the case of inner joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6350) Make mesosExecutorCores configurable in mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-6350: Fix Version/s: 1.3.1 1.4.0 Make mesosExecutorCores configurable in mesos fine-grained mode - Key: SPARK-6350 URL: https://issues.apache.org/jira/browse/SPARK-6350 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Minor Fix For: 1.4.0, 1.3.1 When Spark runs in Mesos fine-grained mode, the Mesos slave launches an executor with a given number of CPUs and amount of memory. However, the executor's core count is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value to 5 for running an intensive task, the Mesos executor always consumes 5 cores even when no task is running. This wastes resources. We should make the executor core count a configuration variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6350) Make mesosExecutorCores configurable in mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364661#comment-14364661 ] Apache Spark commented on SPARK-6350: - User 'jongyoul' has created a pull request for this issue: https://github.com/apache/spark/pull/5063 Make mesosExecutorCores configurable in mesos fine-grained mode - Key: SPARK-6350 URL: https://issues.apache.org/jira/browse/SPARK-6350 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Minor When Spark runs in Mesos fine-grained mode, the Mesos slave launches an executor with a given number of CPUs and amount of memory. However, the executor's core count is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value to 5 for running an intensive task, the Mesos executor always consumes 5 cores even when no task is running. This wastes resources. We should make the executor core count a configuration variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
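A hedged example of the knob this change introduces; the property name follows the eventual Mesos documentation ({{spark.mesos.mesosExecutorCores}}), so treat it as an assumption on builds that predate the fix:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://host:5050")                  // fine-grained mode is the default here
  .set("spark.mesos.mesosExecutorCores", "0.5")    // cores the Mesos executor itself holds on to
  .set("spark.task.cpus", "1")                     // cores added on top of that per running task
{code}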
[jira] [Comment Edited] (SPARK-6304) Checkpointing doesn't retain driver port
[ https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364713#comment-14364713 ] Marius Soutier edited comment on SPARK-6304 at 3/17/15 7:36 AM: Got it, thanks. In my tests it was never set automatically, so this must be set at some later point. was (Author: msoutier): Got it, thanks. Checkpointing doesn't retain driver port Key: SPARK-6304 URL: https://issues.apache.org/jira/browse/SPARK-6304 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Marius Soutier In a check-pointed Streaming application running on a fixed driver port, the setting spark.driver.port is not loaded when recovering from a checkpoint. (The driver is then started on a random port.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6304) Checkpointing doesn't retain driver port
[ https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364713#comment-14364713 ] Marius Soutier commented on SPARK-6304: --- Got it, thanks. Checkpointing doesn't retain driver port Key: SPARK-6304 URL: https://issues.apache.org/jira/browse/SPARK-6304 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Marius Soutier In a check-pointed Streaming application running on a fixed driver port, the setting spark.driver.port is not loaded when recovering from a checkpoint. (The driver is then started on a random port.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364712#comment-14364712 ] Apache Spark commented on SPARK-5523: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/5064 TaskMetrics and TaskInfo have innumerable copies of the hostname string --- Key: SPARK-5523 URL: https://issues.apache.org/jira/browse/SPARK-5523 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Tathagata Das TaskMetrics and TaskInfo objects have the hostname associated with the task. As these are created (directly or through deserialization of RPC messages), each of them has a separate String object for the hostname even though most of them contain the same string data. This results in thousands of String objects, increasing the memory requirement of the driver. This can be easily deduped when deserializing a TaskMetrics object, or when creating a TaskInfo object. This affects streaming particularly badly due to the rate of job/stage/task generation. For a solution, see how this dedup is done for StorageLevel. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
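A minimal sketch of the dedup pattern referenced above, mirroring the cache approach used by {{StorageLevel}} (the object and method names here are illustrative, not the code from the linked PR):
{code}
import java.util.concurrent.ConcurrentHashMap

object HostNameCache {
  private val cache = new ConcurrentHashMap[String, String]()

  /** Return a canonical instance for `host`, so deserialized TaskMetrics/TaskInfo
    * objects can share one String per hostname instead of each holding its own copy. */
  def get(host: String): String = {
    val prev = cache.putIfAbsent(host, host)
    if (prev == null) host else prev
  }
}
{code}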
[jira] [Updated] (SPARK-6390) Add MatrixUDT in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6390: - Component/s: PySpark Add MatrixUDT in PySpark Key: SPARK-6390 URL: https://issues.apache.org/jira/browse/SPARK-6390 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng After SPARK-6309, we should support MatrixUDT in PySpark too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6390) Add MatrixUDT in PySpark
Xiangrui Meng created SPARK-6390: Summary: Add MatrixUDT in PySpark Key: SPARK-6390 URL: https://issues.apache.org/jira/browse/SPARK-6390 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng After SPARK-6309, we should support MatrixUDT in PySpark too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files
[ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365897#comment-14365897 ] Apache Spark commented on SPARK-5836: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5074 Highlight in Spark documentation that by default Spark does not delete its temporary files -- Key: SPARK-5836 URL: https://issues.apache.org/jira/browse/SPARK-5836 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Tomasz Dudziak We recently learnt the hard way (in a prod system) that Spark by default does not delete its temporary files until it is stopped. Within a relatively short time span of heavy Spark use, the disk of our prod machine filled up completely because of multiple shuffle files written to it. We think there should be better documentation around the fact that after a job is finished it leaves a lot of rubbish behind, so that this does not come as a surprise. Probably a good place to highlight that fact would be the documentation of the {{spark.local.dir}} property, which controls where Spark temporary files are written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365918#comment-14365918 ] Xiangrui Meng commented on SPARK-6192: -- [~MechCoder] Please be a little (but not too) specific in the proposal. For example, you should mention Python in the title of the proposal, which sets the theme of the project. Scala/Java will be definitely involved, but the goal is to have a better coverage of MLlib's Python API. This also helps reviewers understand the scope the proposal and rate it. You should also mention in the proposal that if the features are implemented by others, we will create new tasks within the theme of the project. So it is good for both MLlib and GSoC. Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6116) Making DataFrame API non-experimental
[ https://issues.apache.org/jira/browse/SPARK-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6116: --- Comment: was deleted (was: User 'azagrebin' has created a pull request for this issue: https://github.com/apache/spark/pull/5069) Making DataFrame API non-experimental - Key: SPARK-6116 URL: https://issues.apache.org/jira/browse/SPARK-6116 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin An umbrella ticket to track improvements and changes needed to make DataFrame API non-experimental. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marko Bonaci updated SPARK-6370: Priority: Minor (was: Major) Issue Type: Documentation (was: Bug) RDD sampling with replacement intermittently yields incorrect number of samples --- Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Priority: Minor Labels: PoissonSampler, sample, sampler Here's the repl output: {{code:java}} scala uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at console:27 scala swr.count res17: Long = 16 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at console:27 scala swr.count res18: Long = 8 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at console:27 scala swr.count res19: Long = 18 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at console:27 scala swr.count res20: Long = 15 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at console:27 scala swr.count res21: Long = 11 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at console:27 scala swr.count res22: Long = 10 {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5954) Add topByKey to pair RDDs
[ https://issues.apache.org/jira/browse/SPARK-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366081#comment-14366081 ] Apache Spark commented on SPARK-5954: - User 'coderxiang' has created a pull request for this issue: https://github.com/apache/spark/pull/5075 Add topByKey to pair RDDs - Key: SPARK-5954 URL: https://issues.apache.org/jira/browse/SPARK-5954 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Xiangrui Meng Assignee: Shuo Xiang `topByKey(num: Int): RDD[(K, V)]` finds the top-k values for each key in a pair RDD. This is used, e.g., in computing top recommendations. We can use the Guava implementation of finding top-k from an iterator. See also https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Utils.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5955) Add checkpointInterval to ALS
[ https://issues.apache.org/jira/browse/SPARK-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366127#comment-14366127 ] Apache Spark commented on SPARK-5955: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5076 Add checkpointInterval to ALS - Key: SPARK-5955 URL: https://issues.apache.org/jira/browse/SPARK-5955 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should add checkpoint interval to ALS to prevent the following: 1. storing large shuffle files 2. stack overflow (SPARK-1106) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
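A hedged usage sketch of the setter this ticket proposes; the {{setCheckpointInterval}} name follows the eventual API, {{sc}} is an existing SparkContext and {{ratings}} an existing {{RDD[Rating]}}, and checkpointing only takes effect once a checkpoint directory has been set:
{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("/tmp/spark-checkpoints")   // required for checkpointing to happen at all

val model = new ALS()
  .setRank(20)
  .setIterations(30)
  .setCheckpointInterval(10)   // checkpoint factor RDDs every 10 iterations: bounds lineage,
  .run(ratings)                // avoids SPARK-1106 stack overflows, lets old shuffle files go
{code}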
[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope
[ https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6382: - Description: Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi was: Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi withUDF(...) {...} for supporting temporary UDF definitions in the scope Key: SPARK-6382 URL: https://issues.apache.org/jira/browse/SPARK-6382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope
[ https://issues.apache.org/jira/browse/SPARK-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-6382: - Description: Currently the scope of UDF registration is global. It's unsuitable for libraries that are built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi was: Currently the scope of UDF registration is global. It's unsuitable for libraries that's built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi withUDF(...) {...} for supporting temporary UDF definitions in the scope Key: SPARK-6382 URL: https://issues.apache.org/jira/browse/SPARK-6382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Currently the scope of UDF registration is global. It's unsuitable for libraries that are built on top of DataFrame, as many operations has to be done by registering a UDF first. Please provide a way for binding temporary UDFs. e.g. {code} withUDF((merge_map, (m1: Map[String, Double], m2: Map[String, Double]) = m2 ++ m2), ...) { sql(select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id) } {code} Also UDF registry is a mutable Hashmap, refactoring it to a immutable one makes more sense. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6381) add Apriori algorithm to MLLib
[ https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6381: - Target Version/s: (was: 1.4.0) Affects Version/s: (was: 1.3.1) add Apriori algorithm to MLLib -- Key: SPARK-6381 URL: https://issues.apache.org/jira/browse/SPARK-6381 Project: Spark Issue Type: New Feature Components: MLlib Reporter: zhangyouhua [~mengxr] There are many algorithms about association rule mining,for example FPGrowth, Apriori and so on.these algorithms are classic algorithms in machine learning, and there are very much usefully in big data mining. Even the FPGrowth algorithm in spark 1.3 version have implementation to solution big big data set, but it need create FPTree before mining frequent item. so while transition data is smaller and the data is sparse and minSupport is bigger,wen can select Apriori algorithms. how Apriori algorithm parallelism? 1.Generates frequent items by filtering the input data using minimal support level. private def genFreqItems[Item: ClassTag]( data: RDD[Array[Item]],minCount: Long,partitioner: Partitioner): Array[Item] 2.Generate frequent itemSets by building apriori, the extraction is done on each partition. 2.1 create candidateSet by kFreqItems and k private def createCandidateSet[Item: ClassTag]( kFreqItems: Array[(Array[Item], Long)], k: Int) 2.2 create kFreqItems from candidateSet is generated by candidateSet private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]],candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)] 2.3 filter dataSet by candidateSet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6381) add Apriori algorithm to MLLib
[ https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangyouhua updated SPARK-6381: --- Target Version/s: 1.3.1 Affects Version/s: 1.3.0 Fix Version/s: 1.3.1 add Apriori algorithm to MLLib -- Key: SPARK-6381 URL: https://issues.apache.org/jira/browse/SPARK-6381 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: zhangyouhua Fix For: 1.3.1 [~mengxr] There are many algorithms for association rule mining, for example FPGrowth, Apriori and so on. These are classic machine learning algorithms and are very useful in big data mining. The FPGrowth implementation in Spark 1.3 can handle very large data sets, but it needs to build an FP-tree before mining frequent items. So when the transaction data is smaller, the data is sparse, and minSupport is larger, we can choose the Apriori algorithm. How can the Apriori algorithm be parallelized? 1. Generate frequent items by filtering the input data using the minimal support level. private def genFreqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: Long, partitioner: Partitioner): Array[Item] 2. Generate frequent itemsets level by level; the extraction is done on each partition. 2.1 Create the candidate set from kFreqItems and k: private def createCandidateSet[Item: ClassTag](kFreqItems: Array[(Array[Item], Long)], k: Int) 2.2 Create kFreqItems from the candidate set: private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)] 2.3 Filter the dataSet by the candidate set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
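To make the proposed flow concrete, a loose skeleton of the level-wise loop implied by steps 1-2 above. The helper names simply echo the signatures in the description (their bodies are not given there, so this cannot run as-is), and types are simplified; it is an illustration of the proposal, not a patch:
{code}
// Step 1: frequent single items, then frequent 1-itemsets with their support counts.
val freqItems: Array[Item] = genFreqItems(data, minCount, partitioner)
var kFreqItemsets: Array[(Array[Item], Long)] =
  scanDataSet(data, freqItems.map(Array(_)), minCount).collect()
val results = scala.collection.mutable.ArrayBuffer(kFreqItemsets)
var k = 2
while (kFreqItemsets.nonEmpty) {
  val candidates = createCandidateSet(kFreqItemsets, k)              // 2.1: join step
  kFreqItemsets = scanDataSet(data, candidates, minCount).collect()  // 2.2/2.3: count support, prune
  results += kFreqItemsets
  k += 1
}
{code}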
[jira] [Commented] (SPARK-6381) add Apriori algorithm to MLLib
[ https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364847#comment-14364847 ] Sean Owen commented on SPARK-6381: -- See SPARK-2432. I don't think apriori is as useful or used as FP-growth add Apriori algorithm to MLLib -- Key: SPARK-6381 URL: https://issues.apache.org/jira/browse/SPARK-6381 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: zhangyouhua [~mengxr] There are many algorithms about association rule mining,for example FPGrowth, Apriori and so on.these algorithms are classic algorithms in machine learning, and there are very much usefully in big data mining. Even the FPGrowth algorithm in spark 1.3 version have implementation to solution big big data set, but it need create FPTree before mining frequent item. so while transition data is smaller and the data is sparse and minSupport is bigger,wen can select Apriori algorithms. how Apriori algorithm parallelism? 1.Generates frequent items by filtering the input data using minimal support level. private def genFreqItems[Item: ClassTag]( data: RDD[Array[Item]],minCount: Long,partitioner: Partitioner): Array[Item] 2.Generate frequent itemSets by building apriori, the extraction is done on each partition. 2.1 create candidateSet by kFreqItems and k private def createCandidateSet[Item: ClassTag]( kFreqItems: Array[(Array[Item], Long)], k: Int) 2.2 create kFreqItems from candidateSet is generated by candidateSet private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]],candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)] 2.3 filter dataSet by candidateSet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions
Nick Bruun created SPARK-6385: - Summary: ISO 8601 timestamp parsing does not support arbitrary precision second fractions Key: SPARK-6385 URL: https://issues.apache.org/jira/browse/SPARK-6385 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Nick Bruun Priority: Minor The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does not support arbitrary precision fractions of seconds, only millisecond precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while {{2015-02-02T00:00:07.9000GMT-00:00}} will fail. I'm willing to implement a fix, but pointers on the direction would be appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
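One possible direction, sketched purely as an illustration (the eventual fix may well differ, and this trades sub-millisecond precision for tolerance): normalize the fractional-seconds field to the three digits the current millisecond-precision parser expects before handing the string over.
{code}
// Pad or truncate the ".S..." fraction to exactly three digits, so that
// "2015-02-02T00:00:07.9000GMT-00:00" parses like "2015-02-02T00:00:07.900GMT-00:00".
def normalizeSecondFraction(timestamp: String): String = {
  val Fraction = """\.(\d+)""".r
  Fraction.replaceAllIn(timestamp, m => "." + m.group(1).padTo(3, '0').take(3))
}
{code}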
[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364859#comment-14364859 ] Sean Owen commented on SPARK-6370: -- Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply to me that it's the size of the sample as a fraction of the total. In fact it's the probability that each element is chosen when without replacement, and an expected number of times each element is chosen when with replacement. I think it needs a doc update. Would you like to open a PR to elaborate the javadoc / scaladoc / Python doc of all of the sample methods? wouldn't hurt to doc the {{RandomSampler}} subclasses too. RDD sampling with replacement intermittently yields incorrect number of samples --- Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Labels: PoissonSampler, sample, sampler Here's the repl output: {{code:java}} scala uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at console:27 scala swr.count res17: Long = 16 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at console:27 scala swr.count res18: Long = 8 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at console:27 scala swr.count res19: Long = 18 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at console:27 scala swr.count res20: Long = 15 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at console:27 scala swr.count res21: Long = 11 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at console:27 scala swr.count res22: Long = 10 {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
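To make that concrete, a small illustration of the semantics described above (the exact counts vary from run to run, which is the point of the ticket):
{code}
val ids = sc.parallelize(1 to 1000)
// Without replacement: each element is kept with probability 0.5, so ~500 elements on average.
ids.sample(withReplacement = false, fraction = 0.5).count()
// With replacement: each element is drawn a Poisson(0.5)-distributed number of times,
// so the count is again ~500 only in expectation, never guaranteed.
ids.sample(withReplacement = true, fraction = 0.5).count()
{code}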
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364803#comment-14364803 ] Qiping Li commented on SPARK-4590: -- Hi, [~rezazadeh], thanks for your investigation. It is very helpful to have parameter server integrated into Spark. How is it going with this issue now? Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the currently implementation of GLM solvers, we save intermediate models on the driver node and update it through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6381) add Apriori algorithm to MLLib
[ https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6381. -- Resolution: Duplicate Fix Version/s: (was: 1.4.0) (Don't set fix version, and 1.3.1 does not exist.) Search JIRA first please. This was already implemented in SPARK-4001 as FP-growth. See also SPARK-2432. add Apriori algorithm to MLLib -- Key: SPARK-6381 URL: https://issues.apache.org/jira/browse/SPARK-6381 Project: Spark Issue Type: New Feature Components: MLlib Reporter: zhangyouhua [~mengxr] There are many algorithms for association rule mining, for example FP-Growth, Apriori and so on. These are classic machine learning algorithms and are very useful in big data mining. The FP-Growth implementation in Spark 1.3 can handle very large data sets, but it needs to build an FP-Tree before mining frequent items. So when the transaction data is smaller, the data is sparse, and minSupport is larger, we can choose the Apriori algorithm. How can the Apriori algorithm be parallelized? 1. Generate frequent items by filtering the input data using a minimal support level. private def genFreqItems[Item: ClassTag]( data: RDD[Array[Item]], minCount: Long, partitioner: Partitioner): Array[Item] 2. Generate frequent itemsets by building Apriori; the extraction is done on each partition. 2.1 Create candidateSet from kFreqItems and k: private def createCandidateSet[Item: ClassTag]( kFreqItems: Array[(Array[Item], Long)], k: Int) 2.2 Create the next kFreqItems from the candidateSet: private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)] 2.3 Filter dataSet by candidateSet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6383) Few examples on Dataframe operation give compiler errors
[ https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364825#comment-14364825 ] Apache Spark commented on SPARK-6383: - User 'tijoparacka' has created a pull request for this issue: https://github.com/apache/spark/pull/5068 Few examples on Dataframe operation give compiler errors - Key: SPARK-6383 URL: https://issues.apache.org/jira/browse/SPARK-6383 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial The statements below give compiler errors because a) the select method does not accept a mix of String and Column: df.select("name", df("age") + 1).show() // need to convert the String to a Column b) the filter should be on the age column, not the name column: df.filter(df("name") > 21).show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
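For reference, the corrected form of the two documentation snippets is presumably along these lines (a sketch assuming a DataFrame {{df}} with {{name}} and {{age}} columns, not necessarily the exact wording of the PR):
{code}
// Corrected examples (assuming a DataFrame `df` with "name" and "age" columns):
df.select(df("name"), df("age") + 1).show()  // pass Columns consistently
df.filter(df("age") > 21).show()             // filter on age, not name
{code}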
[jira] [Created] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders
Rex Xiong created SPARK-6384: Summary: saveAsParquet doesn't clean up attempt_* folders Key: SPARK-6384 URL: https://issues.apache.org/jira/browse/SPARK-6384 Project: Spark Issue Type: Bug Components: SQL Reporter: Rex Xiong After calling SchemaRDD.saveAsParquet, it runs well and generate *.parquet, _SUCCESS, _common_metadata, _metadata files successfully. But sometimes, there will be some attempt_* folder (e.g. attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) under the same folder, it contains one parquet file, seems to be a working temp folder. It happens even though _SUCCESS file created. In this situation, SparkSQL throws exception when loading this parquet folder: Error: java.io.FileNotFoundException: Path is not a file: ../attempt_201503170229_0006_r_06_736 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:69) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:55) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations UpdateTimes(FSNamesystem.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations Int(FSNamesystem.java:1671) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1651) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1625) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca tions(NameNodeRpcServer.java:503) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32 2) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal l(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1594) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,co de=0) I'm not sure whether it's a Spark bug or a Parquet bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders
[ https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rex Xiong updated SPARK-6384: - Description: After calling SchemaRDD.saveAsParquet, it runs well and generate *.parquet, _SUCCESS, _common_metadata, _metadata files successfully. But sometimes, there will be some attempt_* folder (e.g. attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) under the same folder, it contains one parquet file, seems to be a working temp folder. It happens even though _SUCCESS file created. In this situation, SparkSQL (Hive table) throws exception when loading this parquet folder: Error: java.io.FileNotFoundException: Path is not a file: ../attempt_201503170229_0006_r_06_736 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:69) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:55) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations UpdateTimes(FSNamesystem.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations Int(FSNamesystem.java:1671) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1651) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1625) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca tions(NameNodeRpcServer.java:503) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32 2) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal l(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1594) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,co de=0) I'm not sure whether it's a Spark bug or a Parquet bug. was: After calling SchemaRDD.saveAsParquet, it runs well and generate *.parquet, _SUCCESS, _common_metadata, _metadata files successfully. But sometimes, there will be some attempt_* folder (e.g. attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) under the same folder, it contains one parquet file, seems to be a working temp folder. It happens even though _SUCCESS file created. 
In this situation, SparkSQL throws exception when loading this parquet folder: Error: java.io.FileNotFoundException: Path is not a file: ../attempt_201503170229_0006_r_06_736 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:69) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:55) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations UpdateTimes(FSNamesystem.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations Int(FSNamesystem.java:1671) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1651) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1625) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca tions(NameNodeRpcServer.java:503) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32 2) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal l(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1594) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,co de=0) I'm not sure whether it's a Spark bug or a Parquet bug. saveAsParquet doesn't clean up attempt_* folders
[jira] [Created] (SPARK-6382) withUDF(...) {...} for supporting temporary UDF definitions in the scope
Jianshi Huang created SPARK-6382: Summary: withUDF(...) {...} for supporting temporary UDF definitions in the scope Key: SPARK-6382 URL: https://issues.apache.org/jira/browse/SPARK-6382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Currently the scope of UDF registration is global. This is unsuitable for libraries built on top of DataFrame, as many operations have to be done by registering a UDF first. Please provide a way to bind temporary UDFs, e.g. {code} withUDF(("merge_map", (m1: Map[String, Double], m2: Map[String, Double]) => m1 ++ m2), ...) { sql("select merge_map(d1.map, d2.map) from d1, d2 where d1.id = d2.id") } {code} Also, the UDF registry is a mutable HashMap; refactoring it to an immutable one would make more sense. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
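Conceptually, the requested scoping could be implemented by saving and restoring registry entries around the block. The sketch below is purely illustrative: the current registry offers no public way to remove a function, and {{registry}} here stands in for SQLContext's internal mutable map.
{code}
import scala.collection.mutable

// Hypothetical helper: register the given UDFs, run the block, then restore
// whatever was registered before (or remove the temporary entries).
def withUDF[T](registry: mutable.Map[String, AnyRef], udfs: (String, AnyRef)*)(body: => T): T = {
  val saved = udfs.map { case (name, _) => name -> registry.get(name) }
  udfs.foreach { case (name, f) => registry(name) = f }
  try body finally saved.foreach {
    case (name, Some(previous)) => registry(name) = previous
    case (name, None)           => registry.remove(name)
  }
}
{code}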
[jira] [Updated] (SPARK-6381) add Apriori algorithm to MLLib
[ https://issues.apache.org/jira/browse/SPARK-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6381: - Target Version/s: (was: 1.3.1) Fix Version/s: (was: 1.3.1) @zhangyouhua please do not set fix version, as this is not 'fixed', but a duplicate. Please don't set target version either as it is resolved. add Apriori algorithm to MLLib -- Key: SPARK-6381 URL: https://issues.apache.org/jira/browse/SPARK-6381 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: zhangyouhua [~mengxr] There are many algorithms for association rule mining, for example FP-Growth, Apriori and so on. These are classic machine learning algorithms and are very useful in big data mining. The FP-Growth implementation in Spark 1.3 can handle very large data sets, but it needs to build an FP-Tree before mining frequent items. So when the transaction data is smaller, the data is sparse, and minSupport is larger, we can choose the Apriori algorithm. How can the Apriori algorithm be parallelized? 1. Generate frequent items by filtering the input data using a minimal support level. private def genFreqItems[Item: ClassTag]( data: RDD[Array[Item]], minCount: Long, partitioner: Partitioner): Array[Item] 2. Generate frequent itemsets by building Apriori; the extraction is done on each partition. 2.1 Create candidateSet from kFreqItems and k: private def createCandidateSet[Item: ClassTag]( kFreqItems: Array[(Array[Item], Long)], k: Int) 2.2 Create the next kFreqItems from the candidateSet: private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)] 2.3 Filter dataSet by candidateSet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6370: - Comment: was deleted (was: What's the bug? Each element is sampled with probability 0.5. I think the expected size is 14 but not all samples would be that size. ) RDD sampling with replacement intermittently yields incorrect number of samples --- Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Labels: PoissonSampler, sample, sampler Here's the repl output: {{code:java}} scala uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at console:27 scala swr.count res17: Long = 16 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at console:27 scala swr.count res18: Long = 8 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at console:27 scala swr.count res19: Long = 18 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at console:27 scala swr.count res20: Long = 15 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at console:27 scala swr.count res21: Long = 11 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at console:27 scala swr.count res22: Long = 10 {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6377) Set the number of shuffle partitions automatically based on the size of input tables and the reduce-side operation.
[ https://issues.apache.org/jira/browse/SPARK-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364901#comment-14364901 ] Sean Owen commented on SPARK-6377: -- I think this is the same basic suggestion as SPARK-4630? Set the number of shuffle partitions automatically based on the size of input tables and the reduce-side operation. --- Key: SPARK-6377 URL: https://issues.apache.org/jira/browse/SPARK-6377 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai It will be helpful to automatically set the number of shuffle partitions based on the size of input tables and the operation at the reduce side for an Exchange operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6381) add Apriori algorithm to MLLib
zhangyouhua created SPARK-6381: -- Summary: add Apriori algorithm to MLLib Key: SPARK-6381 URL: https://issues.apache.org/jira/browse/SPARK-6381 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.1 Reporter: zhangyouhua Fix For: 1.4.0 [~mengxr] There are many algorithms for association rule mining, for example FP-Growth, Apriori and so on. These are classic machine learning algorithms and are very useful in big data mining. The FP-Growth implementation in Spark 1.3 can handle very large data sets, but it needs to build an FP-Tree before mining frequent items. So when the transaction data is smaller, the data is sparse, and minSupport is larger, we can choose the Apriori algorithm. How can the Apriori algorithm be parallelized? 1. Generate frequent items by filtering the input data using a minimal support level. private def genFreqItems[Item: ClassTag]( data: RDD[Array[Item]], minCount: Long, partitioner: Partitioner): Array[Item] 2. Generate frequent itemsets by building Apriori; the extraction is done on each partition. 2.1 Create candidateSet from kFreqItems and k: private def createCandidateSet[Item: ClassTag]( kFreqItems: Array[(Array[Item], Long)], k: Int) 2.2 Create the next kFreqItems from the candidateSet: private def scanDataSet[Item: ClassTag](dataSet: RDD[Array[Item]], candidateSet: Array[Array[Item]], minCount: Double): RDD[(Array[Item], Long)] 2.3 Filter dataSet by candidateSet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6383) Few examples on Dataframe operation give compiler errors
Tijo Thomas created SPARK-6383: -- Summary: Few examples on Dataframe operation give compiler errors Key: SPARK-6383 URL: https://issues.apache.org/jira/browse/SPARK-6383 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial The statements below give compiler errors because a) the select method does not accept a mix of String and Column: df.select("name", df("age") + 1).show() // need to convert the String to a Column b) the filter should be on the age column, not the name column: df.filter(df("name") > 21).show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364859#comment-14364859 ] Sean Owen edited comment on SPARK-6370 at 3/17/15 10:09 AM: Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply to me that it's the size of the sample as a fraction of the total. In fact it's the probability that each element is chosen when without replacement, and an expected number of times each element is chosen when with replacement. EDIT: ... which TBC in both cases also means that the *expected* size of the sample is the given fraction of the input size. I think it needs a doc update. Would you like to open a PR to elaborate the javadoc / scaladoc / Python doc of all of the sample methods? wouldn't hurt to doc the {{RandomSampler}} subclasses too. was (Author: srowen): Ah. The docs don't explain this behavior indeed. {{fraction}} does also imply to me that it's the size of the sample as a fraction of the total. In fact it's the probability that each element is chosen when without replacement, and an expected number of times each element is chosen when with replacement. I think it needs a doc update. Would you like to open a PR to elaborate the javadoc / scaladoc / Python doc of all of the sample methods? wouldn't hurt to doc the {{RandomSampler}} subclasses too. RDD sampling with replacement intermittently yields incorrect number of samples --- Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Labels: PoissonSampler, sample, sampler Here's the repl output: {{code:java}} scala uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at console:27 scala swr.count res17: Long = 16 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at console:27 scala swr.count res18: Long = 8 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at console:27 scala swr.count res19: Long = 18 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at console:27 scala swr.count res20: Long = 15 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at console:27 scala swr.count res21: Long = 11 scala val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at console:27 scala swr.count res22: Long = 10 {{code}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders
[ https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rex Xiong updated SPARK-6384: - Affects Version/s: 1.2.1 saveAsParquet doesn't clean up attempt_* folders Key: SPARK-6384 URL: https://issues.apache.org/jira/browse/SPARK-6384 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Rex Xiong After calling SchemaRDD.saveAsParquet, it runs well and generate *.parquet, _SUCCESS, _common_metadata, _metadata files successfully. But sometimes, there will be some attempt_* folder (e.g. attempt_201503170229_0006_r_06_736, attempt_201503170229_0006_r_000404_416) under the same folder, it contains one parquet file, seems to be a working temp folder. It happens even though _SUCCESS file created. In this situation, SparkSQL throws exception when loading this parquet folder: Error: java.io.FileNotFoundException: Path is not a file: ../attempt_201503170229_0006_r_06_736 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:69) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.ja va:55) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations UpdateTimes(FSNamesystem.java:1728) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations Int(FSNamesystem.java:1671) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1651) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations (FSNamesystem.java:1625) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLoca tions(NameNodeRpcServer.java:503) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra nslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:32 2) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal l(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1594) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,co de=0) I'm not sure whether it's a Spark bug or a Parquet bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6349) Add probability estimates in SVMModel predict result
[ https://issues.apache.org/jira/browse/SPARK-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364888#comment-14364888 ] Sean Owen commented on SPARK-6349: -- One is just a monotonic function of the other. It doesn't change the problem at all in that respect. Add probability estimates in SVMModel predict result Key: SPARK-6349 URL: https://issues.apache.org/jira/browse/SPARK-6349 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.1 Reporter: tanyinyan Original Estimate: 168h Remaining Estimate: 168h In SVMModel, the predictPoint method outputs a raw margin (threshold not set) or a 1/0 label (threshold set). When SVM is used as a classifier, it's hard to find a good threshold, and the raw margin is hard to interpret. When I use SVM on the dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), train on the first day's data (ignoring the fields id/device_id/device_ip; all remaining fields are treated as categorical variables and converted to sparse features before SVM) and predict on the same data with the threshold cleared, the predictions are all negative. I have to set the threshold to -1 to get a reasonable confusion matrix. So I suggest providing probability estimates in SVMModel's predict result, as in libSVM (Platt's binary SVM probabilistic output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
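For context, the "Platt's binary SVM probabilistic output" mentioned above fits a sigmoid over raw margins on held-out data, P(y=1|f) = 1 / (1 + exp(A*f + B)); at prediction time the mapping is just the following (sketch; A and B are assumed to have been fit separately):
{code}
// Platt scaling at prediction time: map a raw SVM margin to a probability.
// The coefficients a and b are assumed to have been fit on held-out margins.
def plattProbability(margin: Double, a: Double, b: Double): Double =
  1.0 / (1.0 + math.exp(a * margin + b))
{code}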
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364813#comment-14364813 ] Dr. Christian Betz commented on SPARK-5081: --- [~pwendell] Hi, obviously there's nobody looking into that issue. Could you please clarify, assign, or whatever it takes to get this issue handled in a future version of spark? That'd be great! Thanks!!! :) Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung Priority: Critical Attachments: Spark_Debug.pdf, diff.txt The size of shuffle write showing in spark web UI is much different when I execute same spark job with same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set spark.shuffle.manager option to hash because it's default value is changed but spark 1.2 still writes shuffle output more than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2. spark 1.1 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |9|saveAsTextFile| |1169.4KB| | |12|combineByKey| |1265.4KB|1275.0KB| |6|sortByKey| |1276.5KB| | |8|mapPartitions| |91.0MB|1383.1KB| |4|apply| |89.4MB| | |5|sortBy|155.6MB| |98.1MB| |3|sortBy|155.6MB| | | |1|collect| |2.1MB| | |2|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | spark 1.2 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |12|saveAsTextFile| |1170.2KB| | |11|combineByKey| |1264.5KB|1275.0KB| |8|sortByKey| |1273.6KB| | |7|mapPartitions| |134.5MB|1383.1KB| |5|zipWithIndex| |132.5MB| | |4|sortBy|155.6MB| |146.9MB| |3|sortBy|155.6MB| | | |2|collect| |2.0MB| | |1|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5332) Efficient way to deal with ExecutorLost
[ https://issues.apache.org/jira/browse/SPARK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5332. -- Resolution: Won't Fix Efficient way to deal with ExecutorLost --- Key: SPARK-5332 URL: https://issues.apache.org/jira/browse/SPARK-5332 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Liang-Chi Hsieh Priority: Minor Currently, the handler of the case when an executor being lost in DAGScheduler (handleExecutorLost) looks not efficient. This pr tries to add a bit of extra information to Stage class to improve that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6386) add association rule mining algorithm to MLLib
zhangyouhua created SPARK-6386: -- Summary: add association rule mining algorithm to MLLib Key: SPARK-6386 URL: https://issues.apache.org/jira/browse/SPARK-6386 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: zhangyouhua [~mengxr] An association rule algorithm finds frequent items that are associated with each other, given a transaction data set, a minSupport and a minConf. We can use the FPGrowth algorithm to mine frequent pattern items, but it does not produce the rules that relate them, so we should add an association rule algorithm. For example, with the data set {A B C}, {A C}, {A D}, {B E F}, minSupport: 0.5 and minConf: 0.5, the frequent items and their supports are: A - 0.75, B - 0.5, C - 0.5, A C - 0.5. Using these supports we can compute confidences, e.g. A => C: support{A C}/support{A} = 0.67 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
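Rule generation on top of the mined frequent itemsets then reduces to a confidence computation over the supports, mirroring the numbers in the example above (illustrative sketch only):
{code}
// Confidence of the rule A => C from the example supports above.
val support = Map(Set("A") -> 0.75, Set("B") -> 0.5, Set("C") -> 0.5, Set("A", "C") -> 0.5)
val confidence = support(Set("A", "C")) / support(Set("A"))  // 0.5 / 0.75 = 0.67
{code}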
[jira] [Commented] (SPARK-6117) describe function for summary statistics
[ https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365781#comment-14365781 ] Apache Spark commented on SPARK-6117: - User 'azagrebin' has created a pull request for this issue: https://github.com/apache/spark/pull/5073 describe function for summary statistics Key: SPARK-6117 URL: https://issues.apache.org/jira/browse/SPARK-6117 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter DataFrame.describe should return a DataFrame with summary statistics. {code} def describe(cols: String*): DataFrame {code} If cols is empty, then run describe on all numeric columns. The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and n + 1 columns. The 1st column is the name of the aggregate function, and the next n columns are the numeric columns of interest in the input DataFrame. Similar to Pandas (but removing percentile since accurate percentiles are too expensive to compute for Big Data)
{code}
In [19]: df.describe()
Out[19]:
               A          B          C          D
count       6.00       6.00       6.00       6.00
mean    0.073711  -0.431125  -0.687758  -0.233103
std     0.843157   0.922818   0.779887   0.973118
min    -0.861849  -2.104569  -1.509059  -1.135632
max     1.212112   0.567020   0.276232   1.071804
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
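One straightforward way to obtain the five statistics with the existing DataFrame API (a sketch under the assumption of double-typed columns, not the implementation in the linked PR) is a single aggregation pass per column, deriving the standard deviation from E[x^2] - E[x]^2:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: summary statistics for one numeric (double-typed) column.
def describeColumn(df: DataFrame, c: String): Map[String, Double] = {
  val row = df.agg(count(df(c)), avg(df(c)), avg(df(c) * df(c)), min(df(c)), max(df(c))).collect()(0)
  val (n, mean, meanSq) = (row.getLong(0).toDouble, row.getDouble(1), row.getDouble(2))
  Map(
    "count"  -> n,
    "mean"   -> mean,
    "stddev" -> math.sqrt(meanSq - mean * mean),  // population stddev, for brevity
    "min"    -> row.getDouble(3),
    "max"    -> row.getDouble(4))
}
{code}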
[jira] [Updated] (SPARK-6336) LBFGS should document what convergenceTol means
[ https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6336: - Target Version/s: 1.4.0, 1.3.1 (was: 1.4.0) LBFGS should document what convergenceTol means --- Key: SPARK-6336 URL: https://issues.apache.org/jira/browse/SPARK-6336 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki Priority: Trivial Fix For: 1.4.0, 1.3.1 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance. We should document that convergenceTol is relative and ensure in a unit test that this behavior does not change in Breeze without us realizing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6388) Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40
[ https://issues.apache.org/jira/browse/SPARK-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365758#comment-14365758 ] Michael Malak commented on SPARK-6388: -- Isn't it Hadoop 2.7 that is supposed to provide Java 8 compatibility? Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40 -- Key: SPARK-6388 URL: https://issues.apache.org/jira/browse/SPARK-6388 Project: Spark Issue Type: Bug Components: Block Manager, Spark Submit, YARN Affects Versions: 1.3.0 Environment: 1. Linux version 3.16.0-30-generic (buildd@komainu) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #40-Ubuntu SMP Mon Jan 12 22:06:37 UTC 2015 2. Oracle Java 8 update 40 for Linux X64 3. Scala 2.10.5 4. Hadoop 2.6 (pre-build version) Reporter: John Original Estimate: 24h Remaining Estimate: 24h I build Apache Spark 1.3 munally. --- JAVA_HOME=PATH_TO_JAVA8 mvn clean package -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests --- Something goes wrong, akka always tell me --- 15/03/17 21:28:10 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@Server2:42161] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. --- I build another version of Spark 1.3 + Hadoop 2.6 under Java 7. Everything goes well. Logs --- 15/03/17 21:27:06 INFO spark.SparkContext: Running Spark version 1.3.0 15/03/17 21:27:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/17 21:27:08 INFO spark.SecurityManager: Changing view Servers to: hduser 15/03/17 21:27:08 INFO spark.SecurityManager: Changing modify Servers to: hduser 15/03/17 21:27:08 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui Servers disabled; users with view permissions: Set(hduser); users with modify permissions: Set(hduser) 15/03/17 21:27:08 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/03/17 21:27:08 INFO Remoting: Starting remoting 15/03/17 21:27:09 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@Server3:37951] 15/03/17 21:27:09 INFO util.Utils: Successfully started service 'sparkDriver' on port 37951. 15/03/17 21:27:09 INFO spark.SparkEnv: Registering MapOutputTracker 15/03/17 21:27:09 INFO spark.SparkEnv: Registering BlockManagerMaster 15/03/17 21:27:09 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-0db692bb-cd02-40c8-a8f0-3813c6da18e2/blockmgr-a1d0ad23-ab76-4177-80a0-a6f982a64d80 15/03/17 21:27:09 INFO storage.MemoryStore: MemoryStore started with capacity 265.1 MB 15/03/17 21:27:09 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-502ef3f8-b8cd-45cf-b1df-97df297cdb35/httpd-6303e24d-4b2b-4614-bb1d-74e8d331189b 15/03/17 21:27:09 INFO spark.HttpServer: Starting HTTP Server 15/03/17 21:27:09 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/17 21:27:10 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:48000 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'HTTP file server' on port 48000. 15/03/17 21:27:10 INFO spark.SparkEnv: Registering OutputCommitCoordinator 15/03/17 21:27:10 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/17 21:27:10 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 
15/03/17 21:27:10 INFO ui.SparkUI: Started SparkUI at http://Server3:4040 15/03/17 21:27:10 INFO spark.SparkContext: Added JAR file:/home/hduser/spark-java2.jar at http://192.168.11.42:48000/jars/spark-java2.jar with timestamp 1426598830307 15/03/17 21:27:10 INFO client.RMProxy: Connecting to ResourceManager at Server3/192.168.11.42:8050 15/03/17 21:27:11 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers 15/03/17 21:27:11 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 15/03/17 21:27:11 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 15/03/17 21:27:11 INFO yarn.Client: Setting up container launch context for our AM 15/03/17 21:27:11 INFO yarn.Client: Preparing resources for our AM container 15/03/17 21:27:12 INFO yarn.Client: Uploading resource file:/home/hduser/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar - hdfs://Server3:9000/user/hduser/.sparkStaging/application_1426595477608_0002/spark-assembly-1.3.0-hadoop2.6.0.jar 15/03/17 21:27:21 INFO yarn.Client: Setting up the launch environment for our
[jira] [Closed] (SPARK-6358) Spark-submit error when using PYSPARK_PYTHON enviromnental variable
[ https://issues.apache.org/jira/browse/SPARK-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dustin davidson closed SPARK-6358. -- Resolution: Invalid Found a work around. The issue was with group permissions on the '/home' directory. Not necessarily a bug for Spark. Spark-submit error when using PYSPARK_PYTHON enviromnental variable --- Key: SPARK-6358 URL: https://issues.apache.org/jira/browse/SPARK-6358 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: Using CDH5.3 with Spark 1.2.0 Reporter: dustin davidson When using spark-submit the PYSPARK_PYTHON setting throws an error. I can run the pyspark repl while setting PYSPARK_PYTHON, but spark-submit does not work. Error received when running spark-submit: 15/03/12 15:25:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hadoop-manager): java.io.IOException: Cannot run program /home/hadmin/testenv/bin/python: error=13, Permission denied at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047) at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: error=13, Permission denied at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.init(UNIXProcess.java:186) at java.lang.ProcessImpl.start(ProcessImpl.java:130) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028) ... 13 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365735#comment-14365735 ] Yin Huai commented on SPARK-6250: - Update: Seems this problem affects struct fields when a field name is using a reserved type keyword as the name. For top level column, it is fine. Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5313) Create simple framework for highlighting changes introduced in a PR
[ https://issues.apache.org/jira/browse/SPARK-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365561#comment-14365561 ] Apache Spark commented on SPARK-5313: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/5072 Create simple framework for highlighting changes introduced in a PR --- Key: SPARK-5313 URL: https://issues.apache.org/jira/browse/SPARK-5313 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Nicholas Chammas Priority: Minor For any given PR, we may want to run a bunch of checks along the following lines: * Show property X of {{master}} * Show the same property X of this PR * Call out any differences on the GitHub page It might be helpful to write a simple function that takes any check -- itself represented as a function -- as input, runs the check on master and the PR, and outputs the diff. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6363) make scala 2.11 default language
[ https://issues.apache.org/jira/browse/SPARK-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365726#comment-14365726 ] Jianshi Huang commented on SPARK-6363: -- My two cents. The only modules not working for 2.11 are thriftserver and kafka-streaming; Kafka only started to support 2.11 from 0.8.2, and Spark's Kafka module depends on 0.8.0. It should just take a change of the Kafka version to make it ready for 2.11. AFAIK, the 0.8.2 client works well with previous 0.8.x servers. And we need to fix the thriftserver build issue. Jianshi make scala 2.11 default language Key: SPARK-6363 URL: https://issues.apache.org/jira/browse/SPARK-6363 Project: Spark Issue Type: Improvement Components: Build Reporter: antonkulaga Priority: Minor Labels: scala Most libraries have already moved to 2.11 and many are starting to drop 2.10 support, so it would be better if the Spark binaries were built with Scala 2.11 by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6204) GenerateProjection's equals should check length equality
[ https://issues.apache.org/jira/browse/SPARK-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-6204. --- Resolution: Won't Fix GenerateProjection's equals should check length equality Key: SPARK-6204 URL: https://issues.apache.org/jira/browse/SPARK-6204 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh GenerateProjection's equals now only checks column equality. It should also check length equality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6392) [SQL] class not found exception thrown when `add jar` is used in spark cli
jeanlyn created SPARK-6392: -- Summary: [SQL] class not found exception thrown when `add jar` is used in spark cli Key: SPARK-6392 URL: https://issues.apache.org/jira/browse/SPARK-6392 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn Priority: Minor When we use the spark cli to add a jar dynamically, we get a *java.lang.ClassNotFoundException* when we use a class from that jar to create a UDF. For example: {noformat} spark-sql> add jar /home/jeanlyn/hello.jar; spark-sql> create temporary function hello as 'hello'; spark-sql> select hello(name) from person; Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassNotFoundException: hello {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5651) Support 'create db.table' in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5651. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4427 [https://github.com/apache/spark/pull/4427] Support 'create db.table' in HiveContext Key: SPARK-5651 URL: https://issues.apache.org/jira/browse/SPARK-5651 Project: Spark Issue Type: Bug Components: SQL Reporter: Yadong Qi Fix For: 1.4.0 Now spark version is only support ```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext. This patch is used to support ```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.
[ https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6366: -- Fix Version/s: (was: 1.3.0) 1.3.1 In Python API, the default save mode for save and saveAsTable should be error instead of append. Key: SPARK-6366 URL: https://issues.apache.org/jira/browse/SPARK-6366 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1 If a user want to append data, he/she should explicitly specify the save mode. Also, in Scala and Java, the default save mode is ErrorIfExists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5404) Statistic of Logical Plan is too aggresive
[ https://issues.apache.org/jira/browse/SPARK-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5404. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4914 [https://github.com/apache/spark/pull/4914] Statistic of Logical Plan is too aggresive -- Key: SPARK-5404 URL: https://issues.apache.org/jira/browse/SPARK-5404 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Fix For: 1.4.0 The statistic number of a logical plan is quite helpful while do optimization like join reordering, however, the default algorithm is too aggressive, which probably lead to a totally wrong join order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6247) Certain self joins cannot be analyzed
[ https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6247. - Resolution: Fixed Fix Version/s: 1.3.1 Issue resolved by pull request 5062 [https://github.com/apache/spark/pull/5062] Certain self joins cannot be analyzed - Key: SPARK-6247 URL: https://issues.apache.org/jira/browse/SPARK-6247 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Michael Armbrust Priority: Blocker Fix For: 1.3.1 When you try the following code {code} val df = (1 to 10) .map(i = (i, i.toDouble, i.toLong, i.toString, i.toString)) .toDF(intCol, doubleCol, longCol, stringCol1, stringCol2) df.registerTempTable(test) sql( |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol) |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1) |GROUP BY x.stringCol2 .stripMargin).explain() {code} The following exception will be thrown. {code} [info] java.util.NoSuchElementException: next on empty iterator [info] at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) [info] at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) [info] at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) [info] at scala.collection.IterableLike$class.head(IterableLike.scala:91) [info] at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47) [info] at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) [info] at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info] at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info] at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info] at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info] at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) [info] at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) [info] at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) [info] at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197) [info] at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) [info] at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) [info] at scala.collection.immutable.List.foldLeft(List.scala:84) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) [info] at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071) [info] at
[jira] [Commented] (SPARK-6392) [SQL] class not found exception thrown when `add jar` is used in spark cli
[ https://issues.apache.org/jira/browse/SPARK-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366506#comment-14366506 ] Apache Spark commented on SPARK-6392: - User 'jeanlyn' has created a pull request for this issue: https://github.com/apache/spark/pull/5079 [SQL] class not found exception thrown when `add jar` is used in spark cli -- Key: SPARK-6392 URL: https://issues.apache.org/jira/browse/SPARK-6392 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn Priority: Minor When we use the spark cli to add a jar dynamically, we get a *java.lang.ClassNotFoundException* when we use a class from that jar to create a UDF. For example: {noformat} spark-sql> add jar /home/jeanlyn/hello.jar; spark-sql> create temporary function hello as 'hello'; spark-sql> select hello(name) from person; Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassNotFoundException: hello {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6393) Extra RPC to the AM during killExecutor invocation
Sandy Ryza created SPARK-6393: - Summary: Extra RPC to the AM during killExecutor invocation Key: SPARK-6393 URL: https://issues.apache.org/jira/browse/SPARK-6393 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.1 Reporter: Sandy Ryza This was introduced by SPARK-6325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366569#comment-14366569 ] haiyang commented on SPARK-6200: Thank you for your comments. The core idea of this implementation is to provide two interfaces, {{Dialect}} and {{DialectManager}}. Every dialect must implement the {{Dialect}} interface if it wants to supply its own syntax. {{DialectManager}} manages all kinds of {{Dialect}}, and there is a default implementation of this interface; we could even open an API that allows users to provide their own {{DialectManager}} if they want to. If you can't agree with this core idea, I will close this. As for the other questions you raise, I think I misunderstood the original design intent, but that is just a {{DefaultDialectManager}} implementation problem; at a high level, in other words, it comes down to which {{DialectManager}} implementation is used. We can modify a little code to achieve the following goals: 1. always parse Spark SQL's own DDL first: a. add a DDLParser field in {{DefaultDialectManager}} b. change its parse method to: {{ddlParser(sqlText, false).getOrElse(currentDialect.parse(sqlText))}} 2. switch dialect through the open API {{SET spark.sql.dialect}}: drop the curDialect field in {{DefaultDialectManager}} and use {{sqlContext.conf.dialect}} to read and switch the dialect. 3. drop the dialect commands to make the code simpler. Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Created a new dialect manager, support a dialect command, add new dialects via SQL statements, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
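In code, the core idea sketched in the comment above would look roughly like the following (names taken from the comment; bodies and constructor shape are only an illustration, not the actual patch):
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// A dialect knows how to turn SQL text into a logical plan.
trait Dialect {
  def parse(sqlText: String): LogicalPlan
}

// The manager picks a dialect and always tries Spark SQL's own DDL parser first.
class DefaultDialectManager(ddlParser: String => Option[LogicalPlan],
                            dialects: Map[String, Dialect],
                            currentDialectName: => String) {
  // re-read the dialect name on every call, so SET spark.sql.dialect switches it
  private def currentDialect: Dialect = dialects(currentDialectName)
  def parse(sqlText: String): LogicalPlan =
    ddlParser(sqlText).getOrElse(currentDialect.parse(sqlText))
}
{code}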
[jira] [Resolved] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.
[ https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6366. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 5053 [https://github.com/apache/spark/pull/5053] In Python API, the default save mode for save and saveAsTable should be error instead of append. Key: SPARK-6366 URL: https://issues.apache.org/jira/browse/SPARK-6366 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 If a user wants to append data, he/she should explicitly specify the save mode. Also, in Scala and Java, the default save mode is ErrorIfExists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6383) Few examples on Dataframe operation give compiler errors
[ https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6383. Resolution: Fixed Fix Version/s: 1.3.0 Few examples on Dataframe operation give compiler errors - Key: SPARK-6383 URL: https://issues.apache.org/jira/browse/SPARK-6383 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial Fix For: 1.3.0 The statements below give compiler errors because a) the select method does not accept mixed String and Column arguments: df.select("name", df("age") + 1).show() // need to convert the String to a Column, and b) the filter should be based on the age column, not the name column: df.filter(df("name") > 21).show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
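For reference, corrected forms of the two statements quoted in the description would look roughly like this (the DataFrame and column names follow the programming-guide example the report refers to):
{code}
// select does not accept a String mixed with a Column; wrap "name" in df(...):
df.select(df("name"), df("age") + 1).show()
// and the filter should test the age column rather than the name column:
df.filter(df("age") > 21).show()
{code}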
[jira] [Updated] (SPARK-6383) Few examples on Dataframe operation give compiler errors
[ https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6383: --- Fix Version/s: (was: 1.3.0) 1.3.1 1.4.0 Few examples on Dataframe operation give compiler errors - Key: SPARK-6383 URL: https://issues.apache.org/jira/browse/SPARK-6383 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial Labels: DataFrame Fix For: 1.4.0, 1.3.1 The statements below give compiler errors because a) the select method does not accept mixed String and Column arguments: df.select("name", df("age") + 1).show() // need to convert the String to a Column, and b) the filter should be based on the age column, not the name column: df.filter(df("name") > 21).show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6383) Few examples on Dataframe operation give compiler errors
[ https://issues.apache.org/jira/browse/SPARK-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6383: --- Labels: DataFrame (was: ) Few examples on Dataframe operation give compiler errors - Key: SPARK-6383 URL: https://issues.apache.org/jira/browse/SPARK-6383 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial Labels: DataFrame Fix For: 1.4.0, 1.3.1 The statements below give compiler errors because a) the select method does not accept mixed String and Column arguments: df.select("name", df("age") + 1).show() // need to convert the String to a Column, and b) the filter should be based on the age column, not the name column: df.filter(df("name") > 21).show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5908) Hive udtf with single alias should be resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5908. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4692 [https://github.com/apache/spark/pull/4692] Hive udtf with single alias should be resolved correctly Key: SPARK-5908 URL: https://issues.apache.org/jira/browse/SPARK-5908 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Fix For: 1.4.0 ResolveUdtfsAlias in hiveUdfs only considers a HiveGenericUdtf with multiple aliases. When only a single alias is used with a HiveGenericUdtf, the alias does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
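As an illustration of the affected pattern (the issue itself gives no concrete query, so the table and column names below are assumed), a Hive UDTF projected with a single alias is the shape of query whose alias was previously lost:
{code}
// One generator, one output column, one alias -- the single-alias case
// that ResolveUdtfsAlias is said to have skipped before this fix.
hiveContext.sql("SELECT explode(myArrayColumn) AS item FROM some_table")
{code}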
[jira] [Commented] (SPARK-6356) Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366551#comment-14366551 ] Apache Spark commented on SPARK-6356: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/5080 Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext -- Key: SPARK-6356 URL: https://issues.apache.org/jira/browse/SPARK-6356 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yadong Qi Support for the expressions below: ``` GROUP BY expression list WITH ROLLUP GROUP BY expression list WITH CUBE GROUP BY expression list GROUPING SETS(expression list2) ``` and ``` GROUP BY ROLLUP(expression list) GROUP BY CUBE(expression list) GROUP BY GROUPING SETS(expression list) ``` and ``` GROUPING (expression list) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
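A small usage example of the first form, run through SQLContext; the table and column names are invented for illustration:
{code}
// Hypothetical query: subtotals per (region, product), per region, and a grand total.
sqlContext.sql("""
  SELECT region, product, SUM(amount) AS total
  FROM sales
  GROUP BY region, product WITH ROLLUP
""").show()
{code}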
[jira] [Commented] (SPARK-6330) newParquetRelation gets incorrect FileSystem
[ https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366394#comment-14366394 ] Apache Spark commented on SPARK-6330: - User 'ypcat' has created a pull request for this issue: https://github.com/apache/spark/pull/5039 newParquetRelation gets incorrect FileSystem Key: SPARK-6330 URL: https://issues.apache.org/jira/browse/SPARK-6330 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Fix For: 1.4.0, 1.3.1 Here's a snippet from newParquet.scala: def refresh(): Unit = { val fs = FileSystem.get(sparkContext.hadoopConfiguration) // Support either reading a collection of raw Parquet part-files, or a collection of folders // containing Parquet files (e.g. partitioned Parquet table). val baseStatuses = paths.distinct.map { p => val qualified = fs.makeQualified(new Path(p)) if (!fs.exists(qualified) && maybeSchema.isDefined) { fs.mkdirs(qualified) prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration) } fs.getFileStatus(qualified) }.toArray If we are running this locally and a path points to S3, fs would be incorrect. A fix is to construct fs for each file separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
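The fix mentioned at the end ("construct fs for each file separately") would amount to resolving the FileSystem from each path instead of once from the default scheme. A sketch of that change, not the actual pull request:
{code}
// Sketch: derive the FileSystem from the path itself, so an s3:// path gets
// an S3 FileSystem even when the default filesystem is local or HDFS.
import org.apache.hadoop.fs.Path

val baseStatuses = paths.distinct.map { p =>
  val path = new Path(p)
  val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
  val qualified = fs.makeQualified(path)
  if (!fs.exists(qualified) && maybeSchema.isDefined) {
    fs.mkdirs(qualified)
    prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration)
  }
  fs.getFileStatus(qualified)
}.toArray
{code}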
[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366237#comment-14366237 ] Apache Spark commented on SPARK-5654: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5077 Integrate SparkR into Apache Spark -- Key: SPARK-5654 URL: https://issues.apache.org/jira/browse/SPARK-5654 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Shivaram Venkataraman The SparkR project [1] provides a light-weight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s goals are similar to PySpark and shares a similar design pattern as described in our meetup talk[2], Spark Summit presentation[3]. Integrating SparkR into the Apache project will enable R users to use Spark out of the box and given R’s large user base, it will help the Spark project reach more users. Additionally, work in progress features like providing R integration with ML Pipelines and Dataframes can be better achieved by development in a unified code base. SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR’s developers come from many organizations including UC Berkeley, Alteryx, Intel and we will support future development, maintenance after the integration. [1] https://github.com/amplab-extras/SparkR-pkg [2] http://files.meetup.com/3138542/SparkR-meetup.pdf [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366269#comment-14366269 ] Apache Spark commented on SPARK-6250: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5078 Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366284#comment-14366284 ] Yin Huai commented on SPARK-6250: - [~nitay] Can you try https://github.com/apache/spark/pull/5078 and see if it works for your case? Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6248) LocalRelation needs to implement statistics
[ https://issues.apache.org/jira/browse/SPARK-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6248: Target Version/s: 1.3.1 (was: 1.4.0) LocalRelation needs to implement statistics --- Key: SPARK-6248 URL: https://issues.apache.org/jira/browse/SPARK-6248 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai We need to implement statistics for LocalRelation. Otherwise, we cannot join a LocalRelation with other tables. The following exception will be thrown. {code} java.lang.UnsupportedOperationException: LeafNode LocalRelation must implement statistics. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
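A minimal sketch of what the missing piece might look like inside LocalRelation, assuming Catalyst's Statistics(sizeInBytes) API; the sizing formula is an illustration, not necessarily the merged fix:
{code}
// Estimate the relation's size from its schema and row count so the planner
// can consider it for joins (e.g. broadcast) instead of throwing.
override lazy val statistics: Statistics =
  Statistics(sizeInBytes = output.map(_.dataType.defaultSize).sum * data.length)
{code}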
[jira] [Commented] (SPARK-6248) LocalRelation needs to implement statistics
[ https://issues.apache.org/jira/browse/SPARK-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366333#comment-14366333 ] Yin Huai commented on SPARK-6248: - https://github.com/apache/spark/pull/5062 will also fix it. LocalRelation needs to implement statistics --- Key: SPARK-6248 URL: https://issues.apache.org/jira/browse/SPARK-6248 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai We need to implement statistics for LocalRelation. Otherwise, we cannot join a LocalRelation with other tables. The following exception will be thrown. {code} java.lang.UnsupportedOperationException: LeafNode LocalRelation must implement statistics. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366295#comment-14366295 ] Olivier Toupin commented on SPARK-6154: --- I made an imperfect fix for hive-thriftserver. https://github.com/oliviertoupin/spark/compare/v1.3.0...oliviertoupin:olivier-sql-cli-2.11 To test: download 1.3.0, run dev/change-version-to-2.11.sh, then build with mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package This basically shades jline for the Spark REPL, and the Hive jline version for the SQL CLI, to avoid the conflict. For me everything seems to work fine, but feel free to test further. One problem: when running spark-shell I get this error: Failed to created JLineReader: java.lang.NoClassDefFoundError: jline/console/completer/Completer Falling back to SimpleReader. This prevents autocompletion from working. So it might not be a mergeable patch for now, but it might be helpful for people that, like us, want to run 2.11 and spark-sql. You lose autocompletion (you were using it?), but both components work with 2.11. Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6248) LocalRelation needs to implement statistics
[ https://issues.apache.org/jira/browse/SPARK-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6248: Description: We need to implement statistics for LocalRelation. Otherwise, we cannot join a LocalRelation with other tables. The following exception will be thrown. {code} java.lang.UnsupportedOperationException: LeafNode LocalRelation must implement statistics. {code} was: We need to implement statistics for LocalRelation. Otherwise, we cannot join a LocalRelation with other tables. The following exception will be thrown. {code} java.lang.UnsupportedOperationException: LeafNode LocalRelation must implement statistics. {code] LocalRelation needs to implement statistics --- Key: SPARK-6248 URL: https://issues.apache.org/jira/browse/SPARK-6248 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai We need to implement statistics for LocalRelation. Otherwise, we cannot join a LocalRelation with other tables. The following exception will be thrown. {code} java.lang.UnsupportedOperationException: LeafNode LocalRelation must implement statistics. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6391) Update Tachyon version compatibility documentation
Calvin Jia created SPARK-6391: - Summary: Update Tachyon version compatibility documentation Key: SPARK-6391 URL: https://issues.apache.org/jira/browse/SPARK-6391 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: Calvin Jia Tachyon v0.6 has an api change in the client, it would be helpful to document the Tachyon-Spark compatibility across versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation
[ https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyuan Li updated SPARK-6391: -- Component/s: (was: Spark Core) Documentation Update Tachyon version compatibility documentation -- Key: SPARK-6391 URL: https://issues.apache.org/jira/browse/SPARK-6391 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Calvin Jia Tachyon v0.6 has an api change in the client, it would be helpful to document the Tachyon-Spark compatibility across versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation
[ https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyuan Li updated SPARK-6391: -- Fix Version/s: 1.4.0 Update Tachyon version compatibility documentation -- Key: SPARK-6391 URL: https://issues.apache.org/jira/browse/SPARK-6391 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.0 Reporter: Calvin Jia Fix For: 1.4.0 Tachyon v0.6 has an api change in the client, it would be helpful to document the Tachyon-Spark compatibility across versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation
[ https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyuan Li updated SPARK-6391: -- Affects Version/s: 1.3.0 Update Tachyon version compatibility documentation -- Key: SPARK-6391 URL: https://issues.apache.org/jira/browse/SPARK-6391 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.0 Reporter: Calvin Jia Fix For: 1.4.0 Tachyon v0.6 has an api change in the client, it would be helpful to document the Tachyon-Spark compatibility across versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS
[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366021#comment-14366021 ] Xiangrui Meng commented on SPARK-6334: -- A couple of suggestions before SPARK-5955 is implemented: 1. Upgrade to Spark 1.3. ALS received a new implementation in 1.3 that reduces the shuffle size. 2. Use a smaller number of blocks, even if you have more CPU cores. There is a trade-off between communication and computation. With k = 50, I think communication still dominates. 3. Minor: build Spark with -Pnetlib-lgpl to include native BLAS/LAPACK libraries. spark-local dir not getting cleared during ALS -- Key: SPARK-6334 URL: https://issues.apache.org/jira/browse/SPARK-6334 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Antony Mayi Attachments: als-diskusage.png When running a bigger ALS training job, Spark spills loads of temp data into the local dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from CDH 5.3.2), eventually causing the disks of all nodes to run out of space (in my case I have 12TB of available disk capacity before kicking off the ALS, but it all gets used, and YARN kills the containers when they reach 90%). Even with all recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared. Here is my (pseudo)code (pyspark): {code} sc.setCheckpointDir('/tmp') training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK) model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40) sc._jvm.System.gc() {code} The training RDD has about 3.5 billion items (~60GB on disk). After about 6 hours the ALS consumes all 12TB of disk space in the local dir and gets killed. My cluster has 192 cores and 1.5TB RAM, and for this task I am using 37 executors with 4 cores/28+4GB RAM each. This is a graph of the disk consumption pattern, showing the space being eaten from 7% to 90% during the ALS run (90% is when YARN kills the container): !als-diskusage.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
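Restating the pyspark job above in Scala with suggestion 2 applied, i.e. a checkpoint directory plus an explicit, smaller block count; the block count of 96 and the paths are placeholders, not values recommended in this thread:
{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("/tmp/checkpoints")
val training = sc.objectFile[Rating]("/tmp/dataset")
  .repartition(768)
  .persist(StorageLevel.MEMORY_AND_DISK)
// rank = 50, iterations = 15, lambda = 0.1, blocks = 96 (placeholder), alpha = 40
val model = ALS.trainImplicit(training, 50, 15, 0.1, 96, 40.0)
{code}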
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365581#comment-14365581 ] Reza Zadeh commented on SPARK-4590: --- Hi Qiping, We are waiting for IndexedRDD to be optimized and merged into Spark master. Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the currently implementation of GLM solvers, we save intermediate models on the driver node and update it through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE
[ https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves closed SPARK-3351. Resolution: Invalid Yarn YarnRMClientImpl.shutdown can be called before register - NPE -- Key: SPARK-3351 URL: https://issues.apache.org/jira/browse/SPARK-3351 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves If the SparkContext exits while its in the applicationmaster.waitForSparkContextInitialized then the YarnRMClientImpl.shutdown can be called before register and you get a null pointer exception on the uihistoryAddress. 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.) Exception in thread main java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121) at org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73) at org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE
[ https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365607#comment-14365607 ] Thomas Graves commented on SPARK-3351: -- ok lets close this and reopen if we see it again. Yarn YarnRMClientImpl.shutdown can be called before register - NPE -- Key: SPARK-3351 URL: https://issues.apache.org/jira/browse/SPARK-3351 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves If the SparkContext exits while its in the applicationmaster.waitForSparkContextInitialized then the YarnRMClientImpl.shutdown can be called before register and you get a null pointer exception on the uihistoryAddress. 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.) Exception in thread main java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121) at org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73) at org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6336) LBFGS should document what convergenceTol means
[ https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6336: - Fix Version/s: (was: 1.3.0) 1.3.1 1.4.0 LBFGS should document what convergenceTol means --- Key: SPARK-6336 URL: https://issues.apache.org/jira/browse/SPARK-6336 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0, 1.3.1 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance. We should document that convergenceTol is relative and ensure in a unit test that this behavior does not change in Breeze without us realizing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
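For context, the tolerance being documented here is the one set on MLlib's LBFGS optimizer; a brief usage sketch with arbitrary values:
{code}
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

// convergenceTol is passed through to Breeze's LBFGS and is a *relative* tolerance.
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
  .setConvergenceTol(1e-4)
  .setNumIterations(100)
{code}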
[jira] [Resolved] (SPARK-6336) LBFGS should document what convergenceTol means
[ https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6336. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 5033 [https://github.com/apache/spark/pull/5033] LBFGS should document what convergenceTol means --- Key: SPARK-6336 URL: https://issues.apache.org/jira/browse/SPARK-6336 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Trivial Fix For: 1.3.0 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance. We should document that convergenceTol is relative and ensure in a unit test that this behavior does not change in Breeze without us realizing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6336) LBFGS should document what convergenceTol means
[ https://issues.apache.org/jira/browse/SPARK-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6336: - Assignee: Kai Sasaki LBFGS should document what convergenceTol means --- Key: SPARK-6336 URL: https://issues.apache.org/jira/browse/SPARK-6336 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki Priority: Trivial Fix For: 1.4.0, 1.3.1 LBFGS uses Breeze's LBFGS, which uses relative convergence tolerance. We should document that convergenceTol is relative and ensure in a unit test that this behavior does not change in Breeze without us realizing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4416) Support Mesos framework authentication
[ https://issues.apache.org/jira/browse/SPARK-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366011#comment-14366011 ] Adam B commented on SPARK-4416: --- Can close as duplicate of SPARK-6284 Support Mesos framework authentication -- Key: SPARK-4416 URL: https://issues.apache.org/jira/browse/SPARK-4416 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Support Mesos framework authentication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5955) Add checkpointInterval to ALS
[ https://issues.apache.org/jira/browse/SPARK-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-5955: Assignee: Xiangrui Meng Add checkpointInterval to ALS - Key: SPARK-5955 URL: https://issues.apache.org/jira/browse/SPARK-5955 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should add checkpoint interval to ALS to prevent the following: 1. storing large shuffle files 2. stack overflow (SPARK-1106) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365895#comment-14365895 ] Apache Spark commented on SPARK-5750: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5074 Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
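A compact version of the workaround described in the last paragraph, sorting each partition so that first() becomes reproducible (the sort relies on the element type having an Ordering; here plain text lines):
{code}
// The *set* of elements per partition is deterministic, so sorting within each
// partition yields a stable first() across runs.
val airlines = sc.textFile("...").repartition(2000)
val stableFirst = airlines
  .mapPartitions(iter => iter.toSeq.sorted.iterator, preservesPartitioning = true)
  .first()
{code}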
[jira] [Resolved] (SPARK-6226) Support model save/load in Python's KMeans
[ https://issues.apache.org/jira/browse/SPARK-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6226. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5049 [https://github.com/apache/spark/pull/5049] Support model save/load in Python's KMeans -- Key: SPARK-6226 URL: https://issues.apache.org/jira/browse/SPARK-6226 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365896#comment-14365896 ] Apache Spark commented on SPARK-3441: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5074 Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle --- Key: SPARK-3441 URL: https://issues.apache.org/jira/browse/SPARK-3441 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Patrick Wendell Assignee: Sandy Ryza I think it would be good to say something like this in the doc for repartitionAndSortWithinPartitions and add also maybe in the doc for groupBy: {code} This can be used to enact a Hadoop Style shuffle along with a call to mapPartitions, e.g.: rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...) {code} It might also be nice to add a version that doesn't take a partitioner and/or to mention this in the groupBy javadoc. I guess it depends a bit whether we consider this to be an API we want people to use more widely or whether we just consider it a narrow stable API mostly for Hive-on-Spark. If we want people to consider this API when porting workloads from Hadoop, then it might be worth documenting better. What do you think [~rxin] and [~matei]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
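A slightly fuller form of the snippet in the description, with the partitioner spelled out; the key/value types and partition count are illustrative:
{code}
import org.apache.spark.HashPartitioner

// Hadoop-style shuffle: partition by key, sort by key within each partition,
// then stream each sorted partition through mapPartitions.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val part = new HashPartitioner(4)
val formatted = pairs
  .repartitionAndSortWithinPartitions(part)
  .mapPartitions(iter => iter.map { case (k, v) => s"$k=$v" })
  .collect()
{code}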