[jira] [Assigned] (SPARK-15931) SparkR tests failing on R 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15931: Assignee: Apache Spark > SparkR tests failing on R 3.3.0 > --- > > Key: SPARK-15931 > URL: https://issues.apache.org/jira/browse/SPARK-15931 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > Environment: > # Spark master Git revision: > [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788] > # R version: 3.3.0 > To reproduce this, just build Spark with {{-Psparkr}} and run the tests. > Relevant log lines: > {noformat} > ... > Failed > - > 1. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 2. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15931) SparkR tests failing on R 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15931: Assignee: (was: Apache Spark) > SparkR tests failing on R 3.3.0 > --- > > Key: SPARK-15931 > URL: https://issues.apache.org/jira/browse/SPARK-15931 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Environment: > # Spark master Git revision: > [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788] > # R version: 3.3.0 > To reproduce this, just build Spark with {{-Psparkr}} and run the tests. > Relevant log lines: > {noformat} > ... > Failed > - > 1. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 2. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328984#comment-15328984 ] Apache Spark commented on SPARK-15931: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/13636 > SparkR tests failing on R 3.3.0 > --- > > Key: SPARK-15931 > URL: https://issues.apache.org/jira/browse/SPARK-15931 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Environment: > # Spark master Git revision: > [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788] > # R version: 3.3.0 > To reproduce this, just build Spark with {{-Psparkr}} and run the tests. > Relevant log lines: > {noformat} > ... > Failed > - > 1. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 2. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15937) Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes very small time (~ < 10 seconds) to finish
Subroto Sanyal created SPARK-15937: -- Summary: Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes very small time (~ < 10 seconds) to finish Key: SPARK-15937 URL: https://issues.apache.org/jira/browse/SPARK-15937 Project: Spark Issue Type: Bug Affects Versions: 1.6.1 Reporter: Subroto Sanyal h5. Problem: Spark Job fails in yarn-cluster mode if the job takes less time than 10 seconds. The job execution here is successful but, spark framework declares it failed. {noformat} 16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT] 16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1465791692084_0078_01 16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto 16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto 16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subroto); users with modify permissions: Set(subroto) 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry on cluster... 16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of US/Eastern 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.big-decimal.precision=32 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.default-timezone=US/Eastern 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.security.conductor.properties.keysLocation=etc/securePropertiesKeys 16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from US/Eastern to US/Eastern 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information --- 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation) 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_01/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_01 -XX:MaxPermSize=256m 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j: 'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties' (default classpath) 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before Plugin Registry start-up: 847.5 MB 16/06/13 10:50:34 INFO job.PluginRegistryImpl: - 16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task configuration... 
16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop task to DEBUG: 16/06/13 10:50:35 INFO cluster.JobProcessor: Processing JobInput{_jobName=Import job (76): BookorderHS2ImportJob_SparkCluster#import(Identity)} 16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login 16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827. 16/06/13 10:50:35 DEBUG hdfs.DFSClient: /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: masked=rw-r--r-- 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #2 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #2 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms 16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, chunkSize=516, chunksPerPacket=127, packetSize=65532 16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new packet seqno=0, src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0 16/06/13 10:50:35 DEBUG hdfs.DF
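For anyone who wants to reproduce SPARK-15937 without the proprietary job shown in the (truncated) log above, a minimal quick-finishing application should be enough. The sketch below is only an assumption about how to trigger it: the object name and jar name are made up, and the only requirement taken from the report is that the whole job completes in well under 10 seconds when submitted in yarn-cluster mode.

{noformat}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical minimal app that finishes in a couple of seconds.
object QuickJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("QuickJob"))
    // A trivial action so at least one job actually runs on the executors.
    val sum = sc.parallelize(1 to 100, 2).map(_ * 2).reduce(_ + _)
    println(s"sum = $sum")
    sc.stop()  // stop the context explicitly before the driver exits
  }
}
{noformat}

Submitting it with something like {{spark-submit --master yarn --deploy-mode cluster --class QuickJob quick-job.jar}} and checking whether YARN reports the application as FAILED despite the successful run should show whether the problem is reproducible in a given environment.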
[jira] [Commented] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328949#comment-15328949 ] Dongjoon Hyun commented on SPARK-15908: --- Hi, [~sunrui]. I did SPARK-15807. If you haven't started yet, may I take this one too? > Add varargs-type dropDuplicates() function in SparkR > > > Key: SPARK-15908 > URL: https://issues.apache.org/jira/browse/SPARK-15908 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > This is for API parity with the Scala API. Refer to > https://issues.apache.org/jira/browse/SPARK-15807 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
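For reference, the Scala side already has a Seq-based overload and (per SPARK-15807) a varargs-style overload of dropDuplicates, which is the shape the SparkR change would mirror. The sketch below targets the 2.0 Scala Dataset API (creating its own SparkSession), not SparkR itself:

{noformat}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dropDuplicatesDemo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", 10), (1, "a", 20), (2, "b", 30)).toDF("id", "key", "value")

// Seq-based overload
val d1 = df.dropDuplicates(Seq("id", "key"))
// Varargs-style overload -- the form the SparkR API would mirror
val d2 = df.dropDuplicates("id", "key")

d1.show()
d2.show()
{noformat}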
[jira] [Commented] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328941#comment-15328941 ] Manoj Kumar commented on SPARK-14351: - I can try working on this. > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
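This is not Spark's actual ImpurityAggregator code, but a sketch of the general pattern the "unnecessary Array copies" remark points at: a stats calculator can read through an offset into the shared flat statistics array instead of copying a slice on every node/feature/bin lookup. All names below (StatsView, calculatorWithCopy, calculatorWithView) are invented for illustration.

{noformat}
// Copying version: allocates a fresh Array of statsSize doubles on every call.
def calculatorWithCopy(allStats: Array[Double], offset: Int, statsSize: Int): Array[Double] =
  allStats.slice(offset, offset + statsSize)

// View version: no per-call allocation beyond a small wrapper object.
final class StatsView(allStats: Array[Double], offset: Int, val statsSize: Int) {
  def apply(i: Int): Double = allStats(offset + i)
  def total: Double = {
    var sum = 0.0
    var i = 0
    while (i < statsSize) { sum += apply(i); i += 1 }
    sum
  }
}

def calculatorWithView(allStats: Array[Double], offset: Int, statsSize: Int): StatsView =
  new StatsView(allStats, offset, statsSize)
{noformat}

Whether this is where the time actually goes is exactly what the profiling step in the ticket is meant to confirm.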
[jira] [Resolved] (SPARK-15932) document the contract of encoder serializer expressions
[ https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15932. - Resolution: Fixed Fix Version/s: 2.0.0 > document the contract of encoder serializer expressions > --- > > Key: SPARK-15932 > URL: https://issues.apache.org/jira/browse/SPARK-15932 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328939#comment-15328939 ] Manoj Kumar edited comment on SPARK-3155 at 6/14/16 5:01 AM: - 1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth. It helps in improving interpretability while also improving on generalization performance. 3. It is intuitive to prune the tree during training (i.e stop training after the validation error increases) . However this is very similar to just having a stopping criterion such as maximum depth, minimum samples in each node (except that the stopping criteria is dependent on validation data) And is quite uncommon to do it. The standard practise (at least according to my lectures) is to train the tree to full depth and remove the leaves according to validation data. However, if you feel that #14351 is more important, I can focus on that. was (Author: mechcoder): 1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth. It helps in improving interpretability while also improving on generalization performance. 3. It is intuitive to prune the tree during training (i.e stop training after the validation error increases) . However this is very similar to just having a stopping criterion such as maximum depth, minimum samples in each node (except that the stopping criteria is dependent on validation data) And is quite uncommon to do it. The standard practise (at least according to my lectures) is to train the train to full depth and remove the leaves according to validation data. However, if you feel that #14351 is more important, I can focus on that. > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328939#comment-15328939 ] Manoj Kumar commented on SPARK-3155: 1. I agree that the use cases are limited to single trees. You kind of lose interpretability if you train the tree to maximum depth. It helps in improving interpretability while also improving on generalization performance. 3. It is intuitive to prune the tree during training (i.e stop training after the validation error increases) . However this is very similar to just having a stopping criterion such as maximum depth, minimum samples in each node (except that the stopping criteria is dependent on validation data) And is quite uncommon to do it. The standard practise (at least according to my lectures) is to train the train to full depth and remove the leaves according to validation data. However, if you feel that #14351 is more important, I can focus on that. > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328894#comment-15328894 ] Sean McKibben commented on SPARK-12177: --- Unfortunately I can't contribute what I would like to, but I wholeheartedly agree with Cody that waiting for 0.11 doesn't make sense. Kafka is a lynchpin of many production scenarios and skipping 0.9 was bad enough for Spark. Waiting longer will make Spark overall less competitive in the streaming/fast data landscape, and the community will have much harder choices between Kafka Streams, Akka Streams, and Spark if another version of Kafka is omitted. > Update KafkaDStreams to new Kafka 0.10 Consumer API > --- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API. I made separate > classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for better backward compatibility. Users will not need > to change their old Spark applications when they upgrade to a new Spark version. > Please review my changes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
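For context, the "new consumer API" in question is org.apache.kafka.clients.consumer.KafkaConsumer, introduced in Kafka 0.9 and carried forward in 0.10. A minimal stand-alone sketch of it in Scala (independent of any Spark integration; broker, group, and topic names are placeholders) looks roughly like this:

{noformat}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")   // placeholder
props.put("group.id", "example-group")           // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("example-topic"))  // placeholder topic

try {
  val records = consumer.poll(1000L)   // returns ConsumerRecords
  for (r <- records.asScala) {
    println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}")
  }
} finally {
  consumer.close()
}
{noformat}

The work tracked here is about wrapping this consumer in the DStream integration, replacing the code built on the older consumer APIs.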
[jira] [Updated] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14351: -- Priority: Major (was: Minor) > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10835) Change Output of NGram to Array(String, True)
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328883#comment-15328883 ] Hansa Nanayakkara commented on SPARK-10835: --- Although the problem is solved for the Tokenizer, it persists in the NGram class > Change Output of NGram to Array(String, True) > - > > Key: SPARK-10835 > URL: https://issues.apache.org/jira/browse/SPARK-10835 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Sumit Chawla >Assignee: yuhao yang >Priority: Minor > > Currently the output type of NGram is Array(String, false), which is not > compatible with LDA since its input type is Array(String, true). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
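A quick way to see the nullability flag being discussed is to inspect the NGram transformer's output schema directly; the sketch below creates its own SparkSession and simply pattern-matches on the ArrayType's containsNull field.

{noformat}
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StringType}

val spark = SparkSession.builder().appName("ngramSchema").getOrCreate()
import spark.implicits._

val df = Seq((0, Seq("a", "b", "c", "d"))).toDF("id", "tokens")

val ngram = new NGram().setN(2).setInputCol("tokens").setOutputCol("ngrams")
val out = ngram.transform(df)

// Is the output column ArrayType(StringType, containsNull = true) or false?
out.schema("ngrams").dataType match {
  case ArrayType(StringType, containsNull) => println(s"containsNull = $containsNull")
  case other                               => println(s"unexpected type: $other")
}
{noformat}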
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328880#comment-15328880 ] Joseph K. Bradley commented on SPARK-3155: -- A few thoughts: (1) I'm less sure about the priority of this task now. I've had a hard time identifying use cases. Few people train single trees. For forests, people generally want to overfit each tree a bit, not prune. For boosting, people generally use shallow trees so that there is no need for pruning. It would be useful to identify real use cases before we implement this feature. (2) I agree the 2 args are validation data + error tolerance. (3) Will most users want to prune during or after training? * During training: More efficient * After training: Allows multiple prunings using different error tolerances I'd say that [SPARK-14351] is the highest priority single-tree improvement I know of right now. > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
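To make the naive procedure quoted above concrete, here is a hedged sketch of reduced-error pruning over a toy tree. The Node class, split function, and 0/1 loss are invented for illustration (internal nodes are assumed to have both children); this is not Spark's internal tree representation.

{noformat}
// Toy structures for illustration only -- not Spark's Node/Split classes.
case class Node(prediction: Double,
                var left: Option[Node] = None,
                var right: Option[Node] = None) {
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}

// Hypothetical split and loss, standing in for the real split logic and impurity.
def goLeft(features: Array[Double]): Boolean = features(0) < 0.5
def loss(prediction: Double, label: Double): Double = if (prediction == label) 0.0 else 1.0

def predict(node: Node, features: Array[Double]): Double =
  if (node.isLeaf) node.prediction
  else if (goLeft(features)) predict(node.left.get, features)
  else predict(node.right.get, features)

/** Steps (2)-(4) above: prune the children bottom-up, then collapse a subtree to
  * its root whenever that does not increase error on the held-out validation set. */
def prune(node: Node, validation: Seq[(Array[Double], Double)]): Node = {
  if (node.isLeaf || validation.isEmpty) return node
  val (leftData, rightData) = validation.partition { case (x, _) => goLeft(x) }
  node.left = node.left.map(prune(_, leftData))
  node.right = node.right.map(prune(_, rightData))

  val subtreeError = validation.map { case (x, y) => loss(predict(node, x), y) }.sum
  val collapsedError = validation.map { case (_, y) => loss(node.prediction, y) }.sum
  if (collapsedError <= subtreeError) Node(node.prediction) else node
}
{noformat}

The "smarter" in-training variant in the description would interleave the same error bookkeeping with tree growth instead of running it as a separate pass.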
[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
[ https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15364: -- Assignee: Liang-Chi Hsieh > Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python > --- > > Key: SPARK-15364 > URL: https://issues.apache.org/jira/browse/SPARK-15364 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > Now picklers for both new and old vectors are implemented under > PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement > them under `spark.ml.python` instead. I set the target to 2.1 since those are > private APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
[ https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15364: -- Target Version/s: 2.0.0 (was: 2.1.0) > Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python > --- > > Key: SPARK-15364 > URL: https://issues.apache.org/jira/browse/SPARK-15364 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > Fix For: 2.0.0 > > > Now picklers for both new and old vectors are implemented under > PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement > them under `spark.ml.python` instead. I set the target to 2.1 since those are > private APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
[ https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15364. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13219 [https://github.com/apache/spark/pull/13219] > Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python > --- > > Key: SPARK-15364 > URL: https://issues.apache.org/jira/browse/SPARK-15364 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > Fix For: 2.0.0 > > > Now picklers for both new and old vectors are implemented under > PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement > them under `spark.ml.python` instead. I set the target to 2.1 since those are > private APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file
[ https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328860#comment-15328860 ] marymwu commented on SPARK-15757: - Any update? > Error occurs when using Spark sql "select" statement on orc file after hive > sql "insert overwrite tb1 select * from sourcTb" has been executed on this > orc file > --- > > Key: SPARK-15757 > URL: https://issues.apache.org/jira/browse/SPARK-15757 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > Attachments: Result.png > > > Error occurs when using Spark sql "select" statement on orc file after hive > sql "insert overwrite tb1 select * from sourcTb" has been executed > 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory; > Error: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in > stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com): > java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist. > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at > org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:251) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:361) > at > org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:123) > at > org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:112) > at > org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:278) > at > org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:262) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:240) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: (state=,code=0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To u
[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order
[ https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328851#comment-15328851 ] Hyukjin Kwon commented on SPARK-15918: -- Actually, I met this case before and was thinking it might be an issue. However, I realised it seems actually not after executing the equelvant quries in other several DBMS and checking some documentations such as https://msdn.microsoft.com/en-us/library/ms180026.aspx and http://www.w3schools.com/sql/sql_union.asp that say {quote} the columns in each SELECT statement must be in the same order {quote} I haven't read about the official SQL standard though, I am also pretty sure that this is not an issue. > unionAll returns wrong result when two dataframes has schema in different > order > --- > > Key: SPARK-15918 > URL: https://issues.apache.org/jira/browse/SPARK-15918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: CentOS >Reporter: Prabhu Joseph > > On applying unionAll operation between A and B dataframes, they both has same > schema but in different order and hence the result has column value mapping > changed. > Repro: > {code} > A.show() > +---++---+--+--+-++---+--+---+---+-+ > |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---++---+--+--+-++---+--+---+---+-+ > +---++---+--+--+-++---+--+---+---+-+ > B.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+--+---+---+--++ > |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT718.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT703Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUR716A.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT803Z.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT728.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUR806.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > +-+---+--+---+---+--+--+--+---+---+--++ > A = A.unionAll(B) > A.show() > +---+---+--+--+--+-++---+--+---+---+-+ > |tag| year_day| > tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---+---+--+--+--+-++---+--+---+---+-+ > | F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > +---+---+--+--+--+-++---+--+---+---+-+ > {code} > On changing the schema of A according to B and doing unionAll works fine > {code} > C = > A.select("dtype","tag","time","tm_hour","tm_mday","tm_min",”tm_mon”,"tm_sec","tm_yday","tm_year","value","year_day") > A = C.unionAll(B) > A.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > 
time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+
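A small stand-alone version of the workaround already shown at the end of this report - aligning the column order explicitly before the union, since unionAll (like SQL UNION ALL) matches columns by position rather than by name - assuming the 1.6 DataFrame API:

{noformat}
import org.apache.spark.sql.DataFrame

// Reorder b's columns to match a's schema before unionAll.
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
  val bAligned = b.select(a.columns.map(b.col): _*)
  a.unionAll(bAligned)
}

// e.g. val merged = unionByName(A, B)
{noformat}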
[jira] [Comment Edited] (SPARK-15930) Add Row count property to FPGrowth model
[ https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328802#comment-15328802 ] yuhao yang edited comment on SPARK-15930 at 6/14/16 2:46 AM: - That looks reasonable. +1. [~JohnDA] I would wait for one or two days to collect more opinions before sending a patch. You're welcome to send a pull request if you're interested. was (Author: yuhaoyan): That looks reasonable. +1. [~John Aherne] I would wait for one or two days to collect more opinions before sending a patch. You're welcome to send a pull request if you're interested. > Add Row count property to FPGrowth model > > > Key: SPARK-15930 > URL: https://issues.apache.org/jira/browse/SPARK-15930 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: John Aherne >Priority: Minor > Labels: fp-growth, mllib > > Add a row count property to MLlib's FPGrowth model. > When using the model from FPGrowth, a count of the total number of records is > often necessary. > It appears that the function already calculates that value when training the > model, so it would save time not having to do it again outside the model. > Sorry if this is the wrong place for this kind of stuff. I am new to Jira, > Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
[ https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15808: - Assignee: Xiao Li > Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to > Mismatched File Formats > --- > > Key: SPARK-15808 > URL: https://issues.apache.org/jira/browse/SPARK-15808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > Example 1: PARQUET -> CSV > {noformat} > createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") > createDF(10, > 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") > {noformat} > Error we got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): > java.lang.RuntimeException: > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc > is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but > found [79, 82, 67, 23] > {noformat} > Example 2: Json -> CSV > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") > createDF(10, > 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") > {noformat} > No exception, but wrong results: > {noformat} > +++ > | c1| c2| > +++ > |null|null| > |null|null| > |null|null| > |null|null| > | 0|str0| > | 1|str1| > | 2|str2| > | 3|str3| > | 4|str4| > | 5|str5| > | 6|str6| > | 7|str7| > | 8|str8| > | 9|str9| > +++ > {noformat} > Example 3: Json -> Text > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") > createDF(10, > 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") > {noformat} > Error we got: > {noformat} > Text data source supports only a single column, and you have 2 columns. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
[ https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15808. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13546 [https://github.com/apache/spark/pull/13546] > Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to > Mismatched File Formats > --- > > Key: SPARK-15808 > URL: https://issues.apache.org/jira/browse/SPARK-15808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > Example 1: PARQUET -> CSV > {noformat} > createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") > createDF(10, > 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") > {noformat} > Error we got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): > java.lang.RuntimeException: > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc > is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but > found [79, 82, 67, 23] > {noformat} > Example 2: Json -> CSV > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") > createDF(10, > 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") > {noformat} > No exception, but wrong results: > {noformat} > +++ > | c1| c2| > +++ > |null|null| > |null|null| > |null|null| > |null|null| > | 0|str0| > | 1|str1| > | 2|str2| > | 3|str3| > | 4|str4| > | 5|str5| > | 6|str6| > | 7|str7| > | 8|str8| > | 9|str9| > +++ > {noformat} > Example 3: Json -> Text > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") > createDF(10, > 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") > {noformat} > Error we got: > {noformat} > Text data source supports only a single column, and you have 2 columns. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
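The failing pattern is easier to see in a self-contained form. The sketch below assumes Spark 2.0 with Hive support and replaces the report's createDF helper (whose body is not shown) with an inline stand-in; the point is simply that the append must use the same format the table was created with.

{noformat}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("appendFormats").enableHiveSupport().getOrCreate()
import spark.implicits._

// Stand-in for the report's createDF helper.
def createDF(start: Int, end: Int) = (start to end).map(i => (i, s"str$i")).toDF("c1", "c2")

// Table created as Parquet...
createDF(0, 9).write.format("parquet").saveAsTable("appendDemo")

// ...appending as ORC/JSON/text is what triggers the errors or wrong results above:
// createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendDemo")

// Safe: keep the append format identical to the table's original format.
createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendDemo")
{noformat}

The fix referenced above presumably makes Spark detect the mismatch up front rather than relying on callers to do so.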
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328815#comment-15328815 ] Reynold Xin commented on SPARK-15690: - Definitely no serialization/deserialization. > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by > partition id and then writes the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tends to be low. However, > an increasing number of Spark users are using the system to process data on a > single node. When operating on a single node against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fall back to disk > if the data size does not fit in memory. Given that the number of partitions is > usually small (say less than 256), it'd require only a single pass to do the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
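To illustrate the core idea - placing records into partition order without a comparison sort when the partition count is small - here is a toy counting-sort sketch over an in-memory array. It is only a model of the proposal, not Spark's shuffle code, and it ignores spilling, serialization, and the actual radix implementation.

{noformat}
// Toy illustration: one counting pass plus one placement pass, no comparison sort.
def groupByPartition(records: Array[(Int, String)], numPartitions: Int): Array[(Int, String)] = {
  require(numPartitions <= 256, "the proposal targets small partition counts")
  val counts = new Array[Int](numPartitions)
  records.foreach { case (pid, _) => counts(pid) += 1 }

  // Prefix sums give each partition's start offset in the output array.
  val offsets = new Array[Int](numPartitions)
  var i = 1
  while (i < numPartitions) { offsets(i) = offsets(i - 1) + counts(i - 1); i += 1 }

  val out = new Array[(Int, String)](records.length)
  records.foreach { rec =>
    val pid = rec._1
    out(offsets(pid)) = rec
    offsets(pid) += 1
  }
  out
}
{noformat}

Reynold's comment above also implies the real version would operate on binary records in place, with no serialization/deserialization step at all.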
[jira] [Commented] (SPARK-15934) Return binary mode in ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328814#comment-15328814 ] Egor Pahomov commented on SPARK-15934: -- Sure, let me create a pull request tomorrow. I will test that everything works with all the tools I mentioned - Tableau, DataGrip, Squirrel. > Return binary mode in ThriftServer > -- > > Key: SPARK-15934 > URL: https://issues.apache.org/jira/browse/SPARK-15934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). > This was a very irresponsible step, given that binary mode > was the default in 1.6.1 and is turned off in 2.0.0. > Just to describe the magnitude of harm not fixing this bug would do in my > organization: > * Tableau works only through the Thrift Server and only with the binary format. > Tableau would not work with spark-2.0.0 at all! > * I have a bunch of analysts in my organization with configured SQL > clients (DataGrip and Squirrel). I would need to go one by one to change > the connection string for them (DataGrip). Squirrel simply does not work with http - > some jar hell in my case. > * Not to mention all the other things that connect to our data > infrastructure through the ThriftServer as a gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model
[ https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328802#comment-15328802 ] yuhao yang commented on SPARK-15930: That looks reasonable. +1. [~John Aherne] I would wait for one or two days to collect more opinions before sending a patch. You're welcome to send a pull request if you're interested. > Add Row count property to FPGrowth model > > > Key: SPARK-15930 > URL: https://issues.apache.org/jira/browse/SPARK-15930 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: John Aherne >Priority: Minor > Labels: fp-growth, mllib > > Add a row count property to MLlib's FPGrowth model. > When using the model from FPGrowth, a count of the total number of records is > often necessary. > It appears that the function already calculates that value when training the > model, so it would save time not having to do it again outside the model. > Sorry if this is the wrong place for this kind of stuff. I am new to Jira, > Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
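Until such a property exists, a workaround sketch against the 1.6 MLlib API: cache the transaction RDD and take the count alongside training so the extra pass stays cheap. This assumes an existing SparkContext named sc; the input path and parameters are placeholders.

{noformat}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// One Array[String] of items per record; the path is a placeholder.
val transactions: RDD[Array[String]] =
  sc.textFile("hdfs:///path/to/transactions.txt").map(_.trim.split(' ')).cache()

val numRecords = transactions.count()   // the row count this ticket asks the model to expose

val model = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
{noformat}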
[jira] [Comment Edited] (SPARK-15930) Add Row count property to FPGrowth model
[ https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328786#comment-15328786 ] John Aherne edited comment on SPARK-15930 at 6/14/16 2:04 AM: -- The row count would be the number of rows in the dataset that was supplied to the train function. Edited: I put in the wrong answer/ was (Author: johnda): In your example, the row count would be 4. > Add Row count property to FPGrowth model > > > Key: SPARK-15930 > URL: https://issues.apache.org/jira/browse/SPARK-15930 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: John Aherne >Priority: Minor > Labels: fp-growth, mllib > > Add a row count property to MLlib's FPGrowth model. > When using the model from FPGrowth, a count of the total number of records is > often necessary. > It appears that the function already calculates that value when training the > model, so it would save time not having to do it again outside the model. > Sorry if this is the wrong place for this kind of stuff. I am new to Jira, > Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model
[ https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328786#comment-15328786 ] John Aherne commented on SPARK-15930: - In your example, the row count would be 4. > Add Row count property to FPGrowth model > > > Key: SPARK-15930 > URL: https://issues.apache.org/jira/browse/SPARK-15930 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: John Aherne >Priority: Minor > Labels: fp-growth, mllib > > Add a row count property to MLlib's FPGrowth model. > When using the model from FPGrowth, a count of the total number of records is > often necessary. > It appears that the function already calculates that value when training the > model, so it would save time not having to do it again outside the model. > Sorry if this is the wrong place for this kind of stuff. I am new to Jira, > Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15930) Add Row count property to FPGrowth model
[ https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328759#comment-15328759 ] yuhao yang edited comment on SPARK-15930 at 6/14/16 1:30 AM: - || items|| freq|| |[27]|5| |[27, 18]|2| |[27, 18, 12]|2| |[27, 18, 12, 17]|1| Hi [~JohnDA] can you please specify what's the row count you expected with the model above? is it the total number of 5 + 2 + 2 + 4? was (Author: yuhaoyan): || items|| freq|| |[27]|5| |[27, 18]|2| |[27, 18, 12]|2| |[27, 18, 12, 17]|4| Hi [~JohnDA] can you please specify what's the row count you expected with the model above? is it the total number of 5 + 2 + 2 + 4? > Add Row count property to FPGrowth model > > > Key: SPARK-15930 > URL: https://issues.apache.org/jira/browse/SPARK-15930 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: John Aherne >Priority: Minor > Labels: fp-growth, mllib > > Add a row count property to MLlib's FPGrowth model. > When using the model from FPGrowth, a count of the total number of records is > often necessary. > It appears that the function already calculates that value when training the > model, so it would save time not having to do it again outside the model. > Sorry if this is the wrong place for this kind of stuff. I am new to Jira, > Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model
[ https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328759#comment-15328759 ] yuhao yang commented on SPARK-15930: || items|| freq|| |[27]|5| |[27, 18]|2| |[27, 18, 12]|2| |[27, 18, 12, 17]|4| Hi [~JohnDA] can you please specify what's the row count you expected with the model above? is it the total number of 5 + 2 + 2 + 4? > Add Row count property to FPGrowth model > > > Key: SPARK-15930 > URL: https://issues.apache.org/jira/browse/SPARK-15930 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: John Aherne >Priority: Minor > Labels: fp-growth, mllib > > Add a row count property to MLlib's FPGrowth model. > When using the model from FPGrowth, a count of the total number of records is > often necessary. > It appears that the function already calculates that value when training the > model, so it would save time not having to do it again outside the model. > Sorry if this is the wrong place for this kind of stuff. I am new to Jira, > Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15934) Return binary mode in ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328756#comment-15328756 ] Reynold Xin commented on SPARK-15934: - [~epahomov] do you want to create a PR to revert the change? > Return binary mode in ThriftServer > -- > > Key: SPARK-15934 > URL: https://issues.apache.org/jira/browse/SPARK-15934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). > This was a very irresponsible step, given that binary mode > was the default in 1.6.1 and is turned off in 2.0.0. > Just to describe the magnitude of harm not fixing this bug would do in my > organization: > * Tableau works only through the Thrift Server and only with the binary format. > Tableau would not work with spark-2.0.0 at all! > * I have a bunch of analysts in my organization with configured SQL > clients (DataGrip and Squirrel). I would need to go one by one to change > the connection string for them (DataGrip). Squirrel simply does not work with http - > some jar hell in my case. > * Not to mention all the other things that connect to our data > infrastructure through the ThriftServer as a gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15935) Enable test for sql/streaming.py and fix these tests
[ https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15935: Assignee: Apache Spark (was: Shixiong Zhu) > Enable test for sql/streaming.py and fix these tests > > > Key: SPARK-15935 > URL: https://issues.apache.org/jira/browse/SPARK-15935 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Right now tests sql/streaming.py are disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15935) Enable test for sql/streaming.py and fix these tests
[ https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15935: Assignee: Shixiong Zhu (was: Apache Spark) > Enable test for sql/streaming.py and fix these tests > > > Key: SPARK-15935 > URL: https://issues.apache.org/jira/browse/SPARK-15935 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now tests sql/streaming.py are disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15935) Enable test for sql/streaming.py and fix these tests
[ https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328736#comment-15328736 ] Apache Spark commented on SPARK-15935: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/13655 > Enable test for sql/streaming.py and fix these tests > > > Key: SPARK-15935 > URL: https://issues.apache.org/jira/browse/SPARK-15935 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now tests sql/streaming.py are disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15936) CLONE - Add class weights to Random Forest
Yuewei Na created SPARK-15936: - Summary: CLONE - Add class weights to Random Forest Key: SPARK-15936 URL: https://issues.apache.org/jira/browse/SPARK-15936 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.1 Reporter: Yuewei Na Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
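As a point of reference for the request above, class weights are usually derived from class frequencies. The snippet below is only a sketch of that computation over a plain label collection; it is not part of Spark's RandomForest API, which, as the issue notes, does not expose class weights.

{code}
// Sketch: "balanced" class weights, weight(c) = n / (k * count(c)),
// computed from a plain sequence of labels. Illustrative only.
def balancedClassWeights(labels: Seq[Double]): Map[Double, Double] = {
  val n = labels.size.toDouble
  val counts: Map[Double, Double] =
    labels.groupBy(identity).map { case (label, ls) => label -> ls.size.toDouble }
  val k = counts.size.toDouble
  counts.map { case (label, c) => label -> n / (k * c) }
}

// Example: 90 negatives and 10 positives -> the minority class gets 9x the weight.
val labels = Seq.fill(90)(0.0) ++ Seq.fill(10)(1.0)
println(balancedClassWeights(labels))   // Map(0.0 -> ~0.56, 1.0 -> 5.0)
{code}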
[jira] [Commented] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)
[ https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328723#comment-15328723 ] Apache Spark commented on SPARK-15868: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/13654 > Executors table in Executors tab should sort Executor IDs in numerical order > (not alphabetical order) > - > > Key: SPARK-15868 > URL: https://issues.apache.org/jira/browse/SPARK-15868 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-sorting-2.png, > spark-webui-executors-sorting.png > > > It _appears_ that Executors table in Executors tab sorts Executor IDs in > alphabetical order while it should in numerical. It does sorting in a more > "friendly" way yet driver executor appears between 0 and 1? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
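As an illustration of the sorting behavior being requested (not the actual Web UI code), an ordering that puts the driver row first and compares the remaining executor IDs numerically could look like the following sketch.

{code}
import scala.util.Try

// Numeric-aware ordering for executor ID strings such as "driver", "0", "1", ..., "10".
// Purely illustrative; not taken from the Executors page implementation.
def executorIdOrdering: Ordering[String] = Ordering.by { (id: String) =>
  if (id == "driver") (0, 0L, id)                 // driver row first
  else Try(id.toLong).toOption match {
    case Some(n) => (1, n, id)                    // numeric IDs in numeric order
    case None    => (2, 0L, id)                   // anything else, alphabetically last
  }
}

println(Seq("10", "driver", "2", "1").sorted(executorIdOrdering))
// List(driver, 1, 2, 10) -- instead of the alphabetical List(1, 10, 2, driver)
{code}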
[jira] [Assigned] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)
[ https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15868: Assignee: (was: Apache Spark) > Executors table in Executors tab should sort Executor IDs in numerical order > (not alphabetical order) > - > > Key: SPARK-15868 > URL: https://issues.apache.org/jira/browse/SPARK-15868 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-sorting-2.png, > spark-webui-executors-sorting.png > > > It _appears_ that Executors table in Executors tab sorts Executor IDs in > alphabetical order while it should in numerical. It does sorting in a more > "friendly" way yet driver executor appears between 0 and 1? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)
[ https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15868: Assignee: Apache Spark > Executors table in Executors tab should sort Executor IDs in numerical order > (not alphabetical order) > - > > Key: SPARK-15868 > URL: https://issues.apache.org/jira/browse/SPARK-15868 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > Attachments: spark-webui-executors-sorting-2.png, > spark-webui-executors-sorting.png > > > It _appears_ that Executors table in Executors tab sorts Executor IDs in > alphabetical order while it should in numerical. It does sorting in a more > "friendly" way yet driver executor appears between 0 and 1? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder
[ https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15910: Assignee: Sean Owen > Schema is not checked when converting DataFrame to Dataset using Kryo encoder > - > > Key: SPARK-15910 > URL: https://issues.apache.org/jira/browse/SPARK-15910 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong >Assignee: Sean Owen > Fix For: 2.0.0 > > > Here is the case to reproduce it: > {code} > scala> import org.apache.spark.sql.Encoders._ > scala> import org.apache.spark.sql.Encoders > scala> import org.apache.spark.sql.Encoder > scala> case class B(b: Int) > scala> implicit val encoder = Encoders.kryo[B] > encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary] > scala> val ds = Seq((1)).toDF("b").as[B].map(identity) > ds: org.apache.spark.sql.Dataset[B] = [value: binary] > scala> ds.show() > 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 45, Column 168: No applicable constructor/method found for actual parameters > "int"; candidates are: "public static java.nio.ByteBuffer > java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer > java.nio.ByteBuffer.wrap(byte[], int, int)" > ... > {code} > The expected behavior is to report schema check failure earlier when creating > Dataset using {code}dataFrame.as[B]{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder
[ https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15910: Assignee: Sean Zhong (was: Sean Owen) > Schema is not checked when converting DataFrame to Dataset using Kryo encoder > - > > Key: SPARK-15910 > URL: https://issues.apache.org/jira/browse/SPARK-15910 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong >Assignee: Sean Zhong > Fix For: 2.0.0 > > > Here is the case to reproduce it: > {code} > scala> import org.apache.spark.sql.Encoders._ > scala> import org.apache.spark.sql.Encoders > scala> import org.apache.spark.sql.Encoder > scala> case class B(b: Int) > scala> implicit val encoder = Encoders.kryo[B] > encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary] > scala> val ds = Seq((1)).toDF("b").as[B].map(identity) > ds: org.apache.spark.sql.Dataset[B] = [value: binary] > scala> ds.show() > 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 45, Column 168: No applicable constructor/method found for actual parameters > "int"; candidates are: "public static java.nio.ByteBuffer > java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer > java.nio.ByteBuffer.wrap(byte[], int, int)" > ... > {code} > The expected behavior is to report schema check failure earlier when creating > Dataset using {code}dataFrame.as[B]{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15910) Schema is not checked when converting DataFrame to Dataset using Kryo encoder
[ https://issues.apache.org/jira/browse/SPARK-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15910. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13632 [https://github.com/apache/spark/pull/13632] > Schema is not checked when converting DataFrame to Dataset using Kryo encoder > - > > Key: SPARK-15910 > URL: https://issues.apache.org/jira/browse/SPARK-15910 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Fix For: 2.0.0 > > > Here is the case to reproduce it: > {code} > scala> import org.apache.spark.sql.Encoders._ > scala> import org.apache.spark.sql.Encoders > scala> import org.apache.spark.sql.Encoder > scala> case class B(b: Int) > scala> implicit val encoder = Encoders.kryo[B] > encoder: org.apache.spark.sql.Encoder[B] = class[value[0]: binary] > scala> val ds = Seq((1)).toDF("b").as[B].map(identity) > ds: org.apache.spark.sql.Dataset[B] = [value: binary] > scala> ds.show() > 16/06/10 13:46:51 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 45, Column 168: No applicable constructor/method found for actual parameters > "int"; candidates are: "public static java.nio.ByteBuffer > java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer > java.nio.ByteBuffer.wrap(byte[], int, int)" > ... > {code} > The expected behavior is to report schema check failure earlier when creating > Dataset using {code}dataFrame.as[B]{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
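A hedged sketch of the kind of early check the issue asks for: compare the incoming DataFrame's schema with the schema the encoder expects (for a Kryo encoder, a single binary column) before doing any work, instead of failing later during code generation. This is an illustration of the idea only, not the fix that was merged in the pull request above.

{code}
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

// Illustrative helper: fail fast if the DataFrame's schema cannot match
// what the encoder deserializes from.
def asChecked[T](df: DataFrame)(implicit enc: Encoder[T]): Dataset[T] = {
  val expected = enc.schema   // e.g. a single `value: binary` column for Encoders.kryo
  require(df.schema == expected,
    s"Cannot convert schema ${df.schema.simpleString} to ${expected.simpleString}")
  df.as[T]
}
{code}

With a helper of this shape, the repro in the description would fail at asChecked[B](df) with a clear schema message rather than with a CodeGenerator error at show() time.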
[jira] [Created] (SPARK-15935) Enable test for sql/streaming.py and fix these tests
Shixiong Zhu created SPARK-15935: Summary: Enable test for sql/streaming.py and fix these tests Key: SPARK-15935 URL: https://issues.apache.org/jira/browse/SPARK-15935 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now tests sql/streaming.py are disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328717#comment-15328717 ] Shivaram Venkataraman commented on SPARK-15690: --- Yeah I dont think you'll see much improvement from avoiding the DAGScheduler. One more thing to try here is to avoid serialization / deserialization unless you are going to spill to disk. That'll save a lot of time inside a single node. > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then write the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tend to be low. However, > an increasing number of Spark users are using the system to process data on a > single-node. When in a single node operating against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fallback to disk > if the data size does not fit in memory. Given the number of partitions is > usually small (say less than 256), it'd require only a single pass do to the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
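To make the "single pass" remark in the description concrete, here is a minimal counting-sort sketch that groups serialized records by partition id in one pass over the data plus one pass over the (small) partition-count array. It only illustrates the idea; types and names are made up and this is not Spark's shuffle code.

{code}
// Group records by partition id with a counting sort (a 1-byte radix pass is
// enough when numPartitions is small, e.g. < 256). Illustrative types only.
def groupByPartition(records: Array[(Int, Array[Byte])],
                     numPartitions: Int): Array[(Int, Array[Byte])] = {
  // 1) count records per partition
  val starts = new Array[Int](numPartitions + 1)
  records.foreach { case (pid, _) => starts(pid + 1) += 1 }
  // 2) prefix sums give each partition's start offset in the output
  var i = 1
  while (i <= numPartitions) { starts(i) += starts(i - 1); i += 1 }
  // 3) scatter records into their slots
  val out = new Array[(Int, Array[Byte])](records.length)
  val cursor = starts.clone()
  records.foreach { rec => out(cursor(rec._1)) = rec; cursor(rec._1) += 1 }
  out
}
{code}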
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328707#comment-15328707 ] Cody Koeninger commented on SPARK-12177: I don't think waiting for 0.11 makes sense. > Update KafkaDStreams to new Kafka 0.10 Consumer API > --- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So I added the new consumer API, as separate > classes in the package org.apache.spark.streaming.kafka.v09, without removing the old > classes, for better backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15934) Return binary mode in ThriftServer
Egor Pahomov created SPARK-15934: Summary: Return binary mode in ThriftServer Key: SPARK-15934 URL: https://issues.apache.org/jira/browse/SPARK-15934 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Egor Pahomov In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). This was a highly irresponsible step, given that binary mode was the default in 1.6.1 and has now been turned off in 2.0.0. Just to describe the magnitude of harm that not fixing this bug would do in my organization: * Tableau works only through the Thrift Server and only with the binary format. Tableau would not work with spark-2.0.0 at all! * I have a bunch of analysts in my organization with configured SQL clients (DataGrip and Squirrel). I would need to go one by one to change the connection string for them (DataGrip). Squirrel simply does not work with HTTP - some jar hell in my case. * And that is not to mention all the other tools that connect to our data infrastructure through the ThriftServer as a gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328699#comment-15328699 ] Mark Grover commented on SPARK-12177: - Hi Ismael and Cody, My personal opinion was to hold off because a) the new consumer API was still marked as beta, so I wasn't sure of the compatibility guarantees, which Kafka did seem to break a little (as discussed [here|http://mail-archives.apache.org/mod_mbox/kafka-dev/201605.mbox/%3CCAKm=r7v5jgg9qxgjioczdph9vej57m46ngy_626kiq-ovdx...@mail.gmail.com%3E]), and b) the real benefit is security - I am personally a little more biased towards authentication (Kerberos) than encryption, so I was just waiting for delegation tokens to land. Now that 0.10.0 is released, there's a good chance delegation tokens will land in Kafka 0.11.0, and the new consumer API is marked stable, so I am more open to this PR being merged; it's been around for too long anyway. Cody, what do you say? Any reason you'd want to wait? If not, we can make a case for this going in now. As far as the logistics of whether this belongs in Apache Bahir or not - today, I don't have a strong opinion on where Kafka integration should reside. What I do feel strongly about, like Cody said, is that the old consumer API integration and the new consumer API integration should reside in the same place. Since the old integration is in Spark, that's where the new one makes sense. If a vote on Apache Spark results in the Kafka integration being taken out, then having both the new and the old in Apache Bahir would make sense. > Update KafkaDStreams to new Kafka 0.10 Consumer API > --- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So I added the new consumer API, as separate > classes in the package org.apache.spark.streaming.kafka.v09, without removing the old > classes, for better backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
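For context, the "new consumer API" under discussion is the org.apache.kafka.clients.consumer client introduced in Kafka 0.9 and stabilized in 0.10, which replaces the old high-level/SimpleConsumer APIs and carries the security features mentioned above. A minimal standalone usage sketch follows; the broker address, group id and topic name are placeholders, not values from this issue.

{code}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")          // placeholder broker
props.put("group.id", "example-group")                    // placeholder group id
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))    // placeholder topic
val records = consumer.poll(1000)                          // poll for up to 1 second
records.asScala.foreach { r =>
  println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
}
consumer.close()
{code}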
[jira] [Resolved] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable
[ https://issues.apache.org/jira/browse/SPARK-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-15929. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13649 [https://github.com/apache/spark/pull/13649] > DataFrameSuite path globbing error message tests are not fully portable > --- > > Key: SPARK-15929 > URL: https://issues.apache.org/jira/browse/SPARK-15929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > The DataFrameSuite regression tests for SPARK-13774 fail in my environment > because they attempt to glob over all of {{/mnt}} and some of the > subdirectories in there have restrictive permissions which cause the test to > fail. I think we should rewrite this test to not depend existing / OS paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
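A small sketch of the direction suggested in the description: have the test create and glob over its own temporary directory instead of an existing OS path such as /mnt. The directory layout and file names here are made up for illustration and are not the layout used by the merged fix.

{code}
import java.io.File
import java.nio.file.Files

// Create an isolated directory tree the test fully controls...
val base = Files.createTempDirectory("globbing-test").toFile
val part = new File(base, "data/year=2016")
part.mkdirs()
Files.write(new File(part, "part-0000.json").toPath, """{"a": 1}""".getBytes("UTF-8"))

// ...and glob relative to it, so permissions on system paths never matter.
val globPattern = s"${base.getAbsolutePath}/data/*"
// e.g. spark.read.json(globPattern) in the actual test
{code}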
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328648#comment-15328648 ] Shixiong Zhu commented on SPARK-15905: -- The last time I encounter FileOutputStream.writeBytes hangs is because I created a Process in Java but didn't consume its input stream and error stream. Finally, the underlying buffer was full and blocked the Process. > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
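The failure mode described in the previous comment is easy to reproduce outside Spark: if a parent process launches a JVM (for example a driver whose progress bar and logs go to stderr) and never drains the child's output, the OS pipe buffer eventually fills and the child's writes block inside FileOutputStream.writeBytes. A hedged sketch of the fix on the parent side follows; the command line is a placeholder, and the point is only how the output is handled.

{code}
// Placeholder command; what matters is draining the child's output.
val pb = new ProcessBuilder("spark-submit", "--class", "Example", "app.jar")
pb.redirectErrorStream(true)          // fold stderr (progress bar, logs) into stdout
val proc = pb.start()

// Drain the child's output continuously so its writes can never block on a full pipe.
val drainer = new Thread(new Runnable {
  override def run(): Unit = {
    val in = proc.getInputStream
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {}     // discard, or forward to a logger
  }
})
drainer.setDaemon(true)
drainer.start()

val exitCode = proc.waitFor()
{code}

Alternatively, ProcessBuilder.inheritIO() sidesteps the problem entirely by wiring the child directly to the parent's console instead of a pipe.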
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328640#comment-15328640 ] Shixiong Zhu commented on SPARK-15905: -- By the way, how did you use Spark? Did you just run it or call it via some Process APIs? > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328638#comment-15328638 ] Shixiong Zhu commented on SPARK-15905: -- Oh, the thread state is `RUNNABLE`. So not a deadlock. Could you check you disk? Maybe some bad disks cause the hang. > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328627#comment-15328627 ] Shixiong Zhu edited comment on SPARK-15905 at 6/13/16 11:42 PM: Do you have the whole jstack output? I guess some place holds the lock of `System.err` but needs the whole output for all threads to find the place. was (Author: zsxwing): Do you have the whole jstack output? I guess some places holds the lock of `System.err` but needs the whole output for all threads to find the place. > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328627#comment-15328627 ] Shixiong Zhu commented on SPARK-15905: -- Do you have the whole jstack output? I guess some places holds the lock of `System.err` but needs the whole output for all threads to find the place. > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer
[ https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15933: Assignee: Tathagata Das (was: Apache Spark) > Refactor reader-writer interface for streaming DFs to use > DataStreamReader/Writer > - > > Key: SPARK-15933 > URL: https://issues.apache.org/jira/browse/SPARK-15933 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Currently, the DataFrameReader/Writer has methods that are needed for > streaming and non-streaming DFs. This is quite awkward because each method in > them throws a runtime exception for one case or the other. So rather than having > half the methods throw runtime exceptions, it's just better to have a > different reader/writer API for streams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer
[ https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15933: Assignee: Apache Spark (was: Tathagata Das) > Refactor reader-writer interface for streaming DFs to use > DataStreamReader/Writer > - > > Key: SPARK-15933 > URL: https://issues.apache.org/jira/browse/SPARK-15933 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > > Currently, the DataFrameReader/Writer has methods that are needed for > streaming and non-streaming DFs. This is quite awkward because each method in > them throws a runtime exception for one case or the other. So rather than having > half the methods throw runtime exceptions, it's just better to have a > different reader/writer API for streams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer
[ https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328609#comment-15328609 ] Apache Spark commented on SPARK-15933: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/13653 > Refactor reader-writer interface for streaming DFs to use > DataStreamReader/Writer > - > > Key: SPARK-15933 > URL: https://issues.apache.org/jira/browse/SPARK-15933 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Currently, the DataFrameReader/Writer has methods that are needed for > streaming and non-streaming DFs. This is quite awkward because each method in > them throws a runtime exception for one case or the other. So rather than having > half the methods throw runtime exceptions, it's just better to have a > different reader/writer API for streams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer
Tathagata Das created SPARK-15933: - Summary: Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer Key: SPARK-15933 URL: https://issues.apache.org/jira/browse/SPARK-15933 Project: Spark Issue Type: Bug Components: SQL, Streaming Reporter: Tathagata Das Assignee: Tathagata Das Currently, the DataFrameReader/Writer has methods that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them throws a runtime exception for one case or the other. So rather than having half the methods throw runtime exceptions, it's just better to have a different reader/writer API for streams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
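To make the proposed split concrete, the sketch below shows roughly how user code looks once streaming reads and writes go through dedicated DataStreamReader/DataStreamWriter objects, as eventually shipped in Spark 2.0. The source, sink and option values are illustrative placeholders, not part of this issue.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-api-sketch").getOrCreate()

// Streaming read: readStream returns a DataStreamReader, so batch-only methods
// simply do not exist on it (no runtime exceptions needed).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder source
  .option("port", 9999)
  .load()

// Streaming write: DataStreamWriter instead of DataFrameWriter.
val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
{code}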
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328608#comment-15328608 ] Tejas Patil commented on SPARK-15905: - Another instance but this time not via console progress bar. This job has been stuck for 15+ hours. {noformat} "dispatcher-event-loop-23" #60 daemon prio=5 os_prio=0 tid=0x7f981e206000 nid=0x685f8 runnable [0x7f8c0f1ef000] java.lang.Thread.State: RUNNABLE at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:326) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) - locked <0x7f8d48167058> (a java.io.BufferedOutputStream) at java.io.PrintStream.write(PrintStream.java:480) - locked <0x7f8d48167020> (a java.io.PrintStream) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) - locked <0x7f8d48237680> (a java.io.OutputStreamWriter) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:59) at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:324) at org.apache.log4j.WriterAppender.append(WriterAppender.java:162) at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) - locked <0x7f8d48235ee0> (a org.apache.log4j.ConsoleAppender) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) at org.apache.log4j.Category.callAppenders(Category.java:206) - locked <0x7f8d481bf1e8> (a org.apache.log4j.spi.RootLogger) at org.apache.log4j.Category.forcedLog(Category.java:391) at org.apache.log4j.Category.log(Category.java:856) at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:400) at org.apache.spark.Logging$class.logWarning(Logging.scala:70) at org.apache.spark.scheduler.TaskSetManager.logWarning(TaskSetManager.scala:52) at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:721) at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$6.apply(TaskSetManager.scala:813) at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$6.apply(TaskSetManager.scala:807) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:807) at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87) at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87) at 
org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:536) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:474) - locked <0x7f8d5850e1e0> (a org.apache.spark.scheduler.TaskSchedulerImpl) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:263) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:202) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:202) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:202) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) at org.apache.sp
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328597#comment-15328597 ] Shixiong Zhu commented on SPARK-15905: -- [~tejasp] Probably some deadlock in Spark. It would be great if you can provide the full jstack output. > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328592#comment-15328592 ] Manoj Kumar commented on SPARK-3155: I would like to add support for pruning DecisionTrees as part of my internship. Some API-related questions: support for decision tree pruning in R is done this way: prune(fit, cp=) A straightforward extension to start with would be: model.prune(validationData, errorTol=) where model is a fitted DecisionTreeRegressionModel; pruning would stop when the improvement in error is not above a certain tolerance. Does that sound like a good idea? > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyway. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth-K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2)). > (4) For each pair of leaves with the same parent, compare the total error on > the validation set made by the leaves' predictions with the error made by the > parent's predictions. Remove the leaves if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
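To ground the discussion, here is a toy, self-contained version of the naive bottom-up (reduced-error) pruning described in the issue, written against a made-up tree type rather than Spark's internal Node/Split classes. The proposed model.prune(validationData, errorTol=...) API would wrap logic of roughly this shape.

{code}
// Toy regression tree on a single feature; not Spark's representation.
sealed trait Tree { def predict(x: Double): Double }
case class Leaf(prediction: Double) extends Tree {
  def predict(x: Double): Double = prediction
}
case class Split(threshold: Double, left: Tree, right: Tree, prediction: Double) extends Tree {
  def predict(x: Double): Double = if (x <= threshold) left.predict(x) else right.predict(x)
}

// Reduced-error pruning: collapse a split into a leaf whenever the parent's
// own prediction does at least as well on the validation data reaching it.
def prune(tree: Tree, validation: Seq[(Double, Double)]): Tree = tree match {
  case leaf: Leaf => leaf
  case Split(t, l, r, pred) =>
    val pruned = Split(t,
      prune(l, validation.filter(_._1 <= t)),
      prune(r, validation.filter(_._1 > t)),
      pred)
    def sqError(m: Tree): Double =
      validation.map { case (x, y) => val d = y - m.predict(x); d * d }.sum
    if (sqError(Leaf(pred)) <= sqError(pruned)) Leaf(pred) else pruned
}
{code}

An errorTol parameter, as in the proposed API, would simply relax the comparison so a small increase in validation error is still tolerated before keeping a split.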
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328591#comment-15328591 ] Tejas Patil commented on SPARK-15905: - [~zsxwing] : This does not repro consistently but happens one off cases.. that too over different jobs. I have seen this 3-4 times in last week. The type of jobs I was running were pure SQL queries with SELECT, JOINs and GROUP BY. Sorry I cannot share the exact query neither the data. But I am quite positive that this problem would have nothing to do with the query being ran. > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-3155: --- Comment: was deleted (was: I would like to add support for pruning DecisionTrees as part of my internship. Some API related questions: Support for DecisionTree pruning in R is done in this way: prune(fit, cp=) A very straightforward extension would be to start would be to: model.prune(validationData, errorTol=) where model is a fit DecisionTreeRegressionModel would stop pruning when the improvement in error is not above a certain tolerance. Does that sound like a good idea? ) > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15932) document the contract of encoder serializer expressions
[ https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15932: Assignee: Wenchen Fan (was: Apache Spark) > document the contract of encoder serializer expressions > --- > > Key: SPARK-15932 > URL: https://issues.apache.org/jira/browse/SPARK-15932 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15932) document the contract of encoder serializer expressions
[ https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15932: Assignee: Apache Spark (was: Wenchen Fan) > document the contract of encoder serializer expressions > --- > > Key: SPARK-15932 > URL: https://issues.apache.org/jira/browse/SPARK-15932 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15932) document the contract of encoder serializer expressions
[ https://issues.apache.org/jira/browse/SPARK-15932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328586#comment-15328586 ] Apache Spark commented on SPARK-15932: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13648 > document the contract of encoder serializer expressions > --- > > Key: SPARK-15932 > URL: https://issues.apache.org/jira/browse/SPARK-15932 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15914) Add deprecated methods back to SQLContext for source code backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-15914: --- Description: We removed some deprecated methods from SQLContext in branch Spark 2.0. For example: {code} @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") def jsonFile(path: String): DataFrame = { read.json(path) } {code} These deprecated methods may be used by existing third-party data sources. We probably want to add them back to retain source-code-level backward compatibility. was: We removed some deprecated methods from SQLContext in branch Spark 2.0. For example: {code} @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") def jsonFile(path: String): DataFrame = { read.json(path) } {code} These deprecated methods may be used by existing third-party data sources. We probably want to add them back to retain backward compatibility. > Add deprecated methods back to SQLContext for source code backward compatibility > -- > > Key: SPARK-15914 > URL: https://issues.apache.org/jira/browse/SPARK-15914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > We removed some deprecated methods from SQLContext in branch Spark 2.0. > For example: > {code} > @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") > def jsonFile(path: String): DataFrame = { > read.json(path) > } > {code} > These deprecated methods may be used by existing third-party data sources. We > probably want to add them back to retain source-code-level backward > compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15932) document the contract of encoder serializer expressions
Wenchen Fan created SPARK-15932: --- Summary: document the contract of encoder serializer expressions Key: SPARK-15932 URL: https://issues.apache.org/jira/browse/SPARK-15932 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15914) Add deprecated methods back to SQLContext for source code backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong updated SPARK-15914: --- Summary: Add deprecated methods back to SQLContext for source code backward compatibility (was: Add deprecated methods back to SQLContext for backward compatibility) > Add deprecated methods back to SQLContext for source code backward compatibility > -- > > Key: SPARK-15914 > URL: https://issues.apache.org/jira/browse/SPARK-15914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > > We removed some deprecated methods from SQLContext in branch Spark 2.0. > For example: > {code} > @deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0") > def jsonFile(path: String): DataFrame = { > read.json(path) > } > {code} > These deprecated methods may be used by existing third-party data sources. We > probably want to add them back to retain backward compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328487#comment-15328487 ] Manoj Kumar commented on SPARK-3155: I would like to add support for pruning DecisionTrees as part of my internship. Some API-related questions: Support for DecisionTree pruning in R is done in this way: prune(fit, cp=) A very straightforward extension to start with would be: model.prune(validationData, errorTol=), where model is a fitted DecisionTreeRegressionModel; pruning would stop when the improvement in error is not above a certain tolerance. Does that sound like a good idea? > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyway. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leaves with the same parent, compare the total error on > the validation set made by the leaves’ predictions with the error made by the > parent’s predictions. Remove the leaves if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
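As a rough sketch of the naive validation-set pruning in steps (1) through (4) above, using a toy tree type rather than Spark's internal Node or DecisionTreeRegressionModel classes (all type and function names below are made up for illustration):
{code}
// Toy types for illustration only; Spark's real trees use different classes.
sealed trait Tree { def prediction: Double }
case class Leaf(prediction: Double) extends Tree
case class Node(prediction: Double, feature: Int, threshold: Double,
                left: Tree, right: Tree) extends Tree

def predict(tree: Tree, x: Array[Double]): Double = tree match {
  case Leaf(p) => p
  case Node(_, f, t, l, r) => if (x(f) <= t) predict(l, x) else predict(r, x)
}

// Squared error of a (sub)tree on the validation examples routed to it.
def error(tree: Tree, data: Seq[(Array[Double], Double)]): Double =
  data.map { case (x, y) => val d = predict(tree, x) - y; d * d }.sum

// Bottom-up pruning: collapse a node into a leaf (using the node's own
// prediction from step (2)) whenever that leaf does no worse than the pruned
// children on the held-out validation data.
def prune(tree: Tree, validation: Seq[(Array[Double], Double)]): Tree = tree match {
  case leaf: Leaf => leaf
  case Node(pred, f, t, l, r) =>
    val (lData, rData) = validation.partition { case (x, _) => x(f) <= t }
    val prunedNode = Node(pred, f, t, prune(l, lData), prune(r, rData))
    if (error(Leaf(pred), validation) <= error(prunedNode, validation)) Leaf(pred)
    else prunedNode
}
{code}
The proposed model.prune(validationData, errorTol=) API would essentially wrap this recursion, with the tolerance deciding how much validation-error improvement is required to keep a split.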
[jira] [Resolved] (SPARK-15925) Replaces registerTempTable with createOrReplaceTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-15925. --- Resolution: Fixed Issue resolved by pull request 13644 [https://github.com/apache/spark/pull/13644] > Replaces registerTempTable with createOrReplaceTempView in SparkR > - > > Key: SPARK-15925 > URL: https://issues.apache.org/jira/browse/SPARK-15925 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion
[ https://issues.apache.org/jira/browse/SPARK-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328466#comment-15328466 ] Kay Ousterhout commented on SPARK-15176: I thought about this a little more and I think I'm in favor of maxShare instead of maxRunningTasks. The reason is that maxRunningTasks seems brittle to the underlying setup -- if someone configures a certain maximum number of tasks, and then a few machines die, the maximum may no longer be reasonable (e.g., it may become larger than the number of machines in the cluster). The other benefit is symmetry with minShare, as Mark mentioned. [~njw45] why did you choose maxRunningTasks, as opposed to maxShare? Are there other reasons that maxRunningTasks makes more sense? > Job Scheduling Within Application Suffers from Priority Inversion > - > > Key: SPARK-15176 > URL: https://issues.apache.org/jira/browse/SPARK-15176 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1 >Reporter: Nick White > > Say I have two pools, and N cores in my cluster: > * I submit a job to one, which has M >> N tasks > * N of the M tasks are scheduled > * I submit a job to the second pool - but none of its tasks get scheduled > until a task from the other pool finishes! > This can lead to unbounded denial-of-service for the second pool - regardless > of `minShare` or `weight` settings. Ideally Spark would support a pre-emption > mechanism, or an upper bound on a pool's resource usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
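For context on the pool semantics being debated, this is roughly how the two-pool scenario is set up today with only minShare and weight available. The pool names, allocation-file path, and job sizes below are made up; spark.scheduler.mode, spark.scheduler.allocation.file, and the spark.scheduler.pool local property are the existing knobs.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of the scenario in the description, assuming FAIR scheduling
// and an allocation file that defines pools "bulk" and "latency" with
// minShare/weight. A maxShare (or maxRunningTasks) would be an additional
// per-pool upper bound on top of these.
val conf = new SparkConf()
  .setAppName("pool-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // made-up path
val sc = new SparkContext(conf)

// The large job (M >> N tasks) is submitted to the "bulk" pool ...
sc.setLocalProperty("spark.scheduler.pool", "bulk")
sc.parallelize(1 to 1000000, 1000).map(_ * 2).count()

// ... and the small job to the "latency" pool. Run sequentially here for
// brevity; the inversion arises when both are submitted concurrently from
// separate threads and "bulk" already occupies every core.
sc.setLocalProperty("spark.scheduler.pool", "latency")
sc.parallelize(1 to 10, 2).count()
{code}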
[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328432#comment-15328432 ] Shivaram Venkataraman commented on SPARK-15931: --- cc [~felixcheung] We should print out what are the names of the methods in expected vs actual as this has failed before as well > SparkR tests failing on R 3.3.0 > --- > > Key: SPARK-15931 > URL: https://issues.apache.org/jira/browse/SPARK-15931 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Environment: > # Spark master Git revision: > [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788] > # R version: 3.3.0 > To reproduce this, just build Spark with {{-Psparkr}} and run the tests. > Relevant log lines: > {noformat} > ... > Failed > - > 1. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 2. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15776) Type coercion incorrect
[ https://issues.apache.org/jira/browse/SPARK-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328420#comment-15328420 ] Apache Spark commented on SPARK-15776: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/13651 > Type coercion incorrect > --- > > Key: SPARK-15776 > URL: https://issues.apache.org/jira/browse/SPARK-15776 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Spark based on commit > 26c1089c37149061f838129bb53330ded68ff4c9 >Reporter: Weizhong >Priority: Minor > > {code:sql} > CREATE TABLE cdr ( > debet_dt int , > srv_typ_cdstring , > b_brnd_cd smallint , > call_dur int > ) > ROW FORMAT delimited fields terminated by ',' > STORED AS TEXTFILE; > {code} > {code:sql} > SELECT debet_dt, >SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END) > FROM cdr > GROUP BY debet_dt > ORDER BY debet_dt; > {code} > {noformat} > == Analyzed Logical Plan == > debet_dt: int, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) ELSE 0 > END): bigint > Project [debet_dt#16, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) > ELSE 0 END)#27L] > +- Sort [debet_dt#16 ASC], true >+- Aggregate [debet_dt#16], [debet_dt#16, sum(cast(CASE WHEN srv_typ_cd#18 > LIKE 0% THEN (cast(call_dur#21 as double) / cast(60 as double)) ELSE cast(0 > as double) END as bigint)) AS sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur > / 60) ELSE 0 END)#27L] > +- MetastoreRelation default, cdr > {noformat} > {code:sql} > SELECT debet_dt, >SUM(CASE WHEN b_brnd_cd IN(1) THEN call_dur / 60 ELSE 0 END) > FROM cdr > GROUP BY debet_dt > ORDER BY debet_dt; > {code} > {noformat} > == Analyzed Logical Plan == > debet_dt: int, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS INT))) > THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) > END): double > Project [debet_dt#76, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS > INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS > DOUBLE) END)#87] > +- Sort [debet_dt#76 ASC], true >+- Aggregate [debet_dt#76], [debet_dt#76, sum(CASE WHEN cast(b_brnd_cd#80 > as int) IN (cast(1 as int)) THEN (cast(call_dur#81 as double) / cast(60 as > double)) ELSE cast(0 as double) END) AS sum(CASE WHEN (CAST(b_brnd_cd AS INT) > IN (CAST(1 AS INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) > ELSE CAST(0 AS DOUBLE) END)#87] > +- MetastoreRelation default, cdr > {noformat} > The only difference is WHEN condition, but will result different output > column type(one is bigint, one is double) > We need to apply "Division" before "FunctionArgumentConversion", like below: > {code:java} > val typeCoercionRules = > PropagateTypes :: > InConversion :: > WidenSetOperationTypes :: > PromoteStrings :: > DecimalPrecision :: > BooleanEquality :: > StringToIntegralCasts :: > Division :: > FunctionArgumentConversion :: > CaseWhenCoercion :: > IfCoercion :: > PropagateTypes :: > ImplicitTypeCasts :: > DateTimeOperations :: > Nil > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328417#comment-15328417 ] Bryan Cutler commented on SPARK-15861: -- {{mapPartitions}} will expect the function to return a sequence, that's what you are referring to right? > pyspark mapPartitions with none generator functions / functors > -- > > Key: SPARK-15861 > URL: https://issues.apache.org/jira/browse/SPARK-15861 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Greg Bowyer >Priority: Minor > > Hi all, it appears that the method `rdd.mapPartitions` does odd things if it > is fed a normal subroutine. > For instance, lets say we have the following > {code} > rows = range(25) > rows = [rows[i:i+5] for i in range(0, len(rows), 5)] > rdd = sc.parallelize(rows, 2) > def to_np(data): > return np.array(list(data)) > rdd.mapPartitions(to_np).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > rdd.mapPartitions(to_np, preservePartitioning=True).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > {code} > This basically makes the provided function that did return act like the end > user called {code}rdd.map{code} > I think that maybe a check should be put in to call > {code}inspect.isgeneratorfunction{code} > ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order
[ https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328405#comment-15328405 ] Dongjoon Hyun commented on SPARK-15918: --- Hi, [~Prabhu Joseph]. Instead of changing one of the tables, you just need to use explicit `select`. If `df1(a,b)` and `df2(b,a)`, please do the followings. {code} df1.union(df2.select("a", "b")) {code} IMHO, this is not a problem. > unionAll returns wrong result when two dataframes has schema in different > order > --- > > Key: SPARK-15918 > URL: https://issues.apache.org/jira/browse/SPARK-15918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: CentOS >Reporter: Prabhu Joseph > > On applying unionAll operation between A and B dataframes, they both has same > schema but in different order and hence the result has column value mapping > changed. > Repro: > {code} > A.show() > +---++---+--+--+-++---+--+---+---+-+ > |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---++---+--+--+-++---+--+---+---+-+ > +---++---+--+--+-++---+--+---+---+-+ > B.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+--+---+---+--++ > |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT718.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT703Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUR716A.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT803Z.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT728.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUR806.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > +-+---+--+---+---+--+--+--+---+---+--++ > A = A.unionAll(B) > A.show() > +---+---+--+--+--+-++---+--+---+---+-+ > |tag| year_day| > tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---+---+--+--+--+-++---+--+---+---+-+ > | F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > +---+---+--+--+--+-++---+--+---+---+-+ > {code} > On changing the schema of A according to B and doing unionAll works fine > {code} > C = > A.select("dtype","tag","time","tm_hour","tm_mday","tm_min",”tm_mon”,"tm_sec","tm_yday","tm_year","value","year_day") > A = C.unionAll(B) > A.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+--+---+---+--++ > |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNH
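The explicit select workaround above can be generalized so that the right-hand DataFrame is always reordered to the left-hand schema before the union. A small sketch follows; the helper name is made up, while unionAll, columns, select, and functions.col are the existing 1.6 APIs.
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch of a generic version of the workaround: line columns up by name
// instead of by position before unioning. Helper name is made up.
def unionByColumnName(left: DataFrame, right: DataFrame): DataFrame =
  left.unionAll(right.select(left.columns.map(col): _*))
{code}
With the tables above this would be unionByColumnName(A, B), leaving A's column order authoritative.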
[jira] [Commented] (SPARK-15931) SparkR tests failing on R 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328403#comment-15328403 ] Cheng Lian commented on SPARK-15931: cc [~mengxr] > SparkR tests failing on R 3.3.0 > --- > > Key: SPARK-15931 > URL: https://issues.apache.org/jira/browse/SPARK-15931 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > Environment: > # Spark master Git revision: > [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788] > # R version: 3.3.0 > To reproduce this, just build Spark with {{-Psparkr}} and run the tests. > Relevant log lines: > {noformat} > ... > Failed > - > 1. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 2. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15931) SparkR tests failing on R 3.3.0
Cheng Lian created SPARK-15931: -- Summary: SparkR tests failing on R 3.3.0 Key: SPARK-15931 URL: https://issues.apache.org/jira/browse/SPARK-15931 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Cheng Lian Environment: # Spark master Git revision: [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788] # R version: 3.3.0 To reproduce this, just build Spark with {{-Psparkr}} and run the tests. Relevant log lines: {noformat} ... Failed - 1. Failure: Check masked functions (@test_context.R#44) length(maskedCompletely) not equal to length(namesOfMaskedCompletely). 1/1 mismatches [1] 3 - 5 == -2 2. Failure: Check masked functions (@test_context.R#45) sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). Lengths differ: 3 vs 5 ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328390#comment-15328390 ] Reynold Xin commented on SPARK-15690: - Yes there is definitely no reason to go through network for a single process. Technically we can even bypass the entire DAGScheduler, although that might be too much work. > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then write the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tend to be low. However, > an increasing number of Spark users are using the system to process data on a > single-node. When in a single node operating against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fallback to disk > if the data size does not fit in memory. Given the number of partitions is > usually small (say less than 256), it'd require only a single pass do to the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15887) Bring back the hive-site.xml support for Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15887. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13611 [https://github.com/apache/spark/pull/13611] > Bring back the hive-site.xml support for Spark 2.0 > -- > > Key: SPARK-15887 > URL: https://issues.apache.org/jira/browse/SPARK-15887 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, > it seems to make sense to still load this conf file. > Originally, this file was loaded when we loaded the HiveConf class, and all settings > could be retrieved after we created a HiveConf instance. Let's avoid using > this approach to load hive-site.xml. Instead, since hive-site.xml is a normal > Hadoop conf file, we can first find its URL using the classloader and then > use Hadoop Configuration's addResource (or add hive-site.xml as a default > resource through Configuration.addDefaultResource) to load confs. > Please note that hive-site.xml needs to be loaded into the Hadoop conf used > to create metadataHive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
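The approach described in the resolved issue can be sketched in a few lines. This is a standalone illustration, not the actual Spark patch; hadoopConf here is a fresh Configuration rather than the one used to create metadataHive.
{code}
import org.apache.hadoop.conf.Configuration

// Locate hive-site.xml on the classpath and register it as an ordinary Hadoop
// conf resource, instead of relying on HiveConf's class-loading side effects.
val hadoopConf = new Configuration()
val hiveSiteUrl = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
if (hiveSiteUrl != null) {
  hadoopConf.addResource(hiveSiteUrl)
}
// Hive settings (e.g. the metastore URI) are then visible via hadoopConf.get(...).
{code}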
[jira] [Commented] (SPARK-15753) Move some Analyzer stuff to Analyzer from DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-15753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328381#comment-15328381 ] Wenchen Fan commented on SPARK-15753: - this is reverted, see discussion https://github.com/apache/spark/pull/13496#discussion_r66724862 > Move some Analyzer stuff to Analyzer from DataFrameWriter > - > > Key: SPARK-15753 > URL: https://issues.apache.org/jira/browse/SPARK-15753 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > DataFrameWriter.insertInto includes some Analyzer stuff. We should move it to > Analyzer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328380#comment-15328380 ] Greg Bowyer commented on SPARK-15861: - ... Hum from my end-users testing it does not seem to fail if the map function does not return a valid sequence > pyspark mapPartitions with none generator functions / functors > -- > > Key: SPARK-15861 > URL: https://issues.apache.org/jira/browse/SPARK-15861 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Greg Bowyer >Priority: Minor > > Hi all, it appears that the method `rdd.mapPartitions` does odd things if it > is fed a normal subroutine. > For instance, lets say we have the following > {code} > rows = range(25) > rows = [rows[i:i+5] for i in range(0, len(rows), 5)] > rdd = sc.parallelize(rows, 2) > def to_np(data): > return np.array(list(data)) > rdd.mapPartitions(to_np).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > rdd.mapPartitions(to_np, preservePartitioning=True).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > {code} > This basically makes the provided function that did return act like the end > user called {code}rdd.map{code} > I think that maybe a check should be put in to call > {code}inspect.isgeneratorfunction{code} > ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328378#comment-15328378 ] Saisai Shao commented on SPARK-15690: - I see. Since everything is in a single process, looks like netty layer could be by-passed and directly fetched the memory blocks in the reader side. It should definitely be faster than the current implementation. > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then write the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tend to be low. However, > an increasing number of Spark users are using the system to process data on a > single-node. When in a single node operating against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fallback to disk > if the data size does not fit in memory. Given the number of partitions is > usually small (say less than 256), it'd require only a single pass do to the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9623) RandomForestRegressor: provide variance of predictions
[ https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9623: --- Assignee: Apache Spark > RandomForestRegressor: provide variance of predictions > -- > > Key: SPARK-9623 > URL: https://issues.apache.org/jira/browse/SPARK-9623 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > Variance of predicted value, as estimated from training data. > Analogous to class probabilities for classification. > See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15930) Add Row count property to FPGrowth model
John Aherne created SPARK-15930: --- Summary: Add Row count property to FPGrowth model Key: SPARK-15930 URL: https://issues.apache.org/jira/browse/SPARK-15930 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.6.1 Reporter: John Aherne Priority: Minor Add a row count property to MLlib's FPGrowth model. When using the model from FPGrowth, a count of the total number of records is often necessary. It appears that the function already calculates that value when training the model, so it would save time not having to do it again outside the model. Sorry if this is the wrong place for this kind of stuff. I am new to Jira, Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
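Until such a property exists, the count has to be taken outside the model. A small workaround sketch with MLlib's FPGrowth is shown below; sc is assumed to be an existing SparkContext and the transactions are made up.
{code}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Cache the input, count it once, and keep the count alongside the model.
val transactions: RDD[Array[String]] = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "d")
)).cache()

val numTransactions = transactions.count() // the value the request asks the model to expose
val model = new FPGrowth().setMinSupport(0.5).setNumPartitions(2).run(transactions)

// Relative support of each frequent itemset:
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " => " + itemset.freq.toDouble / numTransactions)
}
{code}
Because the RDD is cached by the time run() triggers training, the extra count() pass is cheap, but it is still a second traversal that a row-count property on the model would avoid.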
[jira] [Assigned] (SPARK-9623) RandomForestRegressor: provide variance of predictions
[ https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9623: --- Assignee: (was: Apache Spark) > RandomForestRegressor: provide variance of predictions > -- > > Key: SPARK-9623 > URL: https://issues.apache.org/jira/browse/SPARK-9623 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Variance of predicted value, as estimated from training data. > Analogous to class probabilities for classification. > See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions
[ https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328343#comment-15328343 ] Apache Spark commented on SPARK-9623: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/13650 > RandomForestRegressor: provide variance of predictions > -- > > Key: SPARK-9623 > URL: https://issues.apache.org/jira/browse/SPARK-9623 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Variance of predicted value, as estimated from training data. > Analogous to class probabilities for classification. > See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable
[ https://issues.apache.org/jira/browse/SPARK-15929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328329#comment-15328329 ] Apache Spark commented on SPARK-15929: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/13649 > DataFrameSuite path globbing error message tests are not fully portable > --- > > Key: SPARK-15929 > URL: https://issues.apache.org/jira/browse/SPARK-15929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The DataFrameSuite regression tests for SPARK-13774 fail in my environment > because they attempt to glob over all of {{/mnt}} and some of the > subdirectories in there have restrictive permissions which cause the test to > fail. I think we should rewrite this test to not depend existing / OS paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15929) DataFrameSuite path globbing error message tests are not fully portable
Josh Rosen created SPARK-15929: -- Summary: DataFrameSuite path globbing error message tests are not fully portable Key: SPARK-15929 URL: https://issues.apache.org/jira/browse/SPARK-15929 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Josh Rosen Assignee: Josh Rosen The DataFrameSuite regression tests for SPARK-13774 fail in my environment because they attempt to glob over all of {{/mnt}} and some of the subdirectories in there have restrictive permissions which cause the test to fail. I think we should rewrite this test to not depend existing / OS paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-15928) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.
[ https://issues.apache.org/jira/browse/SPARK-15928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout deleted SPARK-15928: --- > Eliminate redundant code in DAGScheduler's getParentStages and > getAncestorShuffleDependencies methods. > -- > > Key: SPARK-15928 > URL: https://issues.apache.org/jira/browse/SPARK-15928 > Project: Spark > Issue Type: Sub-task >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > The getParentStages and getAncestorShuffleDependencies methods have a lot of > repeated code to traverse the dependency graph. We should create a function > that they can both call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar
[ https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328266#comment-15328266 ] Shixiong Zhu commented on SPARK-15905: -- Do you have a reproducer? What does your code look like? > Driver hung while writing to console progress bar > - > > Key: SPARK-15905 > URL: https://issues.apache.org/jira/browse/SPARK-15905 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Tejas Patil >Priority: Minor > > This leads to driver being not able to get heartbeats from its executors and > job being stuck. After looking at the locking dependency amongst the driver > threads per the jstack, this is where the driver seems to be stuck. > {noformat} > "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 > nid=0x7887d runnable [0x7f6d3507a000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:326) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream) > at java.io.PrintStream.write(PrintStream.java:482) >- locked <0x7f6eb81dd258> (a java.io.PrintStream) > at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) > at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) > at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104) > - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter) > at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185) > at java.io.PrintStream.write(PrintStream.java:527) > - locked <0x7f6eb81dd258> (a java.io.PrintStream) > at java.io.PrintStream.print(PrintStream.java:669) > at > org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69) > - locked <0x7f6ed33b48a0> (a > org.apache.spark.ui.ConsoleProgressBar) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328264#comment-15328264 ] Reynold Xin commented on SPARK-15690: - Yup. Eventually we can also generalize this to multiple process (e.g. cluster). > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then write the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tend to be low. However, > an increasing number of Spark users are using the system to process data on a > single-node. When in a single node operating against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fallback to disk > if the data size does not fit in memory. Given the number of partitions is > usually small (say less than 256), it'd require only a single pass do to the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245 ] Bryan Cutler edited comment on SPARK-15861 at 6/13/16 9:05 PM: --- [~gbow...@fastmail.co.uk] {{mapPartitions}} expects a function that takes an iterator as input then outputs an iterable sequence, and your function in the example is actually providing this. I think what is going on here is your function will map the iterator to a numpy array, that internally will be something like {noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first partition, then {{collect}} will iterate over that sequence and return each element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on.. I believe this is working as it is supposed to, and in general, {{mapPartitions}} will not usually give the same result as {{map}} - it will fail if the function does not return a valid sequence. The documentation could perhaps be a little clearer in that regard. was (Author: bryanc): [~gbow...@fastmail.co.uk] {{mapPartitions}} expects a function the takes an iterator as input then outputs an iterable sequence, and your function in the example is actually providing this. I think what is going on here is your function will map the iterator to a numpy array, that internally will be something like {noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first partition, then {{collect}} will iterate over that sequence and return each element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on.. I believe this is working as it is supposed to, and in general, {{mapPartitions}} will not usually give the same result as {{map}} - it will fail if the function does not return a valid sequence. The documentation could perhaps be a little clearer in that regard. > pyspark mapPartitions with none generator functions / functors > -- > > Key: SPARK-15861 > URL: https://issues.apache.org/jira/browse/SPARK-15861 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Greg Bowyer >Priority: Minor > > Hi all, it appears that the method `rdd.mapPartitions` does odd things if it > is fed a normal subroutine. > For instance, lets say we have the following > {code} > rows = range(25) > rows = [rows[i:i+5] for i in range(0, len(rows), 5)] > rdd = sc.parallelize(rows, 2) > def to_np(data): > return np.array(list(data)) > rdd.mapPartitions(to_np).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > rdd.mapPartitions(to_np, preservePartitioning=True).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > {code} > This basically makes the provided function that did return act like the end > user called {code}rdd.map{code} > I think that maybe a check should be put in to call > {code}inspect.isgeneratorfunction{code} > ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5374) abstract RDD's DAG graph iteration in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout closed SPARK-5374. - Resolution: Duplicate Closing this because it duplicates the more narrowly-scoped JIRAs linked above. > abstract RDD's DAG graph iteration in DAGScheduler > -- > > Key: SPARK-5374 > URL: https://issues.apache.org/jira/browse/SPARK-5374 > Project: Spark > Issue Type: Sub-task > Components: Scheduler >Reporter: Wenchen Fan > > DAGScheduler has many methods that iterate an RDD's DAG graph, we should > abstract the iterate process to reduce code size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328255#comment-15328255 ] Saisai Shao commented on SPARK-15690: - Hi [~rxin], what's the meaning of "single-process", is that referring to something similar to local mode? > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then write the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tend to be low. However, > an increasing number of Spark users are using the system to process data on a > single-node. When in a single node operating against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fallback to disk > if the data size does not fit in memory. Given the number of partitions is > usually small (say less than 256), it'd require only a single pass do to the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245 ] Bryan Cutler commented on SPARK-15861: -- [~gbow...@fastmail.co.uk] {{mapPartitions}} expects a function the takes an iterator as input then outputs an iterable sequence, and your function in the example is actually providing this. I think what is going on here is your function will map the iterator to a numpy array, that internally will be something like {noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first partition, then {{collect}} will iterate over that sequence and return each element, which will also be a numpy array, so you get {noformat}array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])) {noformat} for the first 2 elements and so on.. I believe this is working as it is supposed to, and in general, {{mapPartitions}} will not usually give the same result as {{map}} - it will fail if the function does not return a valid sequence. The documentation could perhaps be a little clearer in that regard. > pyspark mapPartitions with none generator functions / functors > -- > > Key: SPARK-15861 > URL: https://issues.apache.org/jira/browse/SPARK-15861 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Greg Bowyer >Priority: Minor > > Hi all, it appears that the method `rdd.mapPartitions` does odd things if it > is fed a normal subroutine. > For instance, lets say we have the following > {code} > rows = range(25) > rows = [rows[i:i+5] for i in range(0, len(rows), 5)] > rdd = sc.parallelize(rows, 2) > def to_np(data): > return np.array(list(data)) > rdd.mapPartitions(to_np).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > rdd.mapPartitions(to_np, preservePartitioning=True).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > {code} > This basically makes the provided function that did return act like the end > user called {code}rdd.map{code} > I think that maybe a check should be put in to call > {code}inspect.isgeneratorfunction{code} > ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15889) Add a unique id to ContinuousQuery
[ https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15889. -- Resolution: Fixed > Add a unique id to ContinuousQuery > -- > > Key: SPARK-15889 > URL: https://issues.apache.org/jira/browse/SPARK-15889 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > ContinuousQueries have names that are unique across all the active ones. > However, when queries are rapidly restarted with same name, it causes races > conditions with the listener. A listener event from a stopped query can > arrive after the query has been restarted, leading to complexities in > monitoring infrastructure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15889) Add a unique id to ContinuousQuery
[ https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15889: - Fix Version/s: 2.0.0 > Add a unique id to ContinuousQuery > -- > > Key: SPARK-15889 > URL: https://issues.apache.org/jira/browse/SPARK-15889 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.0 > > > ContinuousQueries have names that are unique across all the active ones. > However, when queries are rapidly restarted with same name, it causes races > conditions with the listener. A listener event from a stopped query can > arrive after the query has been restarted, leading to complexities in > monitoring infrastructure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15530: - Assignee: Takeshi Yamamuro > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Takeshi Yamamuro > Fix For: 2.0.0 > > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15530) Partitioning discovery logic HadoopFsRelation should use a higher setting of parallelism
[ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15530. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13444 [https://github.com/apache/spark/pull/13444] > Partitioning discovery logic HadoopFsRelation should use a higher setting of > parallelism > > > Key: SPARK-15530 > URL: https://issues.apache.org/jira/browse/SPARK-15530 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > Fix For: 2.0.0 > > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, > we launch a spark job to do parallel file listing in order to discover > partitions. However, we do not set the number of partitions at here, which > means that we are using the default parallelism of the cluster. It is better > to set the number of partitions explicitly to generate smaller tasks, which > help load balancing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
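The fix pattern described here, driver-side file listing distributed as a Spark job with an explicit parallelism, looks roughly like the sketch below. The paths and the cap are made up, sc is an existing SparkContext, and this is not the actual code in fileSourceInterfaces.scala.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Distribute the leaf-directory listing and pass the number of partitions
// explicitly instead of relying on the cluster's default parallelism.
val pathsToList: Seq[String] = Seq("/data/table/part=1", "/data/table/part=2") // made up
val listingParallelism = math.min(pathsToList.size, 10000)

val listedFiles: Array[(String, Long)] =
  sc.parallelize(pathsToList, listingParallelism).mapPartitions { dirs =>
    // Build the Hadoop conf inside the task, since Configuration is not serializable.
    val hadoopConf = new Configuration()
    dirs.flatMap { dir =>
      val path = new Path(dir)
      val fs = path.getFileSystem(hadoopConf)
      // Return simple (path, length) pairs rather than FileStatus objects.
      fs.listStatus(path).map(status => (status.getPath.toString, status.getLen))
    }
  }.collect()
{code}
Capping the parallelism while still scaling it with the number of directories keeps the listing tasks small, which is the load-balancing point made in the description.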
[jira] [Commented] (SPARK-15924) SparkR parser bug with backslash in comments
[ https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328205#comment-15328205 ] Xuan Wang commented on SPARK-15924: --- I then realized that this is not a problem with SparkR, so I closed the issue. Thanks! > SparkR parser bug with backslash in comments > > > Key: SPARK-15924 > URL: https://issues.apache.org/jira/browse/SPARK-15924 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Xuan Wang > > When I run an R cell with the following comments: > {code} > # p <- p + scale_fill_manual(values = set2[groups]) > # # p <- p + scale_fill_brewer(palette = "Set2") + > scale_color_brewer(palette = "Set2") > # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) > # p > {code} > I get the following error message > {quote} > :16:1: unexpected input > 15: # p <- p + scale_x_date(labels = date_format("%m/%d > 16: %a")) > ^ > {quote} > After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328200#comment-15328200 ] Herman van Hovell commented on SPARK-15822: --- [~robbinspg] Could you try this without caching? > segmentation violation in o.a.s.unsafe.types.UTF8String > > > Key: SPARK-15822 > URL: https://issues.apache.org/jira/browse/SPARK-15822 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: linux amd64 > openjdk version "1.8.0_91" > OpenJDK Runtime Environment (build 1.8.0_91-b14) > OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode) >Reporter: Pete Robbins >Assignee: Herman van Hovell >Priority: Blocker > > Executors fail with segmentation violation while running application with > spark.memory.offHeap.enabled true > spark.memory.offHeap.size 512m > Also now reproduced with > spark.memory.offHeap.enabled false > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 > # > # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 4816 C2 > org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I > (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] > {noformat} > We initially saw this on IBM java on PowerPC box but is recreatable on linux > with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the > same code point: > {noformat} > 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) > java.lang.NullPointerException > at > org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) > at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) > at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) > at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15676) Disallow Column Names as Partition Columns For Hive Tables
[ https://issues.apache.org/jira/browse/SPARK-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15676. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13415 [https://github.com/apache/spark/pull/13415] > Disallow Column Names as Partition Columns For Hive Tables > -- > > Key: SPARK-15676 > URL: https://issues.apache.org/jira/browse/SPARK-15676 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > Below is a common mistake users might make: > {noformat} > hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data > string, part string); > FAILED: SemanticException [Error 10035]: Column repeated in partitioning > columns > {noformat} > Different from what Hive returned, currently, we return a confusing error > message: > {noformat} > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For > direct MetaStore DB connections, we don't support retries at the client > level.); > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15676) Disallow Column Names as Partition Columns For Hive Tables
[ https://issues.apache.org/jira/browse/SPARK-15676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15676: - Assignee: Xiao Li > Disallow Column Names as Partition Columns For Hive Tables > -- > > Key: SPARK-15676 > URL: https://issues.apache.org/jira/browse/SPARK-15676 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > Below is a common mistake users might make: > {noformat} > hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data > string, part string); > FAILED: SemanticException [Error 10035]: Column repeated in partitioning > columns > {noformat} > Different from what Hive returned, currently, we return a confusing error > message: > {noformat} > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For > direct MetaStore DB connections, we don't support retries at the client > level.); > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org