[jira] [Updated] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangdenghui updated SPARK-19007:
Description:
Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
Parameters: numIterations 10, maxDepth 8, the rest at their defaults.
I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
After this change the job took about 45 minutes, with no task failures and no extra memory.
So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

was:
Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
After this change the job took about 45 minutes, with no task failures and no extra memory.
So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G memory, 40 cores, 43TB disk per server).
> Reporter: zhangdenghui
> Fix For: 2.1.0
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some
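For reference, a minimal sketch of the persist/unpersist pattern proposed above (the helpers standing in for one boosting round are hypothetical; the real logic lives in the boost loop of GradientBoostedTrees.scala and is more involved):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object GbtPersistSketch {
  // Hypothetical stand-ins for the initial predictions/errors and for one
  // GBT round; only the caching pattern matters here.
  def initialPredError(sc: SparkContext): RDD[(Double, Double)] =
    sc.parallelize(Seq((0.0, 0.0), (1.0, 0.5)))

  def oneRound(prev: RDD[(Double, Double)]): RDD[(Double, Double)] =
    prev.map { case (pred, err) => (pred + 0.1, err * 0.9) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "gbt-persist-sketch")
    val numIterations = 10

    var predError = initialPredError(sc)
    predError.persist(StorageLevel.MEMORY_AND_DISK)

    for (m <- 1 until numIterations) {
      val prevPredError = predError             // keep a handle on the old RDD
      predError = oneRound(prevPredError)
      predError.persist(StorageLevel.MEMORY_AND_DISK)
      predError.count()                         // materialize before dropping the parent
      prevPredError.unpersist(blocking = false) // previous round is no longer needed
    }
    sc.stop()
  }
}
{code}

Persisting each round's RDD caps the lineage that has to be recomputed, while unpersisting the previous round keeps the memory footprint bounded, which matches the behavior reported in the description.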
[jira] [Updated] (SPARK-19004) Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`
[ https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19004:
Description:
JDBCSuite and JDBCWriteSuite each have their own testH2Dialect for testing purposes. This issue fixes the testH2Dialect in JDBCWriteSuite by removing its getCatalystType implementation so that correct types are returned. Currently, it always returns Some(StringType), which is incorrect. For the testH2Dialect in JDBCSuite, the override is intentional because of the test case "Remap types via JdbcDialects".

was:
`JdbcDialect` subclasses should return `None` by default. However, `testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is incorrect. This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` implementation.

> Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`
>
> Key: SPARK-19004
> URL: https://issues.apache.org/jira/browse/SPARK-19004
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.1.0
> Reporter: Dongjoon Hyun
> Priority: Minor
>
> JDBCSuite and JDBCWriteSuite each have their own testH2Dialect for testing purposes.
> This issue fixes the testH2Dialect in JDBCWriteSuite by removing its getCatalystType implementation so that correct types are returned. Currently, it always returns Some(StringType), which is incorrect. For the testH2Dialect in JDBCSuite, the override is intentional because of the test case "Remap types via JdbcDialects".

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19004) Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`
[ https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19004:
Summary: Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType` (was: Fix `testH2Dialect` by removing `getCatalystType`)

> Fix `JDBCWriteSuite.testH2Dialect` by removing `getCatalystType`
>
> Key: SPARK-19004
> URL: https://issues.apache.org/jira/browse/SPARK-19004
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.1.0
> Reporter: Dongjoon Hyun
> Priority: Minor
>
> `JdbcDialect` subclasses should return `None` by default. However, `testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is incorrect. This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` implementation.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
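For context, a minimal sketch of the shape involved (a hypothetical dialect, not the actual test code): the base JdbcDialect.getCatalystType returns None, deferring to the default JDBC-to-Catalyst type mapping, so an unconditional override like the one below remaps every column to StringType.

{code}
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Hypothetical dialect reproducing the bug described above: overriding
// getCatalystType unconditionally remaps every JDBC column to StringType.
object BadH2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    Some(StringType) // removing this override restores the correct default types
}
{code}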
[jira] [Assigned] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19007:
Assignee: Apache Spark

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G memory, 40 cores, 43TB disk per server).
> Reporter: zhangdenghui
> Assignee: Apache Spark
> Fix For: 2.1.0
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
> Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
> It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
> I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
> Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
> I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
> After this change the job took about 45 minutes, with no task failures and no extra memory.
> So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779790#comment-15779790 ] Apache Spark commented on SPARK-19007:
User 'zdh2292390' has created a pull request for this issue: https://github.com/apache/spark/pull/16415

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G memory, 40 cores, 43TB disk per server).
> Reporter: zhangdenghui
> Fix For: 2.1.0
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
> Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
> It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
> I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
> Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
> I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
> After this change the job took about 45 minutes, with no task failures and no extra memory.
> So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19007:
Assignee: (was: Apache Spark)

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G memory, 40 cores, 43TB disk per server).
> Reporter: zhangdenghui
> Fix For: 2.1.0
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
> Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
> It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
> I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
> Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
> I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
> After this change the job took about 45 minutes, with no task failures and no extra memory.
> So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779766#comment-15779766 ] Yuming Wang commented on SPARK-15044:
I've tested this on v2.1.0-rc5; it works fine if {{spark.sql.hive.verifyPartitionPath=true}}:
{code:sql}
create table spark_15044 (n string) partitioned by (p string);
insert overwrite table spark_15044 partition(p='2016-12-27') values('27');
insert overwrite table spark_15044 partition(p='2016-12-26') values('26');
insert overwrite table spark_15044 partition(p='2016-12-25') values('25');
dfs -rmr /user/hive/warehouse/spark_15044/p=2016-12-25;
set spark.sql.hive.verifyPartitionPath=true;
select * from spark_15044;
{code}

> spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1, 2.0.0
> Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a partition which exists in the Hive table but whose path has been removed manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: ./test/p=1
> at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (and as it did in earlier versions), rather than throw an exception.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
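The same workaround can also be applied from a spark-shell session (a hedged sketch; it assumes the spark_15044 table from the repro above already exists):

{code}
// Hypothetical spark-shell equivalent of the SQL repro above: with the flag
// enabled, partitions whose paths are missing are skipped instead of failing.
spark.conf.set("spark.sql.hive.verifyPartitionPath", "true")
spark.sql("select * from spark_15044").show()
{code}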
[jira] [Assigned] (SPARK-19009) Add doc for Streaming Rest API
[ https://issues.apache.org/jira/browse/SPARK-19009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19009: Assignee: (was: Apache Spark) > Add doc for Streaming Rest API > -- > > Key: SPARK-19009 > URL: https://issues.apache.org/jira/browse/SPARK-19009 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19009) Add doc for Streaming Rest API
[ https://issues.apache.org/jira/browse/SPARK-19009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779762#comment-15779762 ] Apache Spark commented on SPARK-19009: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16414 > Add doc for Streaming Rest API > -- > > Key: SPARK-19009 > URL: https://issues.apache.org/jira/browse/SPARK-19009 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19009) Add doc for Streaming Rest API
[ https://issues.apache.org/jira/browse/SPARK-19009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19009: Assignee: Apache Spark > Add doc for Streaming Rest API > -- > > Key: SPARK-19009 > URL: https://issues.apache.org/jira/browse/SPARK-19009 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779760#comment-15779760 ] Wayne Zhang commented on SPARK-18710:
Thanks for the comment, Yanbo. In IRLS, the fit method expects an RDD[Instance]. Does it still work if one feeds an RDD[GLRInstance] to it?
{code}
def fit(instances: RDD[Instance]): IterativelyReweightedLeastSquaresModel = {
{code}

> Add offset to GeneralizedLinearRegression models
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.0.2
> Reporter: Wayne Zhang
> Assignee: Wayne Zhang
> Labels: features
> Original Estimate: 10h
> Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. An offset can be useful for taking exposure into account, or for testing the incremental effect of new variables. While it is possible to use weights in the current implementation to achieve the same effect as specifying an offset for certain models (e.g., Poisson & Binomial with a log offset), it is desirable to have an offset option that works in more general cases, e.g., a negative offset, or an offset that is hard to express via weights (e.g., an offset to the probability rather than the odds in logistic regression).
> The effort would involve:
> * updating the regression class to support offsetCol
> * updating IRLS to take offset into account
> * adding test cases for offset
> I can start working on this if the community approves this feature.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
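One detail relevant to that question (a hedged sketch with hypothetical types, since the actual GLRInstance design isn't shown here): Scala's RDD[T] is invariant in T, so an RDD[GLRInstance] would not conform to RDD[Instance] even if GLRInstance were a subtype, and an explicit widening map would be needed.

{code}
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the ML classes discussed above.
class Instance(val label: Double)
class GLRInstance(label: Double, val offset: Double) extends Instance(label)

object VarianceSketch {
  def fit(instances: RDD[Instance]): Unit = ()

  def example(glr: RDD[GLRInstance]): Unit = {
    // fit(glr)                      // does not compile: RDD[T] is invariant in T
    fit(glr.map(x => x: Instance))   // an explicit widening map is required
  }
}
{code}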
[jira] [Created] (SPARK-19009) Add doc for Streaming Rest API
Genmao Yu created SPARK-19009: - Summary: Add doc for Streaming Rest API Key: SPARK-19009 URL: https://issues.apache.org/jira/browse/SPARK-19009 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangdenghui updated SPARK-19007:
Description:
Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
After this change the job took about 45 minutes, with no task failures and no extra memory.
So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

was:
Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
After this change the job took about 45 minutes, with no task failures and no extra memory.
So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G memory, 40 cores, 43TB disk per server).
> Reporter: zhangdenghui
> Fix For: 2.1.0
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to
[jira] [Commented] (SPARK-18955) Add ability to emit kafka events to DStream or KafkaDStream
[ https://issues.apache.org/jira/browse/SPARK-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779554#comment-15779554 ] Russell Jurney commented on SPARK-18955:
Can I please get feedback as to whether this patch would be accepted? I don't want to do the work if it isn't even something that would be merged.

> Add ability to emit kafka events to DStream or KafkaDStream
>
> Key: SPARK-18955
> URL: https://issues.apache.org/jira/browse/SPARK-18955
> Project: Spark
> Issue Type: New Feature
> Components: DStreams, PySpark
> Affects Versions: 2.0.2
> Reporter: Russell Jurney
> Labels: features, newbie
>
> Any I/O that needs doing in Spark Streaming seems to have to be done in a DStream.foreachRDD loop. For instance, in PySpark, if I want to emit Kafka events for each record, I have to call DStream.foreachRDD and use kafka-python to emit a Kafka event per record.
> It really seems like I/O like this should be part of the pyspark.streaming or pyspark.streaming.kafka API, and of the equivalent Scala APIs. Something like DStream.emitKafkaEvents or KafkaDStream.emitKafkaEvents would seem to make sense.
> If this is a good idea, and it seems feasible, I'd like to take a crack at it as my first patch for Spark. Advice would be appreciated. What would need to be modified to make this happen?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
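For reference, the Scala-side version of the foreachRDD workaround described above looks roughly like this (a hedged sketch; the broker address and topic name are placeholders, and a production job would typically reuse producers rather than recreate them per partition):

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Sketch of the current workaround: create a producer per partition and
// emit one Kafka event per record.
def emitToKafka(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092") // placeholder broker list
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        records.foreach(r => producer.send(new ProducerRecord[String, String]("events", r)))
      } finally {
        producer.close()
      }
    }
  }
}
{code}

A built-in DStream.emitKafkaEvents would essentially wrap this boilerplate.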
[jira] [Commented] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
[ https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779544#comment-15779544 ] Kazuaki Ishizaki commented on SPARK-19008:
I will work on this.

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Kazuaki Ishizaki
>
> In a [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid boxing/unboxing overhead when a Dataset program calls a lambda, written in Scala, that operates on a primitive type.
> In such a case, Catalyst can directly call a specialized method {{<primitive> apply(<primitive>);}} instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be [here|https://issues.apache.org/jira/browse/SPARK-14083].

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
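As a standalone illustration of the overhead in question (hypothetical names; this is not the Catalyst-generated code itself), Scala already compiles specialized variants of Function1, and the cost depends on which entry point gets called:

{code}
object BoxingSketch {
  def main(args: Array[String]): Unit = {
    val inc: Long => Long = _ + 1L

    // Generic path: the call goes through the bridge method
    // Object apply(Object), boxing both the argument and the result.
    val generic = inc.asInstanceOf[Any => Any]
    val boxed = generic(41L)

    // Specialized path: compiles to a direct call to the specialized
    // method (apply$mcJJ$sp), with no boxing at all.
    val primitive: Long = inc(41L)

    println(s"$boxed $primitive")
  }
}
{code}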
[jira] [Updated] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
[ https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-19008:
Description:
In a [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid boxing/unboxing overhead when a Dataset program calls a lambda, written in Scala, that operates on a primitive type. In such a case, Catalyst can directly call a specialized method {{<primitive> apply(<primitive>);}} instead of {{Object apply(Object);}}. Of course, the best solution seems to be [here|https://issues.apache.org/jira/browse/SPARK-14083].

was:
In this [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid boxing/unboxing overhead when a Dataset program calls a lambda, written in Scala, that operates on a primitive type. In such a case, Catalyst can directly call a specialized method {{<primitive> apply(<primitive>);}} instead of {{Object apply(Object);}}. Of course, the best solution seems to be [here|https://issues.apache.org/jira/browse/SPARK-14083].

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Kazuaki Ishizaki
>
> In a [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid boxing/unboxing overhead when a Dataset program calls a lambda, written in Scala, that operates on a primitive type.
> In such a case, Catalyst can directly call a specialized method {{<primitive> apply(<primitive>);}} instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be [here|https://issues.apache.org/jira/browse/SPARK-14083].

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
Kazuaki Ishizaki created SPARK-19008:
Summary: Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
Key: SPARK-19008
URL: https://issues.apache.org/jira/browse/SPARK-19008
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Kazuaki Ishizaki

In this [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid boxing/unboxing overhead when a Dataset program calls a lambda, written in Scala, that operates on a primitive type. In such a case, Catalyst can directly call a specialized method {{<primitive> apply(<primitive>);}} instead of {{Object apply(Object);}}. Of course, the best solution seems to be [here|https://issues.apache.org/jira/browse/SPARK-14083].

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangdenghui updated SPARK-19007:
Description:
Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
After this change the job took about 45 minutes, with no task failures and no extra memory.
So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

was:
Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
After this change the job took about 45 minutes, with no task failures and no extra memory.
So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120G memory, 40 cores, 43TB disk per server).
> Reporter: zhangdenghui
> Fix For: 2.1.0
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to
[jira] [Created] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
zhangdenghui created SPARK-19007:
Summary: Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
Key: SPARK-19007
URL: https://issues.apache.org/jira/browse/SPARK-19007
Project: Spark
Issue Type: Improvement
Components: ML, MLlib
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0
Environment: A CDH cluster consisting of 3 Red Hat servers (120G memory, 40 cores, 43TB disk per server).
Reporter: zhangdenghui
Fix For: 2.1.0

Test data: 80G of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per executor.
I tested the GradientBoostedTrees algorithm in MLlib using the 80G of CTR data mentioned above.
It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think they are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: its lineage grows with every GBT round, which eventually causes failures like this: (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
I tried boosting spark.yarn.executor.memoryOverhead, but the memory it needs is too much (even increasing the memory by half can't solve the problem), so I don't think that is a proper fix.
Setting the checkpoint interval smaller would truncate the lineage, but it increases I/O cost a lot.
I tried another way to solve this problem: I persist the predError RDD every round, use pre_predError to record the previous round's RDD, and unpersist it since it is no longer needed.
After this change the job took about 45 minutes, with no task failures and no extra memory.
So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18931) Create empty staging directory in partitioned table on insert
[ https://issues.apache.org/jira/browse/SPARK-18931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779439#comment-15779439 ] Hyukjin Kwon commented on SPARK-18931:
Could we resolve this as a duplicate?

> Create empty staging directory in partitioned table on insert
>
> Key: SPARK-18931
> URL: https://issues.apache.org/jira/browse/SPARK-18931
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
> )
> PARTITIONED BY (
>   day string)
> stored as parquet
>
> On every
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from hss.session where year=2016 and month=4 group by day
> a new directory ".hive-staging_hive_2016-12-19_15-55-11_298_3412488541559534475-4" is created on HDFS. This is a big issue, because I insert every day, and a bunch of empty dirs on HDFS is very bad for HDFS.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3528) Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
[ https://issues.apache.org/jira/browse/SPARK-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779436#comment-15779436 ] Umesh Chaudhary commented on SPARK-3528:
[~gip] It looks like when we provide file:/// as the URI, executors should be able to find it from their respective hosts, so NODE_LOCAL makes sense. I found this [1] user-list discussion worth mentioning. IMHO, executors read the data into their own JVM (from the mentioned URI), and after that the data becomes PROCESS_LOCAL for the executors.
[1] http://apache-spark-user-list.1001560.n3.nabble.com/When-does-Spark-switch-from-PROCESS-LOCAL-to-NODE-LOCAL-or-RACK-LOCAL-td7091.html

> Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL
>
> Key: SPARK-3528
> URL: https://issues.apache.org/jira/browse/SPARK-3528
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Andrew Ash
> Priority: Critical
>
> Note that reading from {{file:///.../pom.xml}} is called a PROCESS_LOCAL task
> {noformat}
> spark> sc.textFile("pom.xml").count
> ...
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1191 bytes)
> 14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:20862+20863
> 14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:0+20862
> {noformat}
> There is an outstanding TODO in {{HadoopRDD.scala}} that may be related:
> {noformat}
> override def getPreferredLocations(split: Partition): Seq[String] = {
>   // TODO: Filtering out "localhost" in case of file:// URLs
>   val hadoopSplit = split.asInstanceOf[HadoopPartition]
>   hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
> }
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18997) Recommended upgrade libthrift to 0.9.3
[ https://issues.apache.org/jira/browse/SPARK-18997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779389#comment-15779389 ] meiyoula commented on SPARK-18997:
I think the dependency mainly comes in through Hive, and Hive has now upgraded to 0.9.3.

> Recommended upgrade libthrift to 0.9.3
>
> Key: SPARK-18997
> URL: https://issues.apache.org/jira/browse/SPARK-18997
> Project: Spark
> Issue Type: Bug
> Components: Build
> Reporter: meiyoula
> Priority: Critical
>
> libthrift 0.9.2 has a serious security vulnerability: CVE-2015-3254

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19002) Check pep8 against dev/*.py scripts
[ https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19002: - Summary: Check pep8 against dev/*.py scripts (was: Check pep8 against merge_spark_pr.py script) > Check pep8 against dev/*.py scripts > --- > > Key: SPARK-19002 > URL: https://issues.apache.org/jira/browse/SPARK-19002 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Hyukjin Kwon >Priority: Trivial > > We can check pep8 against merge_spark_pr.py script. There are already several > python scripts there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19002) Check pep8 against dev/*.py scripts
[ https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19002: - Description: We can check pep8 against dev/*.py scripts. There are already several python scripts being checked. (was: We can check pep8 against merge_spark_pr.py script. There are already several python scripts there.) > Check pep8 against dev/*.py scripts > --- > > Key: SPARK-19002 > URL: https://issues.apache.org/jira/browse/SPARK-19002 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Hyukjin Kwon >Priority: Trivial > > We can check pep8 against dev/*.py scripts. There are already several python > scripts being checked. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
[ https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19006:
Assignee: Apache Spark

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
> Issue Type: Documentation
> Reporter: Yuexin Zhang
> Assignee: Apache Spark
>
> On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html), spark.kryoserializer.buffer.max is described as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo."
> However, the source code has a hard-coded upper limit:
> {code}
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
> if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
>   throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
>     s"2048 mb, got: + $maxBufferSizeMb mb.")
> }
> {code}
> We should mention that this value must be less than 2048 mb on the config page as well.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
[ https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19006:
Assignee: (was: Apache Spark)

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
> Issue Type: Documentation
> Reporter: Yuexin Zhang
>
> On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html), spark.kryoserializer.buffer.max is described as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo."
> However, the source code has a hard-coded upper limit:
> {code}
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
> if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
>   throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
>     s"2048 mb, got: + $maxBufferSizeMb mb.")
> }
> {code}
> We should mention that this value must be less than 2048 mb on the config page as well.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
[ https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779334#comment-15779334 ] Apache Spark commented on SPARK-19006:
User 'cnZach' has created a pull request for this issue: https://github.com/apache/spark/pull/16412

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
> Issue Type: Documentation
> Reporter: Yuexin Zhang
>
> On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html), spark.kryoserializer.buffer.max is described as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo."
> However, the source code has a hard-coded upper limit:
> {code}
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
> if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
>   throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
>     s"2048 mb, got: + $maxBufferSizeMb mb.")
> }
> {code}
> We should mention that this value must be less than 2048 mb on the config page as well.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19006) should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
[ https://issues.apache.org/jira/browse/SPARK-19006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuexin Zhang updated SPARK-19006:
Description:
On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html), spark.kryoserializer.buffer.max is described as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo."
However, the source code has a hard-coded upper limit:
{code}
val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
  throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
    s"2048 mb, got: + $maxBufferSizeMb mb.")
}
{code}
We should mention that this value must be less than 2048 mb on the config page as well.

> should mentioned the max value allowed for spark.kryoserializer.buffer.max in doc
>
> Key: SPARK-19006
> URL: https://issues.apache.org/jira/browse/SPARK-19006
> Project: Spark
> Issue Type: Documentation
> Reporter: Yuexin Zhang
>
> On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html), spark.kryoserializer.buffer.max is described as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo."
> However, the source code has a hard-coded upper limit:
> {code}
> val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
> if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
>   throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
>     s"2048 mb, got: + $maxBufferSizeMb mb.")
> }
> {code}
> We should mention that this value must be less than 2048 mb on the config page as well.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
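For example, a valid setting must stay under that cap (a minimal sketch; the 512m value is purely illustrative):

{code}
import org.apache.spark.SparkConf

// The value must stay below the hard-coded 2048 mb limit quoted above.
val conf = new SparkConf()
  .setAppName("kryo-buffer-example")
  .set("spark.kryoserializer.buffer.max", "512m")
{code}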
[jira] [Commented] (SPARK-17984) Add support for numa aware feature
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779305#comment-15779305 ] Apache Spark commented on SPARK-17984:
User 'xiaochang-wu' has created a pull request for this issue: https://github.com/apache/spark/pull/16411

> Add support for numa aware feature
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
> Issue Type: New Feature
> Components: Deploy, Mesos, YARN
> Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (72 Cores)
> Memory: 128GB (2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
> Reporter: quanfuwang
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> This JIRA targets adding support for a NUMA-aware feature, which can help improve performance by having cores access local memory rather than remote memory.
> A patch is being developed, see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> NUMA-aware support for the YARN-based deployment mode
> NUMA-aware support for the Mesos-based deployment mode
> NUMA-aware support for the Standalone-based deployment mode

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19006) should mention the max value allowed for spark.kryoserializer.buffer.max in doc
Yuexin Zhang created SPARK-19006: Summary: should mention the max value allowed for spark.kryoserializer.buffer.max in doc Key: SPARK-19006 URL: https://issues.apache.org/jira/browse/SPARK-19006 Project: Spark Issue Type: Documentation Reporter: Yuexin Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779288#comment-15779288 ] Kazuaki Ishizaki commented on SPARK-14083: -- [Here|https://github.com/apache/spark/pull/16391#discussion_r93788919] is another motivation to apply this optimization cc:[~cloud_fan] > Analyze JVM bytecode and turn closures into Catalyst expressions > > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > One big advantage of the Dataset API is the type safety, at the cost of > performance due to heavy reliance on user-defined closures/lambdas. These > closures are typically slower than expressions because we have more > flexibility to optimize expressions (known data types, no virtual function > calls, etc). In many cases, it's actually not going to be very difficult to > look into the byte code of these closures and figure out what they are trying > to do. If we can understand them, then we can turn them directly into > Catalyst expressions for more optimized execution. > Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), > lit(18)) > df.map(_.id + 1) // equivalent to Add(col("id"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis > and use that to convert closures/lambdas into Catalyst expressions in order > to speed up Dataset execution. It is a little bit futuristic, but I believe > it is very doable. The framework should be easy to reason about (e.g. similar > to Catalyst). > Note the big emphasis on "small" and "easy to reason about". A patch > should be rejected if it is too complicated or difficult to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
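A hedged Scala sketch of the gap this ticket wants to close, runnable in spark-shell; the Person case class and session setup are illustrative assumptions, not code from the ticket.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("closure-demo").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("alice", 20), Person("bob", 15)).toDS()

// Today: an opaque lambda that Catalyst cannot inspect, executed as a
// black-box function call per row.
val viaClosure = ds.filter(_.age > 18)

// The Catalyst-expression form the proposed bytecode analysis would
// derive automatically; this version is visible to the optimizer
// (predicate pushdown, whole-stage codegen, etc.).
val viaExpression = ds.filter(col("age") > lit(18))
{code}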
[jira] [Assigned] (SPARK-19005) Keep column ordering when a schema is explicitly specified
[ https://issues.apache.org/jira/browse/SPARK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19005: Assignee: Apache Spark > Keep column ordering when a schema is explicitly specified > --- > > Key: SPARK-19005 > URL: https://issues.apache.org/jira/browse/SPARK-19005 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > This ticket is to keep column ordering when a schema is explicitly specified. > A concrete example is as follows; > {code} > scala> import org.apache.spark.sql.types._ > scala> case class A(a: Long, b: Int) > scala> val as = Seq(A(1, 2)) > scala> > spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/") > scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/") > scala> df.printSchema > root > |-- a: integer (nullable = true) > |-- b: integer (nullable = true) > scala> val schema = new StructType().add("a", LongType).add("b", IntegerType) > scala> val df = > spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/") > scala> df.printSchema > root > |-- b: integer (nullable = true) > |-- a: long (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19005) Keep column ordering when a schema is explicitly specified
[ https://issues.apache.org/jira/browse/SPARK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779235#comment-15779235 ] Apache Spark commented on SPARK-19005: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/16410 > Keep column ordering when a schema is explicitly specified > --- > > Key: SPARK-19005 > URL: https://issues.apache.org/jira/browse/SPARK-19005 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This ticket is to keep column ordering when a schema is explicitly specified. > A concrete example is as follows; > {code} > scala> import org.apache.spark.sql.types._ > scala> case class A(a: Long, b: Int) > scala> val as = Seq(A(1, 2)) > scala> > spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/") > scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/") > scala> df.printSchema > root > |-- a: integer (nullable = true) > |-- b: integer (nullable = true) > scala> val schema = new StructType().add("a", LongType).add("b", IntegerType) > scala> val df = > spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/") > scala> df.printSchema > root > |-- b: integer (nullable = true) > |-- a: long (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19005) Keep column ordering when a schema is explicitly specified
[ https://issues.apache.org/jira/browse/SPARK-19005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19005: Assignee: (was: Apache Spark) > Keep column ordering when a schema is explicitly specified > --- > > Key: SPARK-19005 > URL: https://issues.apache.org/jira/browse/SPARK-19005 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This ticket is to keep column ordering when a schema is explicitly specified. > A concrete example is as follows; > {code} > scala> import org.apache.spark.sql.types._ > scala> case class A(a: Long, b: Int) > scala> val as = Seq(A(1, 2)) > scala> > spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/") > scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/") > scala> df.printSchema > root > |-- a: integer (nullable = true) > |-- b: integer (nullable = true) > scala> val schema = new StructType().add("a", LongType).add("b", IntegerType) > scala> val df = > spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/") > scala> df.printSchema > root > |-- b: integer (nullable = true) > |-- a: long (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19005) Keep column ordering when a schema is explicitly specified
Takeshi Yamamuro created SPARK-19005: Summary: Keep column ordering when a schema is explicitly specified Key: SPARK-19005 URL: https://issues.apache.org/jira/browse/SPARK-19005 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Takeshi Yamamuro Priority: Minor This ticket is to keep column ordering when a schema is explicitly specified. A concrete example is as follows; {code} scala> import org.apache.spark.sql.types._ scala> case class A(a: Long, b: Int) scala> val as = Seq(A(1, 2)) scala> spark.createDataFrame(as).write.parquet("/Users/maropu/Desktop/data/a=1/") scala> val df = spark.read.parquet("/Users/maropu/Desktop/data/") scala> df.printSchema root |-- a: integer (nullable = true) |-- b: integer (nullable = true) scala> val schema = new StructType().add("a", LongType).add("b", IntegerType) scala> val df = spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/") scala> df.printSchema root |-- b: integer (nullable = true) |-- a: long (nullable = true) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
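Until the fix lands, one possible user-side workaround (a sketch based on the repro above, not taken from the ticket) is to select the columns back into the declared order after the read, since schema.fieldNames preserves the user's ordering.
{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val schema = new StructType().add("a", LongType).add("b", IntegerType)
val df = spark.read.schema(schema).parquet("/Users/maropu/Desktop/data/")

// Re-impose the declared field order on the returned DataFrame.
val ordered = df.select(schema.fieldNames.map(col): _*)
ordered.printSchema() // a first, then b, matching the explicit schema
{code}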
[jira] [Commented] (SPARK-18991) Change ContextCleaner.referenceBuffer to ConcurrentHashMap to make it faster
[ https://issues.apache.org/jira/browse/SPARK-18991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779002#comment-15779002 ] Shixiong Zhu commented on SPARK-18991: -- [~prashant_] FYI, this may also be necessary for your Kafka benchmark. I observed our internal jobs fail with OOM because the ContextCleaner was too slow, even after disabling blocking cleanup. > Change ContextCleaner.referenceBuffer to ConcurrentHashMap to make it faster > > > Key: SPARK-18991 > URL: https://issues.apache.org/jira/browse/SPARK-18991 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > Right now `ContextCleaner.referenceBuffer` is a ConcurrentLinkedQueue, and the > time complexity of the `remove` action is O(n). It can be changed to use a > ConcurrentHashMap, whose `remove` is O(1). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
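A minimal sketch of the data-structure swap described above; CleanupRef is a hypothetical stand-in for ContextCleaner's internal weak-reference type, and the set view is backed by a ConcurrentHashMap.
{code}
import java.util.concurrent.ConcurrentHashMap

final class CleanupRef // hypothetical stand-in for the real weak reference

// ConcurrentLinkedQueue.remove(x) scans the whole queue: O(n) per call.
// A ConcurrentHashMap-backed set keyed by the reference itself removes
// in O(1), which is the change this ticket describes.
val referenceBuffer = ConcurrentHashMap.newKeySet[CleanupRef]()

val ref = new CleanupRef
referenceBuffer.add(ref)    // registered when the object is tracked
referenceBuffer.remove(ref) // O(1) removal when it is cleaned up
{code}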
[jira] [Commented] (SPARK-18997) Recommend upgrading libthrift to 0.9.3
[ https://issues.apache.org/jira/browse/SPARK-18997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778946#comment-15778946 ] Dongjoon Hyun commented on SPARK-18997: --- +1 > Recommend upgrading libthrift to 0.9.3 > --- > > Key: SPARK-18997 > URL: https://issues.apache.org/jira/browse/SPARK-18997 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: meiyoula >Priority: Critical > > libthrift 0.9.2 has a serious security vulnerability: CVE-2015-3254 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
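For builds that cannot wait for the upstream bump, a hedged sbt sketch of pinning the transitive dependency; the coordinates below are the standard org.apache.thrift ones, but verify them against your own dependency tree.
{code}
// build.sbt: force libthrift 0.9.3 ahead of the transitive 0.9.2
dependencyOverrides += "org.apache.thrift" % "libthrift" % "0.9.3"
{code}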
[jira] [Comment Edited] (SPARK-19000) Spark beeline: table was created in the default database even though specifying a database name
[ https://issues.apache.org/jira/browse/SPARK-19000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778943#comment-15778943 ] Dongjoon Hyun edited comment on SPARK-19000 at 12/26/16 8:32 PM: - Thank you for reporting, [~ouyangxc]. Yes, that is correct. Also, the issue was resolved by SPARK-17819 as of 2.0.2. Could you try the latest Apache Spark 2.1.0 again? http://www-us.apache.org/dist/spark/spark-2.1.0/ was (Author: dongjoon): Thank you for reporting, [~ouyangxc]. Yes, that is correct. The issue is resolved by SPARK-17819 as of 2.0.2. Could you try the latest Apache Spark 2.1.0 again? http://www-us.apache.org/dist/spark/spark-2.1.0/ > Spark beeline: table was created in the default database even though > specifying a database name > > > Key: SPARK-19000 > URL: https://issues.apache.org/jira/browse/SPARK-19000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark2.0.0 >Reporter: Xiaochen Ouyang > > After running the following command: > ${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e > "create table test_1(id int ,name string) row format delimited fields > terminated by ',' stored as textfile;" > I found that the table "test_1" was created in the default database rather > than the mydb database that I specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19000) Spark beeline: table was created in the default database even though specifying a database name
[ https://issues.apache.org/jira/browse/SPARK-19000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-19000. --- Resolution: Duplicate > Spark beeline: table was created in the default database even though > specifying a database name > > > Key: SPARK-19000 > URL: https://issues.apache.org/jira/browse/SPARK-19000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark2.0.0 >Reporter: Xiaochen Ouyang > > After running the following command: > ${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e > "create table test_1(id int ,name string) row format delimited fields > terminated by ',' stored as textfile;" > I found that the table "test_1" was created in the default database rather > than the mydb database that I specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19000) Spark beeline: table was created in the default database even though specifying a database name
[ https://issues.apache.org/jira/browse/SPARK-19000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778943#comment-15778943 ] Dongjoon Hyun commented on SPARK-19000: --- Thank you for reporting, [~ouyangxc]. Yes, that is correct. The issue is resolved by SPARK-17819 as of 2.0.2. Could you try the latest Apache Spark 2.1.0 again? http://www-us.apache.org/dist/spark/spark-2.1.0/ > Spark beeline: table was created in the default database even though > specifying a database name > > > Key: SPARK-19000 > URL: https://issues.apache.org/jira/browse/SPARK-19000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark2.0.0 >Reporter: Xiaochen Ouyang > > After running the following command: > ${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e > "create table test_1(id int ,name string) row format delimited fields > terminated by ',' stored as textfile;" > I found that the table "test_1" was created in the default database rather > than the mydb database that I specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18989) DESC TABLE should not fail with format class not found
[ https://issues.apache.org/jira/browse/SPARK-18989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18989. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16388 [https://github.com/apache/spark/pull/16388] > DESC TABLE should not fail with format class not found > -- > > Key: SPARK-18989 > URL: https://issues.apache.org/jira/browse/SPARK-18989 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`
[ https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19004: Assignee: (was: Apache Spark) > Fix `testH2Dialect` by removing `getCatalystType` > - > > Key: SPARK-19004 > URL: https://issues.apache.org/jira/browse/SPARK-19004 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `JdbcDialect` subclasses should return `None` by default. However, > `testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is > incorrect. > This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`
[ https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778847#comment-15778847 ] Apache Spark commented on SPARK-19004: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16409 > Fix `testH2Dialect` by removing `getCatalystType` > - > > Key: SPARK-19004 > URL: https://issues.apache.org/jira/browse/SPARK-19004 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `JdbcDialect` subclasses should return `None` by default. However, > `testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is > incorrect. > This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`
[ https://issues.apache.org/jira/browse/SPARK-19004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19004: Assignee: Apache Spark > Fix `testH2Dialect` by removing `getCatalystType` > - > > Key: SPARK-19004 > URL: https://issues.apache.org/jira/browse/SPARK-19004 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > `JdbcDialect` subclasses should return `None` by default. However, > `testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is > incorrect. > This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19004) Fix `testH2Dialect` by removing `getCatalystType`
Dongjoon Hyun created SPARK-19004: - Summary: Fix `testH2Dialect` by removing `getCatalystType` Key: SPARK-19004 URL: https://issues.apache.org/jira/browse/SPARK-19004 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.1.0 Reporter: Dongjoon Hyun Priority: Minor `JdbcDialect` subclasses should return `None` by default. However, `testH2Dialect.getCatalystType` always returns `Some(StringType)`, which is incorrect. This issue fixes `testH2Dialect` by removing the wrong `getCatalystType` implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
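A sketch of the contract the ticket restores, written against the public JdbcDialect API; MyH2Dialect is a hypothetical example dialect, not the test helper being fixed.
{code}
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

object MyH2Dialect extends JdbcDialect { // hypothetical dialect
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  // The correct default: return None so Spark falls back to its
  // built-in JDBC-to-Catalyst type mapping instead of forcing every
  // column to StringType.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = None
}
{code}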
[jira] [Assigned] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"
[ https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19003: Assignee: (was: Apache Spark) > Add Java examples in "Spark Streaming Guide", section "Design Patterns for > using foreachRDD" > - > > Key: SPARK-19003 > URL: https://issues.apache.org/jira/browse/SPARK-19003 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Tushar Adeshara >Priority: Minor > > The page http://spark.apache.org/docs/latest/streaming-programming-guide.html > is missing a Java example in the section "Design Patterns for using > foreachRDD". > Except for this section, the page has Scala, Java, and Python examples for > all other sections, so it would be good to add one for consistency. > I have made the required code changes and will raise a pull request against > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"
[ https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778698#comment-15778698 ] Apache Spark commented on SPARK-19003: -- User 'adesharatushar' has created a pull request for this issue: https://github.com/apache/spark/pull/16408 > Add Java examples in "Spark Streaming Guide", section "Design Patterns for > using foreachRDD" > - > > Key: SPARK-19003 > URL: https://issues.apache.org/jira/browse/SPARK-19003 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Tushar Adeshara >Priority: Minor > > The page http://spark.apache.org/docs/latest/streaming-programming-guide.html > is missing a Java example in the section "Design Patterns for using > foreachRDD". > Except for this section, the page has Scala, Java, and Python examples for > all other sections, so it would be good to add one for consistency. > I have made the required code changes and will raise a pull request against > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"
[ https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19003: Assignee: Apache Spark > Add Java examples in "Spark Streaming Guide", section "Design Patterns for > using foreachRDD" > - > > Key: SPARK-19003 > URL: https://issues.apache.org/jira/browse/SPARK-19003 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Tushar Adeshara >Assignee: Apache Spark >Priority: Minor > > The page http://spark.apache.org/docs/latest/streaming-programming-guide.html > is missing a Java example in the section "Design Patterns for using > foreachRDD". > Except for this section, the page has Scala, Java, and Python examples for > all other sections, so it would be good to add one for consistency. > I have made the required code changes and will raise a pull request against > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"
[ https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778699#comment-15778699 ] Tushar Adeshara commented on SPARK-19003: - Pull request https://github.com/apache/spark/pull/16408 > Add Java examples in "Spark Streaming Guide", section "Design Patterns for > using foreachRDD" > - > > Key: SPARK-19003 > URL: https://issues.apache.org/jira/browse/SPARK-19003 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Tushar Adeshara >Priority: Minor > > The page http://spark.apache.org/docs/latest/streaming-programming-guide.html > is missing a Java example in the section "Design Patterns for using > foreachRDD". > Except for this section, the page has Scala, Java, and Python examples for > all other sections, so it would be good to add one for consistency. > I have made the required code changes and will raise a pull request against > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"
Tushar Adeshara created SPARK-19003: --- Summary: Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD" Key: SPARK-19003 URL: https://issues.apache.org/jira/browse/SPARK-19003 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Tushar Adeshara Priority: Minor The page http://spark.apache.org/docs/latest/streaming-programming-guide.html is missing a Java example in the section "Design Patterns for using foreachRDD". Except for this section, the page has Scala, Java, and Python examples for all other sections, so it would be good to add one for consistency. I have made the required code changes and will raise a pull request against this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
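For reference, the Scala shape of the design pattern whose Java twin the ticket adds, as a hedged sketch; ConnectionPool and its methods are hypothetical placeholders for whatever pooled client the application uses.
{code}
import org.apache.spark.streaming.dstream.DStream

def saveToExternalSystem(dstream: DStream[String]): Unit = {
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      // One connection per partition, reused for every record in it,
      // rather than one per record or one serialized from the driver.
      val connection = ConnectionPool.getConnection() // hypothetical pool
      partitionOfRecords.foreach(record => connection.send(record)) // hypothetical send
      ConnectionPool.returnConnection(connection) // return for reuse across batches
    }
  }
}
{code}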
[jira] [Resolved] (SPARK-18980) implement Aggregator with TypedImperativeAggregate
[ https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18980. - Resolution: Fixed Fix Version/s: 2.2.0 > implement Aggregator with TypedImperativeAggregate > -- > > Key: SPARK-18980 > URL: https://issues.apache.org/jira/browse/SPARK-18980 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19001) Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each RegisterWorker response
[ https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liujianhui updated SPARK-19001: --- Description: h2. Scene The worker will submit multiple SendHeartbeat and CleanWorkDir tasks if it registers itself again; the code is as follows {code} private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized { msg match { case RegisteredWorker(masterRef, masterWebUiUrl) => logInfo("Successfully registered with master " + masterRef.address.toSparkURL) registered = true changeMaster(masterRef, masterWebUiUrl) forwordMessageScheduler.scheduleAtFixedRate(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { self.send(SendHeartbeat) } }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS) if (CLEANUP_ENABLED) { logInfo( s"Worker cleanup enabled; old application directories will be deleted in: $workDir") forwordMessageScheduler.scheduleAtFixedRate(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { self.send(WorkDirCleanup) } }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS) } {code} The log is as follows {code} 2016-12-20 20:23:30,030 | Successfully registered with master spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) 2016-12-26 20:41:58,058 | Successfully registered with master spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) {code} If the worker sends the RegisterWorker event many times when the master finds the worker's heartbeat expired, it will submit these tasks multiple times was: h2. Scene The worker will submit multiple SendHeartbeat and CleanWorkDir tasks if it registers itself again; the code is as follows {code} private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized { msg match { case RegisteredWorker(masterRef, masterWebUiUrl) => logInfo("Successfully registered with master " + masterRef.address.toSparkURL) registered = true changeMaster(masterRef, masterWebUiUrl) forwordMessageScheduler.scheduleAtFixedRate(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { self.send(SendHeartbeat) } }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS) if (CLEANUP_ENABLED) { logInfo( s"Worker cleanup enabled; old application directories will be deleted in: $workDir") forwordMessageScheduler.scheduleAtFixedRate(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { self.send(WorkDirCleanup) } }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS) } {code} The log is as follows {code} 2016-12-20 20:23:30,030 | Successfully registered with master spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) 2016-12-26 20:41:58,058 | Successfully registered with master spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) {code} If the worker sends the RegisterWorker event many times if the master finds the worker's heartbeat expired, it will submit these tasks multiple times > Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each > RegisterWorker response > - > > Key: SPARK-19001 > URL: https://issues.apache.org/jira/browse/SPARK-19001 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.1 >Reporter: liujianhui >Priority: Minor > > h2. 
Scene > The worker will submit multiple SendHeartbeat and CleanWorkDir tasks if it > registers itself again; the code is as follows > {code} > private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = > synchronized { > msg match { > case RegisteredWorker(masterRef, masterWebUiUrl) => > logInfo("Successfully registered with master " + > masterRef.address.toSparkURL) > registered = true > changeMaster(masterRef, masterWebUiUrl) > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { > self.send(SendHeartbeat) > } > }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS) > if (CLEANUP_ENABLED) { > logInfo( > s"Worker cleanup enabled; old application directories will be > deleted in: $workDir") > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { >
[jira] [Commented] (SPARK-19001) Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each RegisterWorker response
[ https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778382#comment-15778382 ] Apache Spark commented on SPARK-19001: -- User 'liujianhuiouc' has created a pull request for this issue: https://github.com/apache/spark/pull/16406 > Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each > RegisterWorker response > - > > Key: SPARK-19001 > URL: https://issues.apache.org/jira/browse/SPARK-19001 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.1 >Reporter: liujianhui >Priority: Minor > > h2. Scene > The worker will submit multiple SendHeartbeat and CleanWorkDir tasks if it > registers itself again; the code is as follows > {code} > private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = > synchronized { > msg match { > case RegisteredWorker(masterRef, masterWebUiUrl) => > logInfo("Successfully registered with master " + > masterRef.address.toSparkURL) > registered = true > changeMaster(masterRef, masterWebUiUrl) > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { > self.send(SendHeartbeat) > } > }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS) > if (CLEANUP_ENABLED) { > logInfo( > s"Worker cleanup enabled; old application directories will be > deleted in: $workDir") > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { > self.send(WorkDirCleanup) > } > }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, > TimeUnit.MILLISECONDS) > } > {code} > The log is as follows > {code} > 2016-12-20 20:23:30,030 | Successfully registered with master > spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) > 2016-12-26 20:41:58,058 | Successfully registered with master > spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) > {code} > If the worker sends the RegisterWorker event many times when the master > finds the worker's heartbeat expired, it will submit these tasks multiple > times -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19001) Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each RegisterWorker response
[ https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19001: Assignee: (was: Apache Spark) > Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each > RegisterWorker response > - > > Key: SPARK-19001 > URL: https://issues.apache.org/jira/browse/SPARK-19001 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.1 >Reporter: liujianhui >Priority: Minor > > h2. Scene > The worker will submit multiple SendHeartbeat and CleanWorkDir tasks if it > registers itself again; the code is as follows > {code} > private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = > synchronized { > msg match { > case RegisteredWorker(masterRef, masterWebUiUrl) => > logInfo("Successfully registered with master " + > masterRef.address.toSparkURL) > registered = true > changeMaster(masterRef, masterWebUiUrl) > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { > self.send(SendHeartbeat) > } > }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS) > if (CLEANUP_ENABLED) { > logInfo( > s"Worker cleanup enabled; old application directories will be > deleted in: $workDir") > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { > self.send(WorkDirCleanup) > } > }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, > TimeUnit.MILLISECONDS) > } > {code} > The log is as follows > {code} > 2016-12-20 20:23:30,030 | Successfully registered with master > spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) > 2016-12-26 20:41:58,058 | Successfully registered with master > spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) > {code} > If the worker sends the RegisterWorker event many times when the master > finds the worker's heartbeat expired, it will submit these tasks multiple > times -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19001) Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each RegisterWorker response
[ https://issues.apache.org/jira/browse/SPARK-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19001: Assignee: Apache Spark > Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each > RegisterWorker response > - > > Key: SPARK-19001 > URL: https://issues.apache.org/jira/browse/SPARK-19001 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.1 >Reporter: liujianhui >Assignee: Apache Spark >Priority: Minor > > h2. Scene > The worker will submit multiple SendHeartbeat and CleanWorkDir tasks if it > registers itself again; the code is as follows > {code} > private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = > synchronized { > msg match { > case RegisteredWorker(masterRef, masterWebUiUrl) => > logInfo("Successfully registered with master " + > masterRef.address.toSparkURL) > registered = true > changeMaster(masterRef, masterWebUiUrl) > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { > self.send(SendHeartbeat) > } > }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS) > if (CLEANUP_ENABLED) { > logInfo( > s"Worker cleanup enabled; old application directories will be > deleted in: $workDir") > forwordMessageScheduler.scheduleAtFixedRate(new Runnable { > override def run(): Unit = Utils.tryLogNonFatalError { > self.send(WorkDirCleanup) > } > }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, > TimeUnit.MILLISECONDS) > } > {code} > The log is as follows > {code} > 2016-12-20 20:23:30,030 | Successfully registered with master > spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) > 2016-12-26 20:41:58,058 | Successfully registered with master > spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) > {code} > If the worker sends the RegisterWorker event many times when the master > finds the worker's heartbeat expired, it will submit these tasks multiple > times -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19002) Check pep8 against merge_spark_pr.py script
[ https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19002: Assignee: (was: Apache Spark) > Check pep8 against merge_spark_pr.py script > --- > > Key: SPARK-19002 > URL: https://issues.apache.org/jira/browse/SPARK-19002 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Hyukjin Kwon >Priority: Trivial > > We can check pep8 against the merge_spark_pr.py script. There are already > several Python scripts there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19002) Check pep8 against merge_spark_pr.py script
[ https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19002: Assignee: Apache Spark > Check pep8 against merge_spark_pr.py script > --- > > Key: SPARK-19002 > URL: https://issues.apache.org/jira/browse/SPARK-19002 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > We can check pep8 against the merge_spark_pr.py script. There are already > several Python scripts there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19002) Check pep8 against merge_spark_pr.py script
[ https://issues.apache.org/jira/browse/SPARK-19002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778313#comment-15778313 ] Apache Spark commented on SPARK-19002: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16405 > Check pep8 against merge_spark_pr.py script > --- > > Key: SPARK-19002 > URL: https://issues.apache.org/jira/browse/SPARK-19002 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Hyukjin Kwon >Priority: Trivial > > We can check pep8 against the merge_spark_pr.py script. There are already > several Python scripts there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19002) Check pep8 against merge_spark_pr.py script
Hyukjin Kwon created SPARK-19002: Summary: Check pep8 against merge_spark_pr.py script Key: SPARK-19002 URL: https://issues.apache.org/jira/browse/SPARK-19002 Project: Spark Issue Type: Improvement Components: Build Reporter: Hyukjin Kwon Priority: Trivial We can check pep8 against the merge_spark_pr.py script. There are already several Python scripts there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19001) Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each RegisterWorker response
liujianhui created SPARK-19001: -- Summary: Worker will submit multiple CleanWorkDir and SendHeartbeat tasks with each RegisterWorker response Key: SPARK-19001 URL: https://issues.apache.org/jira/browse/SPARK-19001 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.6.1 Reporter: liujianhui Priority: Minor h2. Scene The worker will submit multiple SendHeartbeat and CleanWorkDir tasks if it registers itself again; the code is as follows {code} private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized { msg match { case RegisteredWorker(masterRef, masterWebUiUrl) => logInfo("Successfully registered with master " + masterRef.address.toSparkURL) registered = true changeMaster(masterRef, masterWebUiUrl) forwordMessageScheduler.scheduleAtFixedRate(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { self.send(SendHeartbeat) } }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS) if (CLEANUP_ENABLED) { logInfo( s"Worker cleanup enabled; old application directories will be deleted in: $workDir") forwordMessageScheduler.scheduleAtFixedRate(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { self.send(WorkDirCleanup) } }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS) } {code} The log is as follows {code} 2016-12-20 20:23:30,030 | Successfully registered with master spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) 2016-12-26 20:41:58,058 | Successfully registered with master spark://polaris-1:8090|org.apache.spark.Logging$class.logInfo(Logging.scala:58) {code} If the worker sends the RegisterWorker event many times if the master finds the worker's heartbeat expired, it will submit these tasks multiple times -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
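One way to avoid piling up duplicate timers, as a hedged illustration of a possible fix rather than the actual patch; all names and structure here are assumptions.
{code}
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}

object HeartbeatScheduling { // hypothetical wrapper, not Worker code
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private var heartbeatTask: Option[ScheduledFuture[_]] = None

  // Cancel the previously scheduled task before scheduling a new one,
  // so a re-registration does not leave several timers running.
  def onRegistered(sendHeartbeat: () => Unit, heartbeatMillis: Long): Unit = synchronized {
    heartbeatTask.foreach(_.cancel(false)) // drop the stale timer, if any
    heartbeatTask = Some(scheduler.scheduleAtFixedRate(
      new Runnable { override def run(): Unit = sendHeartbeat() },
      0, heartbeatMillis, TimeUnit.MILLISECONDS))
  }
}
{code}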
[jira] [Created] (SPARK-19000) Spark beeline: table was created in the default database even though specifying a database name
Xiaochen Ouyang created SPARK-19000: --- Summary: Spark beeline: table was created in the default database even though specifying a database name Key: SPARK-19000 URL: https://issues.apache.org/jira/browse/SPARK-19000 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Environment: spark2.0.0 Reporter: Xiaochen Ouyang After running the following command: ${SPARK_HOME}/bin/beeline -u jdbc:hive2://192.168.156.5:10001/mydb -n mr -e "create table test_1(id int ,name string) row format delimited fields terminated by ',' stored as textfile;" I found that the table "test_1" was created in the default database rather than the mydb database that I specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18969) PullOutNondeterministic should work for Aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-18969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778112#comment-15778112 ] Apache Spark commented on SPARK-18969: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16404 > PullOutNondeterministic should work for Aggregate operator > -- > > Key: SPARK-18969 > URL: https://issues.apache.org/jira/browse/SPARK-18969 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > This test case should pass: > {code} > checkAnalysis( > r.groupBy(rnd)(rnd), > r.select(a, b, rnd).groupBy(rndref)(rndref) > ) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
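Restated as a hedged DataFrame-level sketch of what the rule does (the ticket's test exercises it at the analyzer level); this assumes an active SparkSession named spark with spark.implicits._ imported.
{code}
import org.apache.spark.sql.functions.rand

// Conceptually, PullOutNondeterministic rewrites a group-by on a
// nondeterministic expression into: project it once as a named column,
// then group on that column, so every reference sees the same value.
val r = spark.range(10)
val grouped = r.select($"id", rand(0).as("rnd")).groupBy($"rnd").count()
{code}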
[jira] [Assigned] (SPARK-18819) Double alignment on ARM71 platform
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18819: Assignee: (was: Apache Spark) > Double alignment on ARM71 platform > -- > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > _Note - Updated the ticket title to be reflective of what was found to be the > underlying issue_ > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. > Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. 
> Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at
[jira] [Assigned] (SPARK-18819) Double alignment on ARM71 platform
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18819: Assignee: Apache Spark > Double alignment on ARM71 platform > -- > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Assignee: Apache Spark >Priority: Critical > > _Note - Updated the ticket title to be reflective of what was found to be the > underlying issue_ > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. > Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. Check driver logs for > WARN messages. 
> Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) >
[jira] [Commented] (SPARK-18819) Double alignment on ARM71 platform
[ https://issues.apache.org/jira/browse/SPARK-18819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778107#comment-15778107 ] Apache Spark commented on SPARK-18819: -- User 'michaelkamprath' has created a pull request for this issue: https://github.com/apache/spark/pull/16403 > Double alignment on ARM71 platform > -- > > Key: SPARK-18819 > URL: https://issues.apache.org/jira/browse/SPARK-18819 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.0.2 > Environment: Ubuntu 14.04 LTS on ARM 7.1 >Reporter: Michael Kamprath >Priority: Critical > > _Note - Updated the ticket title to be reflective of what was found to be the > underlying issue_ > When I create a data frame in PySpark with a small row count (less than > number executors), then write it to a parquet file, then load that parquet > file into a new data frame, and finally do any sort of read against the > loaded new data frame, Spark fails with an {{ExecutorLostFailure}}. > Example code to replicate this issue: > {code} > from pyspark.sql.types import * > rdd = sc.parallelize([('row1',1,4.33,'name'),('row2',2,3.14,'string')]) > my_schema = StructType([ > StructField("id", StringType(), True), > StructField("value1", IntegerType(), True), > StructField("value2", DoubleType(), True), > StructField("name",StringType(), True) > ]) > df = spark.createDataFrame( rdd, schema=my_schema) > df.write.parquet('hdfs://master:9000/user/michael/test_data',mode='overwrite') > newdf = spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > newdf.take(1) > {code} > The error I get when the {{take}} step runs is: > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > 1 newdf = > spark.read.parquet('hdfs://master:9000/user/michael/test_data/') > > 2 newdf.take(1) > /usr/local/spark/python/pyspark/sql/dataframe.py in take(self, num) > 346 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 347 """ > --> 348 return self.limit(num).collect() > 349 > 350 @since(1.3) > /usr/local/spark/python/pyspark/sql/dataframe.py in collect(self) > 308 """ > 309 with SCCallSiteSync(self._sc) as css: > --> 310 port = self._jdf.collectToPython() > 311 return list(_load_from_socket(port, > BatchedSerializer(PickleSerializer( > 312 > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o54.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 6, 10.10.10.4): ExecutorLostFailure (executor 2 exited caused by one of > the running tasks) Reason: Remote RPC client disassociated. Likely due to > containers exceeding thresholds, or network issues. 
Check driver logs for > WARN messages. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at >
[jira] [Commented] (SPARK-18857) SparkSQL ThriftServer hangs while extracting huge data volumes in incremental collect mode
[ https://issues.apache.org/jira/browse/SPARK-18857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778011#comment-15778011 ] vishal agrawal commented on SPARK-18857: We have built Spark from the 2.0.2 source code, changing SparkExecuteStatementOperation.scala back to the pre-SPARK-16563 version. This version works fine without causing any Thrift server issues. > SparkSQL ThriftServer hangs while extracting huge data volumes in incremental > collect mode > -- > > Key: SPARK-18857 > URL: https://issues.apache.org/jira/browse/SPARK-18857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: vishal agrawal > Attachments: GC-spark-1.6.3, GC-spark-2.0.2 > > > We are trying to run a SQL query on our Spark cluster and extract around > 200 million records through the SparkSQL ThriftServer interface. This query > works fine for Spark version 1.6.3; however, for Spark 2.0.2 the Thrift > server hangs after fetching data from a few partitions (we are using > incremental collect mode with 400 partitions). As per the documentation, the > maximum memory taken up by the Thrift server should be what is required by > the biggest data partition. But we observed that the Thrift server is not > releasing the old partitions' memory when GC occurs, even though it has moved > on to the next partition's fetches, which is not the case with the 1.6.3 > version. > On further investigation we found that SparkExecuteStatementOperation.scala > was modified for "[SPARK-16563][SQL] fix spark sql thrift server FetchResults > bug" and the result set iterator was duplicated to keep a reference to the > first set. > + val (itra, itrb) = iter.duplicate > + iterHeader = itra > + iter = itrb > We suspect that this is resulting in the memory not being cleared on GC. To > confirm this, we created an iterator in our test class and fetched the data > once without duplicating and a second time with a duplicate. We could see > that in the first instance it ran fine and fetched the entire data set, while > in the second instance the driver hung after fetching data from a few > partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
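A self-contained illustration of the suspected mechanism (not ThriftServer code): Iterator.duplicate buffers every element that one side has consumed and the other has not, so a lagging header iterator that stays reachable pins all rows the other side has read.
{code}
val rows = Iterator.range(0, 5)
val (header, body) = rows.duplicate

println(header.next()) // 0; this element is also queued for `body`
body.foreach(println)  // prints 0..4; elements 1..4 are now buffered for `header`
// While `header` remains reachable, those buffered elements cannot be
// garbage collected, matching the retention the ticket describes.
{code}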