[jira] [Created] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.
Takuya Ueshin created SPARK-15915: - Summary: CacheManager should use canonicalized plan for planToCache. Key: SPARK-15915 URL: https://issues.apache.org/jira/browse/SPARK-15915 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin A {{DataFrame}} whose plan overrides {{sameResult}} without comparing canonicalized plans cannot be cached via {{cacheTable}}. For example: {code} val localRelation = Seq(1, 2, 3).toDF() localRelation.createOrReplaceTempView("localRelation") spark.catalog.cacheTable("localRelation") assert( localRelation.queryExecution.withCachedData.collect { case i: InMemoryRelation => i }.size == 1) {code} This fails with: {noformat} ArrayBuffer() had size 0 instead of expected size 1 {noformat} The reason is that {{spark.catalog.cacheTable("localRelation")}} caches the plan wrapped in {{SubqueryAlias}}, but when planning the DataFrame {{localRelation}}, {{CacheManager}} looks up the cached table with the unwrapped plan, because the plan for the DataFrame {{localRelation}} is not wrapped. Some plans, such as {{LocalRelation}} and {{LogicalRDD}}, override the {{sameResult}} method but do not compare canonicalized plans, so {{CacheManager}} cannot detect that the plans are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
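To make the canonicalization idea concrete, the following is a minimal, self-contained Scala sketch -- a toy model only, not Spark's actual Catalyst or {{CacheManager}} code -- showing why caching and looking up the canonicalized form lets a {{SubqueryAlias}}-wrapped plan and the bare plan match each other. All class and method names below are hypothetical.
{code}
// Toy model of plan canonicalization -- illustrative only, not Spark's Catalyst classes.
sealed trait Plan {
  // Strip cosmetic wrappers (aliases) so that semantically equal plans compare equal.
  def canonicalized: Plan = this match {
    case Alias(_, child) => child.canonicalized
    case other           => other
  }
  def sameResult(other: Plan): Boolean = this.canonicalized == other.canonicalized
}
case class LocalData(rows: Seq[Int]) extends Plan
case class Alias(name: String, child: Plan) extends Plan

object ToyCacheManager {
  private var cached = List.empty[Plan]
  // Cache the canonicalized form, as the issue title suggests for planToCache.
  def cacheQuery(plan: Plan): Unit = cached ::= plan.canonicalized
  def lookupCached(plan: Plan): Option[Plan] = cached.find(_.sameResult(plan))
}

object Demo extends App {
  val base    = LocalData(Seq(1, 2, 3))
  val wrapped = Alias("localRelation", base)   // roughly what cacheTable registers in the report above
  ToyCacheManager.cacheQuery(wrapped)
  // The DataFrame's own (unwrapped) plan is still found in the cache:
  assert(ToyCacheManager.lookupCached(base).isDefined)
}
{code}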
[jira] [Commented] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.
[ https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326921#comment-15326921 ] Apache Spark commented on SPARK-15915: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/13638 > CacheManager should use canonicalized plan for planToCache. > --- > > Key: SPARK-15915 > URL: https://issues.apache.org/jira/browse/SPARK-15915 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{DataFrame}} with plan overriding {{sameResult}} but not using canonicalized > plan to compare can't cacheTable. > The example is like: > {code} > val localRelation = Seq(1, 2, 3).toDF() > localRelation.createOrReplaceTempView("localRelation") > spark.catalog.cacheTable("localRelation") > assert( > localRelation.queryExecution.withCachedData.collect { > case i: InMemoryRelation => i > }.size == 1) > {code} > and this will fail as: > {noformat} > ArrayBuffer() had size 0 instead of expected size 1 > {noformat} > The reason is that when do {{spark.catalog.cacheTable("localRelation")}}, > {{CacheManager}} tries to cache for the plan wrapped by {{SubqueryAlias}} but > when planning for the DataFrame {{localRelation}}, {{CacheManager}} tries to > find cached table for the not-wrapped plan because the plan for DataFrame > {{localRelation}} is not wrapped. > Some plans like {{LocalRelation}}, {{LogicalRDD}}, etc. override > {{sameResult}} method, but not use canonicalized plan to compare so the > {{CacheManager}} can't detect the plans are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.
[ https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15915: Assignee: (was: Apache Spark) > CacheManager should use canonicalized plan for planToCache. > --- > > Key: SPARK-15915 > URL: https://issues.apache.org/jira/browse/SPARK-15915 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{DataFrame}} with plan overriding {{sameResult}} but not using canonicalized > plan to compare can't cacheTable. > The example is like: > {code} > val localRelation = Seq(1, 2, 3).toDF() > localRelation.createOrReplaceTempView("localRelation") > spark.catalog.cacheTable("localRelation") > assert( > localRelation.queryExecution.withCachedData.collect { > case i: InMemoryRelation => i > }.size == 1) > {code} > and this will fail as: > {noformat} > ArrayBuffer() had size 0 instead of expected size 1 > {noformat} > The reason is that when do {{spark.catalog.cacheTable("localRelation")}}, > {{CacheManager}} tries to cache for the plan wrapped by {{SubqueryAlias}} but > when planning for the DataFrame {{localRelation}}, {{CacheManager}} tries to > find cached table for the not-wrapped plan because the plan for DataFrame > {{localRelation}} is not wrapped. > Some plans like {{LocalRelation}}, {{LogicalRDD}}, etc. override > {{sameResult}} method, but not use canonicalized plan to compare so the > {{CacheManager}} can't detect the plans are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.
[ https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15915: Assignee: Apache Spark > CacheManager should use canonicalized plan for planToCache. > --- > > Key: SPARK-15915 > URL: https://issues.apache.org/jira/browse/SPARK-15915 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Apache Spark > > {{DataFrame}} with plan overriding {{sameResult}} but not using canonicalized > plan to compare can't cacheTable. > The example is like: > {code} > val localRelation = Seq(1, 2, 3).toDF() > localRelation.createOrReplaceTempView("localRelation") > spark.catalog.cacheTable("localRelation") > assert( > localRelation.queryExecution.withCachedData.collect { > case i: InMemoryRelation => i > }.size == 1) > {code} > and this will fail as: > {noformat} > ArrayBuffer() had size 0 instead of expected size 1 > {noformat} > The reason is that when do {{spark.catalog.cacheTable("localRelation")}}, > {{CacheManager}} tries to cache for the plan wrapped by {{SubqueryAlias}} but > when planning for the DataFrame {{localRelation}}, {{CacheManager}} tries to > find cached table for the not-wrapped plan because the plan for DataFrame > {{localRelation}} is not wrapped. > Some plans like {{LocalRelation}}, {{LogicalRDD}}, etc. override > {{sameResult}} method, but not use canonicalized plan to compare so the > {{CacheManager}} can't detect the plans are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence
Piotr Czarnas created SPARK-15916: - Summary: JDBC AND/OR operator push down does not respect lower OR operator precedence Key: SPARK-15916 URL: https://issues.apache.org/jira/browse/SPARK-15916 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Piotr Czarnas A table from the SQL Server Northwind database was registered as a JDBC DataFrame. A query was executed on Spark SQL; the "northwind_dbo_Categories" table is a temporary table backed by a JDBC DataFrame over the "[northwind].[dbo].[Categories]" SQL Server table. SQL executed on the Spark SQL context: SELECT CategoryID FROM northwind_dbo_Categories WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' Spark performed the predicate pushdown to JDBC, but the parentheses around the two OR conditions were removed, and the following query was sent over JDBC to SQL Server instead: SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' As a result, the last two conditions (around the AND operator) were evaluated with the highest precedence: (CategoryID = 2) AND CategoryName = 'Beverages' Finally, SQL Server executed a query equivalent to: SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 OR (CategoryID = 2 AND CategoryName = 'Beverages') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
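The fix direction can be illustrated with a small, self-contained Scala sketch -- a toy filter-to-SQL compiler, not Spark's actual JDBC data source code -- in which each compiled side of an AND/OR is parenthesized as a whole, so the original precedence survives in the generated WHERE clause. All names below are hypothetical.
{code}
// Toy filter-to-SQL compiler -- illustrative only.
sealed trait Filter
case class EqualTo(attr: String, value: Any) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter

object ToySqlCompiler {
  private def literal(v: Any): String = v match {
    case s: String => s"'$s'"
    case other     => other.toString
  }
  def compile(f: Filter): String = f match {
    case EqualTo(a, v) => s"$a = ${literal(v)}"
    // Parenthesize the *whole* left and right sub-expressions, not just leaf comparisons:
    case And(l, r)     => s"(${compile(l)}) AND (${compile(r)})"
    case Or(l, r)      => s"(${compile(l)}) OR (${compile(r)})"
  }
}

object PushdownDemo extends App {
  val filter = And(Or(EqualTo("CategoryID", 1), EqualTo("CategoryID", 2)),
                   EqualTo("CategoryName", "Beverages"))
  // prints ((CategoryID = 1) OR (CategoryID = 2)) AND (CategoryName = 'Beverages')
  println(ToySqlCompiler.compile(filter))
}
{code}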
[jira] [Commented] (SPARK-14503) spark.ml API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326977#comment-15326977 ] Jeff Zhang commented on SPARK-14503: [~GayathriMurali] [~yuhaoyan] Do you still work on this? If not, I can help continue it. > spark.ml API for FPGrowth > - > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326979#comment-15326979 ] Sean Owen commented on SPARK-15796: --- A new parameter like that would just be going back to the old behavior, and I think there was a good reason to simplify the settings (see above). I agree that it seems like we need more breathing room, so I would argue for the 0.6 limit as well now, and some more extensive documentation about what to do to NewRatio when increasing this. NewRatio N needs to be large enough so that N/(N+1) comfortably exceeds {{spark.memory.fraction}}. > Reduce spark.memory.fraction default to avoid overrunning old gen in JVM > default config > --- > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Minor > Attachments: baseline.txt, memfrac06.txt, memfrac063.txt, > memfrac066.txt > > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper-limit but a lower limit-like > thing on the size of Spark's cache. The real upper limit is > spark.memory.fraction. > To sum up my questions/issues: > * At least http://spark.apache.org/
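As a quick check of the NewRatio arithmetic mentioned in the comment above (illustrative only): with {{-XX:NewRatio=N}} the old generation is roughly N/(N+1) of the heap, and that fraction should comfortably exceed {{spark.memory.fraction}}.
{code}
// Sanity check of the old-generation vs. spark.memory.fraction arithmetic -- illustrative only.
object NewRatioCheck extends App {
  // Old-generation share of the heap for -XX:NewRatio=N is roughly N/(N+1).
  def oldGenFraction(newRatio: Int): Double = newRatio.toDouble / (newRatio + 1)

  val defaultNewRatio = 2                  // OpenJDK default -> old gen is ~0.67 of the heap
  val oldGen = oldGenFraction(defaultNewRatio)
  Seq(0.75, 0.6).foreach { fraction =>     // the 1.6 default vs. the proposed 0.6
    println(f"spark.memory.fraction=$fraction  old gen=$oldGen%.2f  fits: ${oldGen > fraction}")
  }
  // With fraction=0.75 and NewRatio=2, cached data does not fit in the old generation,
  // which matches the repeated full GC cycles reported in this issue.
}
{code}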
[jira] [Commented] (SPARK-14503) spark.ml API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326994#comment-15326994 ] yuhao yang commented on SPARK-14503: Hi Jeff, you're welcome to contribute. I'm discussing with some industry users what the optimal interface for FPM would be, especially what the output column should contain. I'd appreciate it if you could share some thoughts. > spark.ml API for FPGrowth > - > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15796: -- Priority: Blocker (was: Minor) Pardon marking this "Blocker", but I think this needs some attention before 2.0, if in fact the default memory settings for the new memory manager and JVM ergonomics don't play well together. It's an easy resolution one way or the other -- mostly a question of defaults and docs. > Reduce spark.memory.fraction default to avoid overrunning old gen in JVM > default config > --- > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Blocker > Attachments: baseline.txt, memfrac06.txt, memfrac063.txt, > memfrac066.txt > > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. 
Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper-limit but a lower limit-like > thing on the size of Spark's cache. The real upper limit is > spark.memory.fraction. > To sum up my questions/issues: > * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. > Maybe the old generation size should also be mentioned in configuration.html > near spark.memory.fraction. > * Is it a goal for Spark to
[jira] [Resolved] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15813. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13552 [https://github.com/apache/spark/pull/13552] > Spark Dyn Allocation Cancel log message misleading > -- > > Key: SPARK-15813 > URL: https://issues.apache.org/jira/browse/SPARK-15813 > Project: Spark > Issue Type: Bug >Reporter: Peter Ableda >Priority: Trivial > Fix For: 2.0.0 > > > *Driver requested* message is logged before the *Canceling* message but has > the updated executor number. The messages are misleading. > See log snippet: > {code} > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 619 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage > 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage > 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage > 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage > 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage > 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage > 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage > 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage > 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage > 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage > 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 614 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage > 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 > bytes) > {code} > The easy solution is to update the message to use past tense. This is > consistent with the other messages there. > *Canceled requests for 5 executor container(s).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15813: -- Assignee: Peter Ableda > Spark Dyn Allocation Cancel log message misleading > -- > > Key: SPARK-15813 > URL: https://issues.apache.org/jira/browse/SPARK-15813 > Project: Spark > Issue Type: Bug >Reporter: Peter Ableda >Assignee: Peter Ableda >Priority: Trivial > Fix For: 2.0.0 > > > *Driver requested* message is logged before the *Canceling* message but has > the updated executor number. The messages are misleading. > See log snippet: > {code} > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 619 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage > 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage > 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage > 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage > 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage > 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage > 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage > 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage > 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage > 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage > 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 614 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage > 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 > bytes) > {code} > The easy solution is to update the message to use past tense. This is > consistent with the other messages there. > *Canceled requests for 5 executor container(s).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15813: -- Issue Type: Improvement (was: Bug) > Spark Dyn Allocation Cancel log message misleading > -- > > Key: SPARK-15813 > URL: https://issues.apache.org/jira/browse/SPARK-15813 > Project: Spark > Issue Type: Improvement >Reporter: Peter Ableda >Assignee: Peter Ableda >Priority: Trivial > Fix For: 2.0.0 > > > *Driver requested* message is logged before the *Canceling* message but has > the updated executor number. The messages are misleading. > See log snippet: > {code} > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 619 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage > 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage > 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage > 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage > 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage > 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage > 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage > 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage > 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage > 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage > 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 614 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage > 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 > bytes) > {code} > The easy solution is to update the message to use past tense. This is > consistent with the other messages there. > *Canceled requests for 5 executor container(s).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6320) Adding new query plan strategy to SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6320: - Assignee: Takuya Ueshin > Adding new query plan strategy to SQLContext > > > Key: SPARK-6320 > URL: https://issues.apache.org/jira/browse/SPARK-6320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Youssef Hatem >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 2.0.0 > > > Hi, > I would like to add a new strategy to {{SQLContext}}. To do this I created a > new class which extends {{Strategy}}. In my new class I need to call > {{planLater}} function. However this method is defined in {{SparkPlanner}} > (which itself inherits the method from {{QueryPlanner}}). > To my knowledge the only way to make {{planLater}} function visible to my new > strategy is to define my strategy inside another class that extends > {{SparkPlanner}} and inherits {{planLater}} as a result, by doing so I will > have to extend the {{SQLContext}} such that I can override the {{planner}} > field with the new {{Planner}} class I created. > It seems that this is a design problem because adding a new strategy seems to > require extending {{SQLContext}} (unless I am doing it wrong and there is a > better way to do it). > Thanks a lot, > Youssef -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
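The extension problem described in SPARK-6320 can be illustrated with a simplified, self-contained Scala model -- not Spark's actual Catalyst planner API -- where {{planLater}} is a protected member of the planner, so a strategy that needs it must be defined inside a planner subclass, which in turn pushes users toward extending the context that owns the planner. All names below are hypothetical.
{code}
// Simplified model of the planner-extension problem -- illustrative only.
object PlannerExtensionDemo extends App {

  sealed trait LogicalPlan
  case class Scan(name: String) extends LogicalPlan
  case class Limit(n: Int, child: LogicalPlan) extends LogicalPlan

  case class PhysicalPlan(description: String)

  abstract class ToyPlanner {
    type Strategy = LogicalPlan => Option[PhysicalPlan]
    def strategies: Seq[Strategy]

    // Only visible to the planner and its subclasses -- this is the crux of the issue.
    protected def planLater(plan: LogicalPlan): PhysicalPlan =
      strategies.iterator.flatMap(_(plan)).next()

    def plan(logical: LogicalPlan): PhysicalPlan = planLater(logical)
  }

  // A user-defined strategy that needs planLater has to live inside a planner subclass:
  class MyPlanner extends ToyPlanner {
    private val scanStrategy: Strategy = {
      case Scan(name) => Some(PhysicalPlan(s"TableScan($name)"))
      case _          => None
    }
    private val limitStrategy: Strategy = {
      case Limit(n, child) => Some(PhysicalPlan(s"TakeOrdered($n, ${planLater(child).description})"))
      case _               => None
    }
    override def strategies: Seq[Strategy] = Seq(limitStrategy, scanStrategy)
  }

  println(new MyPlanner().plan(Limit(10, Scan("t"))))  // prints PhysicalPlan(TakeOrdered(10, TableScan(t)))
}
{code}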
[jira] [Updated] (SPARK-15788) PySpark IDFModel missing "idf" property
[ https://issues.apache.org/jira/browse/SPARK-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15788: -- Assignee: Jeff Zhang > PySpark IDFModel missing "idf" property > --- > > Key: SPARK-15788 > URL: https://issues.apache.org/jira/browse/SPARK-15788 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Nick Pentreath >Assignee: Jeff Zhang >Priority: Trivial > Fix For: 2.0.0 > > > Scala {{IDFModel}} has a method {{def idf: Vector = idfModel.idf.asML}} - > this should be exposed on the Python side as a property -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings
[ https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15489: -- Assignee: Amit Sela > Dataset kryo encoder won't load custom user settings > - > > Key: SPARK-15489 > URL: https://issues.apache.org/jira/browse/SPARK-15489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Amit Sela >Assignee: Amit Sela > Fix For: 2.0.0 > > > When setting a custom "spark.kryo.registrator" (or any other configuration > for that matter) through the API, this configuration will not propagate to > the encoder that uses a KryoSerializer since it instantiates with "new > SparkConf()". > See: > https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554 > This could be hacked by providing those configurations as System properties, > but this probably should be passed to the encoder and set in the > SerializerInstance after creation. > Example: > When using Encoders with kryo to encode generically typed Objects in the > following manner: > public static Encoder encoder() { > return Encoders.kryo((Class) Object.class); > } > I get a decoding exception when trying to decode > `java.util.Collections$UnmodifiableCollection`, which probably comes from > Guava's `ImmutableList`. > This happens when running with master = local[1]. Same code had no problems > with RDD api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
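The root cause described in SPARK-15489 can be seen with a short sketch (assuming only the spark-core dependency on the classpath): a fresh {{new SparkConf()}} reads JVM system properties, so settings applied programmatically to the conf handed to {{SparkContext}} never reach a component that constructs its own {{SparkConf}} internally, as the encoder's serializer does here. The registrator class name below is hypothetical.
{code}
import org.apache.spark.SparkConf

object ConfPropagationDemo extends App {
  val userConf = new SparkConf()
    .setAppName("conf-demo")
    .set("spark.kryo.registrator", "com.example.MyRegistrator")  // hypothetical class name

  // What a component doing `new SparkConf()` internally would see:
  val internalConf = new SparkConf()

  println(userConf.contains("spark.kryo.registrator"))      // true
  println(internalConf.contains("spark.kryo.registrator"))  // false, unless the same key was also
                                                            // set as -Dspark.kryo.registrator=...
}
{code}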
[jira] [Updated] (SPARK-15743) Prevent saving with all-column partitioning
[ https://issues.apache.org/jira/browse/SPARK-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15743: -- Assignee: Dongjoon Hyun > Prevent saving with all-column partitioning > --- > > Key: SPARK-15743 > URL: https://issues.apache.org/jira/browse/SPARK-15743 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Labels: releasenotes > Fix For: 2.0.0 > > > When saving datasets on storage, `partitionBy` provides an easy way to > construct the directory structure. However, if a user choose all columns as > partition columns, some exceptions occurs. > - ORC: `AnalysisException` on **future read** due to schema inference failure. > - Parquet: `InvalidSchemaException` on **write execution** due to Parquet > limitation. > The followings are the examples. > **ORC with all column partitioning** > {code} > scala> > spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data") > > > scala> spark.read.format("orc").load("/tmp/data").collect() > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at > /tmp/data. It must be specified manually; > {code} > **Parquet with all-column partitioning** > {code} > scala> > spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data") > [Stage 0:> (0 + 8) / > 8]16/06/02 16:51:17 ERROR Utils: Aborting task > org.apache.parquet.schema.InvalidSchemaException: A group type can not be > empty. Parquet does not support empty group without leaves. Empty group: > spark_schema > ... (lots of error messages) > {code} > Although some formats like JSON support all-column partitioning without any > problem, it seems not a good idea to make lots of empty directories. > This issue prevents this by consistently raising `AnalysisException` before > saving. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
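A standalone sketch of the kind of validation SPARK-15743 proposes -- not the actual Spark patch -- is shown below: reject the write when the partition columns cover every column of the schema, before any files are written. The class and method names are hypothetical stand-ins.
{code}
// Standalone sketch of the proposed all-column-partitioning check -- illustrative only.
object PartitioningCheck {
  final case class AnalysisException(message: String) extends Exception(message)

  def validatePartitionColumns(schemaFields: Seq[String], partitionColumns: Seq[String]): Unit = {
    val dataColumns = schemaFields.filterNot(partitionColumns.contains)
    if (dataColumns.isEmpty) {
      throw AnalysisException(
        s"Cannot use all columns for partition columns: ${partitionColumns.mkString(", ")}")
    }
  }
}

object PartitioningCheckDemo extends App {
  // Mirrors the ORC/Parquet examples above: a single column `id`, partitioned by `id`.
  try PartitioningCheck.validatePartitionColumns(Seq("id"), Seq("id"))
  catch { case e: PartitioningCheck.AnalysisException => println(e.getMessage) }
}
{code}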
[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML
[ https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327028#comment-15327028 ] Nick Pentreath commented on SPARK-15790: Ah thanks - I missed that umbrella. It's actually mostly the {{ml.feature}} classes, and that PR seems to have stalled. I've started on a new one to cover the feature package. > Audit @Since annotations in ML > -- > > Key: SPARK-15790 > URL: https://issues.apache.org/jira/browse/SPARK-15790 > Project: Spark > Issue Type: Documentation > Components: ML, PySpark >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > Many classes & methods in ML are missing {{@Since}} annotations. Audit what's > missing and add annotations to public API constructors, vals and methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table
[ https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327065#comment-15327065 ] Murshid Chalaev commented on SPARK-6628: Spark 1.6.1 is affected as well, is there any workaround for this? > ClassCastException occurs when executing sql statement "insert into" on hbase > table > --- > > Key: SPARK-6628 > URL: https://issues.apache.org/jira/browse/SPARK-6628 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula > > Error: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in > stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: > org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to > org.apache.hadoop.hive.ql.io.HiveOutputFormat > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327082#comment-15327082 ] yuhao yang commented on SPARK-15904: Hi [~Purple], what's your k and vector size? Btw, this should not be a major bug. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327090#comment-15327090 ] Alessio commented on SPARK-15904: - Hi [~yuhaoyan], the dataset size is 9120 rows and 2125 columns. The problem appears when K > 3000. What do you suggest as the priority label? I'm sorry if "major" is not appropriate; this is my first post on JIRA. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
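For the numbers reported above, a rough back-of-the-envelope sizing (illustrative only, ignoring JVM object overhead and any extra copies kept during aggregation) suggests why K > 3000 with 2125-dimensional vectors gets heavy on a local[*] driver heap:
{code}
// Rough sizing of the dataset and cluster centers for the reported numbers -- illustrative only.
object KMeansSizing extends App {
  val rows = 9120
  val dims = 2125
  val k    = 3000
  val bytesPerDouble = 8L

  val datasetBytes = rows.toLong * dims * bytesPerDouble   // roughly 150 MB of raw doubles
  val centersBytes = k.toLong * dims * bytesPerDouble      // roughly 50 MB per copy of the centers

  def mb(bytes: Long): Long = bytes / (1024 * 1024)
  println(s"dataset ~${mb(datasetBytes)} MB, one set of centers ~${mb(centersBytes)} MB")
  // Each iteration ships a copy of the centers to the workers and accumulates per-partition
  // sums of the same size, so several ~50 MB arrays can be live at once on a small local heap.
}
{code}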
[jira] [Commented] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence
[ https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327106#comment-15327106 ] Hyukjin Kwon commented on SPARK-15916: -- Indeed. Do you mind if I submit a PR for this? > JDBC AND/OR operator push down does not respect lower OR operator precedence > > > Key: SPARK-15916 > URL: https://issues.apache.org/jira/browse/SPARK-15916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Piotr Czarnas > > A table from sql server Northwind database was registered as a JDBC dataframe. > A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a > temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" > sql server table: > SQL executed on Spark sql context: > SELECT CategoryID FROM northwind_dbo_Categories > WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' > Spark has done a proper predicate pushdown to JDBC, however parenthesis > around two OR conditions was removed. Instead the following query was sent > over JDBC to SQL Server: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = > 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' > As a result, the last two conditions (around the AND operator) were > considered as the highest precedence: (CategoryID = 2) AND CategoryName = > 'Beverages' > Finally SQL Server has executed a query like this: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 > OR (CategoryID = 2 AND CategoryName = 'Beverages') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327108#comment-15327108 ] yuhao yang commented on SPARK-15904: Thanks for reporting it. I'm not sure if the issue is valid for now. Maybe Type -> Improvement, Priority -> minor as a start. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
Jonathan Taws created SPARK-15917: - Summary: Define the number of executors in standalone mode with an easy-to-use property Key: SPARK-15917 URL: https://issues.apache.org/jira/browse/SPARK-15917 Project: Spark Issue Type: Improvement Components: Spark Core, Spark Shell, Spark Submit Affects Versions: 1.6.1 Reporter: Jonathan Taws Priority: Minor After stumbling across a few StackOverflow posts around the issue of using a fixed number of executors in standalone mode (non-YARN), I was wondering if we could not add an easier way to set this parameter than having to resort to some calculations based on the number of cores and the memory you have available on your worker. For example, let's say I have 8 cores and 30GB of memory available. If no option is passed, one executor will be spawned with 8 cores and 1GB of memory allocated. However, let's say I want to have only *2* executors, and to use 2 cores and 10GB of memory per executor, I will end up with *3* executors (as the available memory will limit the number of executors) instead of the 2 I was hoping for. Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I want, but would it not be easier to add a {{--num-executors}}-like option to standalone mode to be able to really fine-tune the configuration ? This option is already available in YARN mode. >From my understanding, I don't see any other option lying around that can help >achieve this. This seems to be slightly disturbing for newcomers, and standalone mode is probably the first thing anyone will use to just try out Spark or test some configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Taws updated SPARK-15917: -- Description: After stumbling across a few StackOverflow posts around the issue of using a fixed number of executors in standalone mode (non-YARN), I was wondering if we could not add an easier way to set this parameter than having to resort to some calculations based on the number of cores and the memory you have available on your worker. For example, let's say I have 8 cores and 30GB of memory available : - If no option is passed, one executor will be spawned with 8 cores and 1GB of memory allocated. - However, if I want to have only *2* executors, and to use 2 cores and 10GB of memory per executor, I will end up with *3* executors (as the available memory will limit the number of executors) instead of the 2 I was hoping for. Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I want, but would it not be easier to add a {{--num-executors}}-like option to standalone mode to be able to really fine-tune the configuration ? This option is already available in YARN mode. >From my understanding, I don't see any other option lying around that can help >achieve this. This seems to be slightly disturbing for newcomers, and standalone mode is probably the first thing anyone will use to just try out Spark or test some configuration. was: After stumbling across a few StackOverflow posts around the issue of using a fixed number of executors in standalone mode (non-YARN), I was wondering if we could not add an easier way to set this parameter than having to resort to some calculations based on the number of cores and the memory you have available on your worker. For example, let's say I have 8 cores and 30GB of memory available. If no option is passed, one executor will be spawned with 8 cores and 1GB of memory allocated. However, let's say I want to have only *2* executors, and to use 2 cores and 10GB of memory per executor, I will end up with *3* executors (as the available memory will limit the number of executors) instead of the 2 I was hoping for. Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I want, but would it not be easier to add a {{--num-executors}}-like option to standalone mode to be able to really fine-tune the configuration ? This option is already available in YARN mode. >From my understanding, I don't see any other option lying around that can help >achieve this. This seems to be slightly disturbing for newcomers, and standalone mode is probably the first thing anyone will use to just try out Spark or test some configuration. > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. 
> - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. > Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. > From my understanding, I don't see any other option lying around that can > help achieve this. > This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
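A simplified model of how a single standalone worker's resources bound the executor count -- not the actual Master scheduling code -- reproduces the example above and shows why {{spark.cores.max}} is currently the workaround. The helper name below is hypothetical.
{code}
// Simplified model of executor packing on one standalone worker -- illustrative only.
object StandaloneExecutorEstimate extends App {
  def executorsOnWorker(workerCores: Int, workerMemGb: Int,
                        executorCores: Int, executorMemGb: Int,
                        coresMaxCap: Option[Int] = None): Int = {
    val byCores  = workerCores / executorCores
    val byMemory = workerMemGb / executorMemGb
    val byCap    = coresMaxCap.map(_ / executorCores).getOrElse(Int.MaxValue)
    Seq(byCores, byMemory, byCap).min
  }

  // 8 cores and 30 GB on the worker; 2 cores and 10 GB per executor:
  println(executorsOnWorker(8, 30, 2, 10))                        // 3 -- memory is the limit
  // Capping total cores with spark.cores.max=4 is the current workaround to get exactly 2:
  println(executorsOnWorker(8, 30, 2, 10, coresMaxCap = Some(4))) // 2
}
{code}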
[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alessio updated SPARK-15904: Issue Type: Improvement (was: Bug) > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alessio updated SPARK-15904: Priority: Minor (was: Major) > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence
[ https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327118#comment-15327118 ] Piotr Czarnas commented on SPARK-15916: --- Hi, I wish so. This issue is failing a lot of tests in my project. Best Regards, Piotr On Mon, Jun 13, 2016 at 12:00 PM, Hyukjin Kwon (JIRA) > JDBC AND/OR operator push down does not respect lower OR operator precedence > > > Key: SPARK-15916 > URL: https://issues.apache.org/jira/browse/SPARK-15916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Piotr Czarnas > > A table from sql server Northwind database was registered as a JDBC dataframe. > A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a > temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" > sql server table: > SQL executed on Spark sql context: > SELECT CategoryID FROM northwind_dbo_Categories > WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' > Spark has done a proper predicate pushdown to JDBC, however parenthesis > around two OR conditions was removed. Instead the following query was sent > over JDBC to SQL Server: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = > 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' > As a result, the last two conditions (around the AND operator) were > considered as the highest precedence: (CategoryID = 2) AND CategoryName = > 'Beverages' > Finally SQL Server has executed a query like this: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 > OR (CategoryID = 2 AND CategoryName = 'Beverages') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
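For reference, the same predicate expressed through the DataFrame API goes through the same JDBC pushdown path, so the grouping can be checked against the generated plan; a minimal PySpark sketch, where jdbc_url and the table name are placeholders:
{code}
from pyspark.sql.functions import col

# Placeholder connection details; substitute a real SQL Server JDBC URL and table.
categories = sqlContext.read.jdbc(url=jdbc_url, table="Categories")

# Explicit grouping: (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages'
filtered = categories.filter(
    ((col("CategoryID") == 1) | (col("CategoryID") == 2)) &
    (col("CategoryName") == "Beverages"))

# The pushed-down JDBC filter can be inspected in the physical plan output.
filtered.explain(True)
{code}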
[jira] [Commented] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence
[ https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327144#comment-15327144 ] Apache Spark commented on SPARK-15916: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/13640 > JDBC AND/OR operator push down does not respect lower OR operator precedence > > > Key: SPARK-15916 > URL: https://issues.apache.org/jira/browse/SPARK-15916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Piotr Czarnas > > A table from sql server Northwind database was registered as a JDBC dataframe. > A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a > temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" > sql server table: > SQL executed on Spark sql context: > SELECT CategoryID FROM northwind_dbo_Categories > WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' > Spark has done a proper predicate pushdown to JDBC, however parenthesis > around two OR conditions was removed. Instead the following query was sent > over JDBC to SQL Server: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = > 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' > As a result, the last two conditions (around the AND operator) were > considered as the highest precedence: (CategoryID = 2) AND CategoryName = > 'Beverages' > Finally SQL Server has executed a query like this: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 > OR (CategoryID = 2 AND CategoryName = 'Beverages') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence
[ https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15916: Assignee: (was: Apache Spark) > JDBC AND/OR operator push down does not respect lower OR operator precedence > > > Key: SPARK-15916 > URL: https://issues.apache.org/jira/browse/SPARK-15916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Piotr Czarnas > > A table from sql server Northwind database was registered as a JDBC dataframe. > A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a > temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" > sql server table: > SQL executed on Spark sql context: > SELECT CategoryID FROM northwind_dbo_Categories > WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' > Spark has done a proper predicate pushdown to JDBC, however parenthesis > around two OR conditions was removed. Instead the following query was sent > over JDBC to SQL Server: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = > 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' > As a result, the last two conditions (around the AND operator) were > considered as the highest precedence: (CategoryID = 2) AND CategoryName = > 'Beverages' > Finally SQL Server has executed a query like this: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 > OR (CategoryID = 2 AND CategoryName = 'Beverages') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence
[ https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15916: Assignee: Apache Spark > JDBC AND/OR operator push down does not respect lower OR operator precedence > > > Key: SPARK-15916 > URL: https://issues.apache.org/jira/browse/SPARK-15916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Piotr Czarnas >Assignee: Apache Spark > > A table from sql server Northwind database was registered as a JDBC dataframe. > A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a > temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" > sql server table: > SQL executed on Spark sql context: > SELECT CategoryID FROM northwind_dbo_Categories > WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages' > Spark has done a proper predicate pushdown to JDBC, however parenthesis > around two OR conditions was removed. Instead the following query was sent > over JDBC to SQL Server: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = > 1) OR (CategoryID = 2) AND CategoryName = 'Beverages' > As a result, the last two conditions (around the AND operator) were > considered as the highest precedence: (CategoryID = 2) AND CategoryName = > 'Beverages' > Finally SQL Server has executed a query like this: > SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 > OR (CategoryID = 2 AND CategoryName = 'Beverages') -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Milanowski reopened SPARK-15345: -- Does not work as expected when using spark-submit; for example, this works fine and prints all databases in Hive storage
{code}
# file test_db.py
from pyspark.sql import SparkSession
from pyspark import SparkConf

if __name__ == "__main__":
    conf = SparkConf()
    hive_context = (SparkSession.builder.config(conf=conf)
                    .enableHiveSupport().getOrCreate())
    print(hive_context.sql("show databases").collect())
{code}
However, using HiveContext yields only 'default' database:
{code}
# file test.py
from pyspark.sql import HiveContext
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    hive_context = HiveContext(sc)
    print(hive_context.sql("show databases").collect())
    # The result is
    # [Row(result='default')]
{code}
Is there something I am still missing? I am using the newest branch-2.0
> SparkSession's conf doesn't take effect when there's already an existing > SparkContext > - > > Key: SPARK-15345 > URL: https://issues.apache.org/jira/browse/SPARK-15345 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Piotr Milanowski >Assignee: Reynold Xin >Priority: Blocker > Fix For: 2.0.0 > > > I am working with branch-2.0, spark is compiled with hive support (-Phive and > -Phive-thriftserver). > I am trying to access databases using this snippet: > {code} > from pyspark.sql import HiveContext > hc = HiveContext(sc) > hc.sql("show databases").collect() > [Row(result='default')] > {code} > This means that spark doesn't find any databases specified in configuration. > Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark > 1.6, and launching above snippet, I can print out existing databases. 
> When run in DEBUG mode this is what spark (2.0) prints out: > {code} > 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases > 16/05/16 12:17:47 DEBUG SimpleAnalyzer: > === Result of Batch Resolution === > !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, > string])) null else input[0, string].toString, > StructField(result,StringType,false)), result#2) AS #3] Project > [createexternalrow(if (isnull(result#2)) null else result#2.toString, > StructField(result,StringType,false)) AS #3] > +- LocalRelation [result#2] > > +- LocalRelation [result#2] > > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.Dataset$$anonfun$53) +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: private final > org.apache.spark.sql.types.StructType > org.apache.spark.sql.Dataset$$anonfun$53.structType$1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) > +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToP
[jira] [Created] (SPARK-15918) unionAll returns wrong result when two dataframes have schemas in different order
Prabhu Joseph created SPARK-15918: - Summary: unionAll returns wrong result when two dataframes has schema in different order Key: SPARK-15918 URL: https://issues.apache.org/jira/browse/SPARK-15918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Environment: CentOS Reporter: Prabhu Joseph Fix For: 1.6.1 On applying unionAll operation between A and B dataframes, they both has same schema but in different order and hence the result has column value mapping changed. Repro: {code} A.show() +---++---+--+--+-++---+--+---+---+-+ |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| +---++---+--+--+-++---+--+---+---+-+ +---++---+--+--+-++---+--+---+---+-+ B.show() +-+---+--+---+---+--+--+--+---+---+--++ |dtype|tag| time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| +-+---+--+---+---+--+--+--+---+---+--++ |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F| C_FNHXUT718.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUT703Z.CNSTLO|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUR716A.CNSTLO|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUT803Z.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F| C_FNHXUT728.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F| C_FNHXUR806.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| +-+---+--+---+---+--+--+--+---+---+--++ A = A.unionAll(B) A.show() +---+---+--+--+--+-++---+--+---+---+-+ |tag| year_day| tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| +---+---+--+--+--+-++---+--+---+---+-+ | F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| | F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| | F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| | F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| | F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| | F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| | F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| | F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0| 10| 0| 275| 2015| 1.2345|2015275.0| +---+---+--+--+--+-++---+--+---+---+-+ {code} On changing the schema of A according to B and doing unionAll works fine {code} C = A.select("dtype","tag","time","tm_hour","tm_mday","tm_min",”tm_mon”,"tm_sec","tm_yday","tm_year","value","year_day") A = C.unionAll(B) A.show() +-+---+--+---+---+--+--+--+---+---+--++ |dtype|tag| time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| +-+---+--+---+---+--+--+--+---+---+--++ |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F| C_FNHXUT718.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUT703Z.CNSTLO|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUR716A.CNSTLO|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F|C_FNHXUT803Z.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F| C_FNHXUT728.CNSTHI|1443790800| 13| 2| 0|10| 0| 275| 2015|1.2345| 2015275| |F| C_FNHXUR806.CNSTHI|1443790800| 13| 2| 0|
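A common workaround for the positional column matching described above is to realign one DataFrame's columns before the union; a minimal PySpark sketch, assuming DataFrames named A and B as in the report:
{code}
# unionAll in 1.6 matches columns by position, not by name, so reorder one
# side to the other's column order before unioning.
B_aligned = B.select(*A.columns)
result = A.unionAll(B_aligned)
result.show()
{code}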
[jira] [Created] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
Aamir Abbas created SPARK-15919: --- Summary: DStream "saveAsTextFile" doesn't update the prefix after each checkpoint Key: SPARK-15919 URL: https://issues.apache.org/jira/browse/SPARK-15919 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.6.1 Environment: Amazon EMR Reporter: Aamir Abbas I have a Spark streaming job that reads a data stream, and saves it as a text file after a predefined time interval. In the function stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); The function getOutputPath() generates a new path every time the function is called, depending on the current system time. However, the output path prefix remains the same for all the batches, which effectively means that function is not called again for the next batch of the stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8546) PMML export for Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327167#comment-15327167 ] Radoslaw Gasiorek commented on SPARK-8546: -- hi there, [~josephkb] We would like to use MLlib-built models to classify outside Spark, i.e. without a Spark context available. We would like to export the models built in Spark into PMML format, which would then be read by a standalone Java application without a Spark context (but with the MLlib jar). The Java application would load the model from the PMML file and use it to 'predict', or rather 'classify', the new data we get. This feature would enable us to proceed without big architectural and operational changes; without it we might need to make the SparkContext available to the standalone application, which would be a bigger operational and architectural overhead. We might need to use plain Java serialization for the proof of concept anyway, but surely not for a productionized product. Can we prioritize this feature as well as https://issues.apache.org/jira/browse/SPARK-8542 and https://issues.apache.org/jira/browse/SPARK-8543 ? What would be the LOE and ETA for these? thanks guys in advance for responses, and feedback. > PMML export for Naive Bayes > --- > > Key: SPARK-8546 > URL: https://issues.apache.org/jira/browse/SPARK-8546 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Xusen Yin >Priority: Minor > > The naive Bayes section of PMML standard can be found at > http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to > generate PMML for both binomial and multinomial naive Bayes models using > JPMML (maybe [~vfed] can help). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15920) Using map on DataFrame
Piotr Milanowski created SPARK-15920: Summary: Using map on DataFrame Key: SPARK-15920 URL: https://issues.apache.org/jira/browse/SPARK-15920 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Environment: branch-2.0 Reporter: Piotr Milanowski In Spark 1.6 there was a method {{DataFrame.map}} as an alias to {{DataFrame.rdd.map}}. In Spark 2.0 this functionality no longer exists. Is there a preferred way of doing map on a DataFrame without explicitly calling {{DataFrame.rdd.map}}? Maybe this functionality should be kept, just for backward compatibility purposes? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
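A minimal PySpark sketch of the explicit form that still works in 2.0, assuming a DataFrame df with a hypothetical 'name' column:
{code}
# DataFrame.map is gone from the Python API in 2.0; go through the RDD explicitly.
names = df.rdd.map(lambda row: row.name)
print(names.take(5))
{code}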
[jira] [Closed] (SPARK-15293) 'collect_list' function undefined
[ https://issues.apache.org/jira/browse/SPARK-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Milanowski closed SPARK-15293. Works fine, thanks. > 'collect_list' function undefined > - > > Key: SPARK-15293 > URL: https://issues.apache.org/jira/browse/SPARK-15293 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Piotr Milanowski >Assignee: Herman van Hovell > Fix For: 2.0.0 > > > When using pyspark.sql.functions.collect_list function in sql queries, an > error occurs - Undefined function collect_list > Example: > {code} > >>> from pyspark.sql import Row > >>> #The same with SQLContext > >>> from pyspark.sql import HiveContext > >>> from pyspark.sql.functions import collect_list > >>> sql = HiveContext(sc) > >>> rows = [Row(age=20, job='Programmer', name='Alice'), Row(age=21, > >>> job='Programmer', name='Bob'), Row(age=30, job='Hacker', name='Fred'), > >>> Row(age=29, job='PM', name='Tom'), Row(age=50, job='CEO', name='Daisy')] > >>> df = sql.createDataFrame(rows) > >>> df.groupby(df.job).agg(df.job, collect_list(df.age)) > Traceback (most recent call last): > File "/mnt/mfs/spark-2.0/python/pyspark/sql/utils.py", line 57, in deco > return f(*a, **kw) > File "/mnt/mfs/spark-2.0/python/lib/py4j-0.9.2-src.zip/py4j/protocol.py", > line 310, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o193.agg. > : org.apache.spark.sql.AnalysisException: Undefined function: 'collect_list'. > This function is neither a registered temporary function nor a permanent > function registered in the database 'default'.; > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:719) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:781) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$38.apply(Analyzer.scala:907) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$38.apply(Analyzer.scala:907) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:906) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:894) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$
[jira] [Created] (SPARK-15921) Spark unable to read partitioned table in avro format and column name in upper case
Rajkumar Singh created SPARK-15921: -- Summary: Spark unable to read partitioned table in avro format and column name in upper case Key: SPARK-15921 URL: https://issues.apache.org/jira/browse/SPARK-15921 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.6.0 Environment: Centos 6.6 Spark 1.6 Reporter: Rajkumar Singh Reproduce: {code} [root@sandbox ~]# cat file1.csv rks,2016 [root@sandbox ~]# cat file2.csv raj,2015 hive> CREATE TABLE `sample_table`( > `name` string) > PARTITIONED BY ( > `year` int) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1465816403') > ; load data local inpath '/root/file2.csv' overwrite into table sample_table partition(year='2015'); load data local inpath '/root/file1.csv' overwrite into table sample_table partition(year='2016'); hive> CREATE TABLE sample_table_uppercase > PARTITIONeD BY ( YEAR INT) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > TBLPROPERTIES ( >'avro.schema.literal'='{ > "namespace": "com.rishav.avro", >"name": "student_marks", >"type": "record", > "fields": [ { "name":"NANME","type":"string"}] > }'); INSERT OVERWRITE TABLE sample_table_uppercase partition(Year) select name,year from sample_table; hive> select * from sample_table_uppercase; OK raj 2015 rks 2016 now using spark-shell scala>val tbl = sqlContext.table("default.sample_table_uppercase"); scala>tbl.show +++ |name|year| +++ |null|2015| |null|2016| +++ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15921) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajkumar Singh updated SPARK-15921: --- Description: Spark return null value if the field name is uppercase in hive avro partitioned table. Reproduce: {code} [root@sandbox ~]# cat file1.csv rks,2016 [root@sandbox ~]# cat file2.csv raj,2015 hive> CREATE TABLE `sample_table`( > `name` string) > PARTITIONED BY ( > `year` int) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1465816403') > ; load data local inpath '/root/file2.csv' overwrite into table sample_table partition(year='2015'); load data local inpath '/root/file1.csv' overwrite into table sample_table partition(year='2016'); hive> CREATE TABLE sample_table_uppercase > PARTITIONeD BY ( YEAR INT) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > TBLPROPERTIES ( >'avro.schema.literal'='{ > "namespace": "com.rishav.avro", >"name": "student_marks", >"type": "record", > "fields": [ { "name":"NANME","type":"string"}] > }'); INSERT OVERWRITE TABLE sample_table_uppercase partition(Year) select name,year from sample_table; hive> select * from sample_table_uppercase; OK raj 2015 rks 2016 now using spark-shell scala>val tbl = sqlContext.table("default.sample_table_uppercase"); scala>tbl.show +++ |name|year| +++ |null|2015| |null|2016| +++ {code} was: Reproduce: {code} [root@sandbox ~]# cat file1.csv rks,2016 [root@sandbox ~]# cat file2.csv raj,2015 hive> CREATE TABLE `sample_table`( > `name` string) > PARTITIONED BY ( > `year` int) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table' > TBLPROPERTIES ( > 'transient_lastDdlTime'='1465816403') > ; load data local inpath '/root/file2.csv' overwrite into table sample_table partition(year='2015'); load data local inpath '/root/file1.csv' overwrite into table sample_table partition(year='2016'); hive> CREATE TABLE sample_table_uppercase > PARTITIONeD BY ( YEAR INT) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > TBLPROPERTIES ( >'avro.schema.literal'='{ > "namespace": "com.rishav.avro", >"name": "student_marks", >"type": "record", > "fields": [ { "name":"NANME","type":"string"}] > }'); INSERT OVERWRITE TABLE sample_table_uppercase partition(Year) select name,year from sample_table; hive> select * from sample_table_uppercase; OK raj 2015 rks 2016 now using spark-shell scala>val tbl = sqlContext.table("default.sample_table_uppercase"); scala>tbl.show +++ |name|year| +++ |null|2015| |null|2016| +++ {code} > Spark unable to read partitioned table in avro format and column name in > upper case > --- > > Key: SPARK-15921 > URL: https://issues.apache.org/jira/browse/SPARK-15921 > Project: Spark > Issue Type: Bug > Components: Spark Core, 
SQL >Affects Versions: 1.6.0 > Environment: Centos 6.6 > Spark 1.6 >Reporter: Rajkumar Singh > > Spark return null value if the field name is uppercase in hive avro > partitioned table. > Reproduce: > {code} > [root@sandbox ~]# cat file1.csv > rks,2016 > [root@sandbox ~]# cat file2.csv > raj,2015 > hive> CREATE TABLE `sample_table`( > > `name` string) > > PARTITIONED BY ( > > `year` int) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > STORED AS INPUTFORMAT > > 'org.apache.hadoop.mapred.TextInputFormat' > > OUTPUTFORMAT > > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > > LOCATION > > 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse
[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML
[ https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327193#comment-15327193 ] Nick Pentreath commented on SPARK-15790: Yes, I've just looked at things in the concrete classes - params & methods defined in the traits etc are not annotated. > Audit @Since annotations in ML > -- > > Key: SPARK-15790 > URL: https://issues.apache.org/jira/browse/SPARK-15790 > Project: Spark > Issue Type: Documentation > Components: ML, PySpark >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > Many classes & methods in ML are missing {{@Since}} annotations. Audit what's > missing and add annotations to public API constructors, vals and methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10258) Add @Since annotation to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327197#comment-15327197 ] Apache Spark commented on SPARK-10258: -- User 'MLnick' has created a pull request for this issue: https://github.com/apache/spark/pull/13641 > Add @Since annotation to ml.feature > --- > > Key: SPARK-10258 > URL: https://issues.apache.org/jira/browse/SPARK-10258 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Martin Brown >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table
[ https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327201#comment-15327201 ] Teng Qiu commented on SPARK-6628: - this is caused by missing interface implementation in HiveHBaseTableOutputFormat (or HiveAccumuloTableOutputFormat), i created this issue in hive project: https://issues.apache.org/jira/browse/HIVE-13170 and made this PR for hive-accumulo connector (AccumuloStorageHandler): https://github.com/apache/hive/pull/66/files you can do some similar changes for hive-hbase as well. > ClassCastException occurs when executing sql statement "insert into" on hbase > table > --- > > Key: SPARK-6628 > URL: https://issues.apache.org/jira/browse/SPARK-6628 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula > > Error: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in > stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: > org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to > org.apache.hadoop.hive.ql.io.HiveOutputFormat > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15920) Using map on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15920. --- Resolution: Not A Problem Target Version/s: (was: 2.0.0) Don't set Target please, and this question should go to user@ > Using map on DataFrame > -- > > Key: SPARK-15920 > URL: https://issues.apache.org/jira/browse/SPARK-15920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: branch-2.0 >Reporter: Piotr Milanowski > > In Spark 1.6 there was a method {{DataFrame.map}} as an alias to > {{DataFrame.rdd.map}}. In spark 2.0 this functionality no longer exists. > Is there a preferred way of doing map on a DataFrame without explicitly > calling {{DataFrame.rdd.map}}? Maybe this functionality should be kept, just > for backward compatibility purpose? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8546) PMML export for Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327205#comment-15327205 ] Villu Ruusmann commented on SPARK-8546: --- Hi [~rgasiorek] - would it be an option to re-build your models in Spark ML instead of MLlib? I have been working on Spark ML pipelines-to-PMML converter called JPMML-SparkML (https://github.com/jpmml/jpmml-sparkml), which could fully address your use case then. JPMML-SparkML supports all tree-based models and the majority of non-NLP domain transformations. It would be possible to add support for the `classification.NaiveBayesModel` model type in a day or two if needed. > PMML export for Naive Bayes > --- > > Key: SPARK-8546 > URL: https://issues.apache.org/jira/browse/SPARK-8546 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Xusen Yin >Priority: Minor > > The naive Bayes section of PMML standard can be found at > http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to > generate PMML for both binomial and multinomial naive Bayes models using > JPMML (maybe [~vfed] can help). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327209#comment-15327209 ] binde commented on SPARK-15919: --- this is not a bug; getOutputPath() is only invoked once, when the job starts. > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream, and saves it as a text > file after a predefined time interval. In the function > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > The function getOutputPath() generates a new path every time the function is > called, depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that function is not called again for the next batch of the > stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327212#comment-15327212 ] Aamir Abbas commented on SPARK-15919: - I need to save the output of each batch in a different place. This is available for a regular Spark job, should be available for streaming data as well. Should I add this as a feature requirement? > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream, and saves it as a text > file after a predefined time interval. In the function > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > The function getOutputPath() generates a new path every time the function is > called, depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that function is not called again for the next batch of the > stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327220#comment-15327220 ] Nick Pentreath commented on SPARK-15904: Could you explain why you're using K>3000 when your dataset has dimension ~2000? > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table
[ https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327224#comment-15327224 ] Murshid Chalaev commented on SPARK-6628: Thank you > ClassCastException occurs when executing sql statement "insert into" on hbase > table > --- > > Key: SPARK-6628 > URL: https://issues.apache.org/jira/browse/SPARK-6628 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: meiyoula > > Error: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in > stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: > org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to > org.apache.hadoop.hive.ql.io.HiveOutputFormat > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115) > at > org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15919. --- Resolution: Not A Problem No, this is simple to accomplish in Spark already. You need to use foreachRDD to get an RDD and timestamp, and use that in your call to saveAsTextFiles > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream, and saves it as a text > file after a predefined time interval. In the function > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > The function getOutputPath() generates a new path every time the function is > called, depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that function is not called again for the next batch of the > stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
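A minimal PySpark sketch of the foreachRDD approach described above; get_output_path is a hypothetical helper that derives a fresh directory from the batch time (the Java API offers an equivalent two-argument foreachRDD overload):
{code}
# Save each batch to its own directory instead of relying on the
# saveAsTextFiles prefix, which is evaluated only once when the DStream is set up.
def save_batch(time, rdd):
    if not rdd.isEmpty():
        rdd.repartition(1).saveAsTextFile(get_output_path(time))

stream.foreachRDD(save_batch)
{code}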
[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327229#comment-15327229 ] Aamir Abbas commented on SPARK-15919: - ForeachRDD is fine in case you want to save individual RDDs separately. I need to do this for entire batch of stream. Could you please share the relevant link to the documentation that can help me save the entire batch of the stream like this? > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream, and saves it as a text > file after a predefined time interval. In the function > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > The function getOutputPath() generates a new path every time the function is > called, depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that function is not called again for the next batch of the > stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327234#comment-15327234 ] Alessio commented on SPARK-15904: - My dataset has 9000+ patterns, each of which has 2000+ attributes. Thus it's perfectly legal to search for K>3000 and (of course) smaller than or equal to the number of patterns (9120) > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327236#comment-15327236 ] Elazar Gershuni commented on SPARK-12623: - At the very least, it should have a "won't fix" status, rather than "resolved". How can I suggest this change to Spark 2.0? > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the argument to mapValues() take a key as an agument? > Alternatively, can we have a "mapKeyValuesToValues" that does? > Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyze it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15746) SchemaUtils.checkColumnType with VectorUDT prints instance details in error message
[ https://issues.apache.org/jira/browse/SPARK-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327237#comment-15327237 ] Nick Pentreath commented on SPARK-15746: I think you can go ahead now - I also vote for the {{case object VectorUDT}} approach. > SchemaUtils.checkColumnType with VectorUDT prints instance details in error > message > --- > > Key: SPARK-15746 > URL: https://issues.apache.org/jira/browse/SPARK-15746 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Nick Pentreath >Priority: Minor > > Currently, many feature transformers in {{ml}} use > {{SchemaUtils.checkColumnType(schema, ..., new VectorUDT)}} to check the > column type is a ({{ml.linalg}}) vector. > The resulting error message contains "instance" info for the {{VectorUDT}}, > i.e. something like this: > {code} > java.lang.IllegalArgumentException: requirement failed: Column features must > be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually > StringType. > {code} > A solution would either be to amend {{SchemaUtils.checkColumnType}} to print > the error message using {{getClass.getName}}, or to create a {{private[spark] > case object VectorUDT extends VectorUDT}} for convenience, since it is used > so often (and incidentally this would make it easier to put {{VectorUDT}} > into lists of data types e.g. schema validation, UDAFs etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aamir Abbas reopened SPARK-15919: - This is an issue, as I do not actually need the current timestamp in the output path. What I need is a new output path for each batch, whatever getOutputPath() returns, rather than a path based on the timestamp. > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream, and saves it as a text > file after a predefined time interval. In the function > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > The function getOutputPath() generates a new path every time the function is > called, depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that function is not called again for the next batch of the > stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327248#comment-15327248 ] Sean Owen commented on SPARK-12623: --- The Status can only be "Resolved". You're referring to the Resolution, which is Not A Problem. I think that's accurate for the original issue here, even if in practice the exact value doesn't matter a lot. If you mean exposing preservesPartitioning on map, yeah I think that's a legitimate change to consider and you can make another JIRA. > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the argument to mapValues() take a key as an agument? > Alternatively, can we have a "mapKeyValuesToValues" that does? > Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyze it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
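As a side note on the Python use case in the description, PySpark's RDD.map already accepts a preservesPartitioning flag, so a key-aware function can keep the existing partitioner; a minimal sketch, where f is the hypothetical value function from the description:
{code}
# g rewrites only the value and leaves the key untouched, so it is safe
# to declare that the partitioning is preserved.
g = lambda kv: (kv[0], f(kv[0], kv[1]))
result = rdd.map(g, preservesPartitioning=True)
{code}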
[jira] [Closed] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-15919. - > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream, and saves it as a text > file after a predefined time interval. In the function > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > The function getOutputPath() generates a new path every time the function is > called, depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that function is not called again for the next batch of the > stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15919. --- Resolution: Not A Problem Look at the implementation of DStream.saveAsTextFiles -- about all it does is call foreachRDD as I described. You can make this do whatever you like to name the file in your own code, but, you have to do something like this to achieve what you want. This JIRA should not be reopened. > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream, and saves it as a text > file after a predefined time interval. In the function > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > The function getOutputPath() generates a new path every time the function is > called, depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that function is not called again for the next batch of the > stream, although the files are being saved after each checkpoint interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
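A hedged Scala sketch of the {{foreachRDD}} approach described above; {{getOutputPath()}} stands in for the reporter's own path-generating function and is assumed to return a fresh, time-based prefix on every call. Because it is invoked inside the closure, it is re-evaluated for each batch.
{code}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

def saveWithFreshPrefix(stream: DStream[String], getOutputPath: () => String): Unit = {
  stream.foreachRDD { (rdd, time: Time) =>
    // getOutputPath() runs once per batch here, so the prefix changes every interval.
    rdd.repartition(1).saveAsTextFile(s"${getOutputPath()}-${time.milliseconds}")
  }
}
{code}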
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327255#comment-15327255 ] Sean Owen commented on SPARK-15904: --- Yeah it's coherent, though typically k << number of points. It would help to know more about how you're running, what slows down, what -verbose:gc says during this time, etc. It may be a problem with memory settings rather than some particular problem with this value of k. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 10G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alessio updated SPARK-15904: Description: Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory and Disk. Everything's fine, although at the end of K-Means, after the number of iterations, the cost function value and the running time there's a nice "Removing RDD from persistent list" stage. However, during this stage there's a high memory pressure. Weird, since RDDs are about to be removed. Full log of this stage: 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 49784.87126751288. 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence list 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence list 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My machine has an i5 hyperthreaded dual-core, thus [*] means 4. I'm launching this application though spark-submit with --driver-memory 9G was: Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory and Disk. Everything's fine, although at the end of K-Means, after the number of iterations, the cost function value and the running time there's a nice "Removing RDD from persistent list" stage. However, during this stage there's a high memory pressure. Weird, since RDDs are about to be removed. Full log of this stage: 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 49784.87126751288. 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence list 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence list 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My machine has an i5 hyperthreaded dual-core, thus [*] means 4. I'm launching this application though spark-submit with --driver-memory 10G > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. 
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15921) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajkumar Singh closed SPARK-15921. -- Resolution: Fixed > Spark unable to read partitioned table in avro format and column name in > upper case > --- > > Key: SPARK-15921 > URL: https://issues.apache.org/jira/browse/SPARK-15921 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.0 > Environment: Centos 6.6 > Spark 1.6 >Reporter: Rajkumar Singh > > Spark returns a null value if the field name is uppercase in a Hive Avro > partitioned table. > Reproduce: > {code} > [root@sandbox ~]# cat file1.csv > rks,2016 > [root@sandbox ~]# cat file2.csv > raj,2015 > hive> CREATE TABLE `sample_table`( > > `name` string) > > PARTITIONED BY ( > > `year` int) > > ROW FORMAT DELIMITED > > FIELDS TERMINATED BY ',' > > STORED AS INPUTFORMAT > > 'org.apache.hadoop.mapred.TextInputFormat' > > OUTPUTFORMAT > > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > > LOCATION > > 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table' > > TBLPROPERTIES ( > > 'transient_lastDdlTime'='1465816403') > > ; > load data local inpath '/root/file2.csv' overwrite into table sample_table > partition(year='2015'); > load data local inpath '/root/file1.csv' overwrite into table sample_table > partition(year='2016'); > hive> CREATE TABLE sample_table_uppercase > > PARTITIONeD BY ( YEAR INT) > > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > > TBLPROPERTIES ( > >'avro.schema.literal'='{ > > "namespace": "com.rishav.avro", > >"name": "student_marks", > >"type": "record", > > "fields": [ { "name":"NANME","type":"string"}] > > }'); > INSERT OVERWRITE TABLE sample_table_uppercase partition(Year) select > name,year from sample_table; > hive> select * from sample_table_uppercase; > OK > raj 2015 > rks 2016 > now using spark-shell > scala>val tbl = sqlContext.table("default.sample_table_uppercase"); > scala>tbl.show > +++ > |name|year| > +++ > |null|2015| > |null|2016| > +++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272 ] Alessio commented on SPARK-15904: - Dear Sean, I must certainly agree with you on k< High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272 ] Alessio edited comment on SPARK-15904 at 6/13/16 12:41 PM: --- Dear Sean, I must certainly agree with you on k< High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8546) PMML export for Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327167#comment-15327167 ] Radoslaw Gasiorek edited comment on SPARK-8546 at 6/13/16 12:43 PM: hi there, [~josephkb], [~apachespark] We would like to use MLlib-built models to classify outside Spark, and therefore without a Spark context available. We would like to export the models built in Spark into PMML format, which would then be read by a standalone Java application without a Spark context (but with the MLlib jar). The Java application would load the model from the PMML file and would use the model to 'predict' or rather 'classify' the new data we get. This feature would enable us to proceed without big architectural and operational changes; without this feature we might need to make the SparkContext available to the standalone application, which would be a bigger operational and architectural overhead. We might need to use plain Java serialization for the proof of concept anyway, but surely not for a productionized product. Can we prioritize this feature as well as https://issues.apache.org/jira/browse/SPARK-8542 and https://issues.apache.org/jira/browse/SPARK-8543 ? What would be the LOE and ETA for these? Thanks in advance for responses and feedback. was (Author: rgasiorek): hi there, [~josephkb] We would like to use MLlib-built models to classify outside Spark, and therefore without a Spark context available. We would like to export the models built in Spark into PMML format, which would then be read by a standalone Java application without a Spark context (but with the MLlib jar). The Java application would load the model from the PMML file and would use the model to 'predict' or rather 'classify' the new data we get. This feature would enable us to proceed without big architectural and operational changes; without this feature we might need to make the SparkContext available to the standalone application, which would be a bigger operational and architectural overhead. We might need to use plain Java serialization for the proof of concept anyway, but surely not for a productionized product. Can we prioritize this feature as well as https://issues.apache.org/jira/browse/SPARK-8542 and https://issues.apache.org/jira/browse/SPARK-8543 ? What would be the LOE and ETA for these? Thanks in advance for responses and feedback. > PMML export for Naive Bayes > --- > > Key: SPARK-8546 > URL: https://issues.apache.org/jira/browse/SPARK-8546 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Xusen Yin >Priority: Minor > > The naive Bayes section of the PMML standard can be found at > http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to > generate PMML for both binomial and multinomial naive Bayes models using > JPMML (maybe [~vfed] can help). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
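For context, a hedged Scala sketch of the PMML export path that already exists for some mllib models via {{PMMLExportable}} (for example {{KMeansModel}}); the request above is for {{NaiveBayesModel}} to gain the same {{toPMML}} support. The data set and output paths are placeholders.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def exportKMeansToPmml(sc: SparkContext, data: RDD[Vector]): Unit = {
  // Train a model that mixes in PMMLExportable, then write it out as PMML so a
  // standalone (non-Spark) JVM application can load and apply it.
  val model = KMeans.train(data, 3, 20)
  model.toPMML("/tmp/kmeans.pmml")           // local file path
  model.toPMML(sc, "hdfs:///models/kmeans")  // or a Hadoop-compatible path
}
{code}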
[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272 ] Alessio edited comment on SPARK-15904 at 6/13/16 12:44 PM: --- Dear Sean, I must certainly agree with you on k< High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272 ] Alessio edited comment on SPARK-15904 at 6/13/16 12:45 PM: --- Dear Sean, I must certainly agree with you on k< High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327369#comment-15327369 ] Sean Owen commented on SPARK-15904: --- -verbose:gc is a JVM option and should write to stderr. You'd definitely see it; it's pretty verbose. But, are you saying things are running out of memory or just referring to the RDDs being unpersisted? the latter is not necessarily a sign of memory shortage. What does memory pressure mean here? > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327397#comment-15327397 ] Alessio commented on SPARK-15904: - Dear [~srowen], at the beginning I noticed that "Cleaning RDD” phase (as in the original post) took a lot of time (10~15 minutes). So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I noticed the Memory Pressure indicator going crazy. The swap memory increases up to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and ready for the next K. Moreover, Spark does not stop the execution. I do not receive any “Out-of-memory” errors from either Java, Python or Spark. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327397#comment-15327397 ] Alessio edited comment on SPARK-15904 at 6/13/16 1:49 PM: -- Dear [~srowen], at the beginning I noticed that "Cleaning RDD” phase (as in the original post) took a lot of time (10~15 minutes). So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I noticed the Memory Pressure indicator going crazy. The swap memory increases up to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and ready for the next K. Moreover, Spark does not stop the execution. I do not receive any “Out-of-memory” errors from either Java, Python or Spark. Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). K-means just finished another run for K=6000. See the memory stat, all of these peaks under the Last 24 Hours sections are from Spark, after every K-Means run. After a couple of minutes, here's the screenshot (http://postimg.org/image/qc7re8clt/). The memory pressure indicator is going down, but Swap size is 10GB. If I wait a few more minutes, everything will be back to normal. was (Author: purple): Dear [~srowen], at the beginning I noticed that "Cleaning RDD” phase (as in the original post) took a lot of time (10~15 minutes). So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I noticed the Memory Pressure indicator going crazy. The swap memory increases up to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and ready for the next K. Moreover, Spark does not stop the execution. I do not receive any “Out-of-memory” errors from either Java, Python or Spark. Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). K-means just finished another run for K=6000. See the memory stat, all of these peaks under the Last 24 Hours sections are from Spark, after every K-Means run. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. 
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327405#comment-15327405 ] Sean Owen commented on SPARK-15904: --- Hm, but that only means Spark used a lot of memory, and you gave it permission to use a lot of memory -- too much, if you're swapping. That sounds like the problem to me. It's happily consuming memory you've told it is there, but it's really not. Swapping makes things go very slowly of course. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327397#comment-15327397 ] Alessio edited comment on SPARK-15904 at 6/13/16 1:48 PM: -- Dear [~srowen], at the beginning I noticed that "Cleaning RDD” phase (as in the original post) took a lot of time (10~15 minutes). So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I noticed the Memory Pressure indicator going crazy. The swap memory increases up to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and ready for the next K. Moreover, Spark does not stop the execution. I do not receive any “Out-of-memory” errors from either Java, Python or Spark. Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). K-means just finished another run for K=6000. See the memory stat, all of these peaks under the Last 24 Hours sections are from Spark, after every K-Means run. was (Author: purple): Dear [~srowen], at the beginning I noticed that "Cleaning RDD” phase (as in the original post) took a lot of time (10~15 minutes). So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I noticed the Memory Pressure indicator going crazy. The swap memory increases up to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and ready for the next K. Moreover, Spark does not stop the execution. I do not receive any “Out-of-memory” errors from either Java, Python or Spark. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327411#comment-15327411 ] Alessio commented on SPARK-15904: - This is absolutely weird to me. I gave Spark 9GB and during the K-Means execution, if I monitor the memory stat I can see that Spark/Java has 9GB (nice) and no Swap whatsoever. After K-means has reached convergence, during this last, cleaning stage everything goes wild. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327430#comment-15327430 ] Sean Owen commented on SPARK-15904: --- How much RAM does your machine have? 10GB heap means much more than 10GB physical memory in the JVM. Not to mention what the OS needs and all other apps that are running. If 9GB works OK, this pretty much demonstrates Spark is fine, and you overcommitting physical RAM is the problem. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327411#comment-15327411 ] Alessio edited comment on SPARK-15904 at 6/13/16 1:55 PM: -- This is absolutely weird to me. I gave Spark 9GB and during the K-Means execution, if I monitor the memory stat I can see that Spark/Java has 9GB (nice) and no Swap whatsoever. After K-means has reached convergence, during this last, cleaning stage everything goes wild. Also, for the sake of scalability, RDDs are persisted on memory *and disk*. So I can't really understand this pressure blowup. was (Author: purple): This is absolutely weird to me. I gave Spark 9GB and during the K-Means execution, if I monitor the memory stat I can see that Spark/Java has 9GB (nice) and no Swap whatsoever. After K-means has reached convergence, during this last, cleaning stage everything goes wild. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327438#comment-15327438 ] Alessio commented on SPARK-15904: - My machine has 16GB of RAM. I also tried closing all the other apps, leaving just the Terminal with Spark running. Still no luck. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15904. --- Resolution: Not A Problem Memory and disk still means it's also persisting in memory. I think you'll see the physical memory used by the JVM is much more than 10GB. Because it works fine with _less_ RAM, this really has to be the issue. You should never be swapping. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
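To make the persistence point above concrete, a hedged Scala sketch: {{MEMORY_AND_DISK}} still fills the heap first and only spills what does not fit, so a serialized storage level is one way to trade CPU for a smaller heap footprint. The path and the parsing step are placeholders.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def loadAndPersist(sc: SparkContext): RDD[Vector] = {
  val data = sc.textFile("/path/to/data.csv")
    .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  // Serialized caching keeps far fewer Java objects on the heap than the default
  // deserialized MEMORY_AND_DISK level, at the cost of extra (de)serialization.
  data.persist(StorageLevel.MEMORY_AND_DISK_SER)
  data
}
{code}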
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327443#comment-15327443 ] Alessio commented on SPARK-15904: - Correct. Memory and Disk gives priority to Memory...but my dataset is 400MB so it shouldn't be a problem. If I give Spark less RAM (I tried with 4GB and 8GB) Java throws the Out-of-memory error for K>3000. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327449#comment-15327449 ] Sean Owen commented on SPARK-15904: --- It's not your 400MB data set that is the only thing in memory or using memory. OK, that's new information, but, you're also just saying that large k needs more memory. At the moment it's not clear whether it's unreasonably high, or due to Spark or your code. What ran out of memory? > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error
Charlie Evans created SPARK-15922: - Summary: BlockMatrix to IndexedRowMatrix throws an error Key: SPARK-15922 URL: https://issues.apache.org/jira/browse/SPARK-15922 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.0 Reporter: Charlie Evans import org.apache.spark.mllib.linalg.distributed._ import org.apache.spark.mllib.linalg._ val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil val rdd = sc.parallelize(rows) val matrix = new IndexedRowMatrix(rdd, 3, 3) val bmat = matrix.toBlockMatrix val imat = bmat.toIndexedRowMatrix imat.rows.collect // this throws an error - Caused by: java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error
[ https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charlie Evans updated SPARK-15922: -- Description: {code} import org.apache.spark.mllib.linalg.distributed._ import org.apache.spark.mllib.linalg._ val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil val rdd = sc.parallelize(rows) val matrix = new IndexedRowMatrix(rdd, 3, 3) val bmat = matrix.toBlockMatrix val imat = bmat.toIndexedRowMatrix imat.rows.collect // this throws an error - Caused by: java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length! was: import org.apache.spark.mllib.linalg.distributed._ import org.apache.spark.mllib.linalg._ val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil val rdd = sc.parallelize(rows) val matrix = new IndexedRowMatrix(rdd, 3, 3) val bmat = matrix.toBlockMatrix val imat = bmat.toIndexedRowMatrix imat.rows.collect // this throws an error - Caused by: java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length! > BlockMatrix to IndexedRowMatrix throws an error > --- > > Key: SPARK-15922 > URL: https://issues.apache.org/jira/browse/SPARK-15922 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Charlie Evans > > {code} > import org.apache.spark.mllib.linalg.distributed._ > import org.apache.spark.mllib.linalg._ > val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, > new DenseVector(Array(1,2,3))):: IndexedRow(2L, new > DenseVector(Array(1,2,3))):: Nil > val rdd = sc.parallelize(rows) > val matrix = new IndexedRowMatrix(rdd, 3, 3) > val bmat = matrix.toBlockMatrix > val imat = bmat.toIndexedRowMatrix > imat.rows.collect // this throws an error - Caused by: > java.lang.IllegalArgumentException: requirement failed: Vectors must be the > same length! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error
[ https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charlie Evans updated SPARK-15922: -- Description: {code} import org.apache.spark.mllib.linalg.distributed._ import org.apache.spark.mllib.linalg._ val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil val rdd = sc.parallelize(rows) val matrix = new IndexedRowMatrix(rdd, 3, 3) val bmat = matrix.toBlockMatrix val imat = bmat.toIndexedRowMatrix imat.rows.collect // this throws an error - Caused by: java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length! {code} was: {code} import org.apache.spark.mllib.linalg.distributed._ import org.apache.spark.mllib.linalg._ val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil val rdd = sc.parallelize(rows) val matrix = new IndexedRowMatrix(rdd, 3, 3) val bmat = matrix.toBlockMatrix val imat = bmat.toIndexedRowMatrix imat.rows.collect // this throws an error - Caused by: java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length! > BlockMatrix to IndexedRowMatrix throws an error > --- > > Key: SPARK-15922 > URL: https://issues.apache.org/jira/browse/SPARK-15922 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Charlie Evans > > {code} > import org.apache.spark.mllib.linalg.distributed._ > import org.apache.spark.mllib.linalg._ > val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, > new DenseVector(Array(1,2,3))):: IndexedRow(2L, new > DenseVector(Array(1,2,3))):: Nil > val rdd = sc.parallelize(rows) > val matrix = new IndexedRowMatrix(rdd, 3, 3) > val bmat = matrix.toBlockMatrix > val imat = bmat.toIndexedRowMatrix > imat.rows.collect // this throws an error - Caused by: > java.lang.IllegalArgumentException: requirement failed: Vectors must be the > same length! > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order
[ https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15918: -- Fix Version/s: (was: 1.6.1) Don't set fix version; 1.6.1 wouldn't make sense anyway. > unionAll returns wrong result when two dataframes has schema in different > order > --- > > Key: SPARK-15918 > URL: https://issues.apache.org/jira/browse/SPARK-15918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: CentOS >Reporter: Prabhu Joseph > > On applying unionAll operation between A and B dataframes, they both has same > schema but in different order and hence the result has column value mapping > changed. > Repro: > {code} > A.show() > +---++---+--+--+-++---+--+---+---+-+ > |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---++---+--+--+-++---+--+---+---+-+ > +---++---+--+--+-++---+--+---+---+-+ > B.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+--+---+---+--++ > |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT718.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT703Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUR716A.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT803Z.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT728.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUR806.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > +-+---+--+---+---+--+--+--+---+---+--++ > A = A.unionAll(B) > A.show() > +---+---+--+--+--+-++---+--+---+---+-+ > |tag| year_day| > tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value| > +---+---+--+--+--+-++---+--+---+---+-+ > | F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > | F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0| 10| 0| 275| > 2015| 1.2345|2015275.0| > +---+---+--+--+--+-++---+--+---+---+-+ > {code} > On changing the schema of A according to B and doing unionAll works fine > {code} > C = > A.select("dtype","tag","time","tm_hour","tm_mday","tm_min",”tm_mon”,"tm_sec","tm_yday","tm_year","value","year_day") > A = C.unionAll(B) > A.show() > +-+---+--+---+---+--+--+--+---+---+--++ > |dtype|tag| > time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day| > +-+---+--+---+---+--+--+--+---+---+--++ > |F|C_FNHXUT701Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUDP713.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F| C_FNHXUT718.CNSTHI|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 2015275| > |F|C_FNHXUT703Z.CNSTLO|1443790800| 13| 2| 0|10| 0| > 275| 2015|1.2345| 20152
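The workaround in the description above can also be written generically: unionAll in 1.6 (and union in 2.0) resolves columns by position, not by name, so it is enough to project one side into the other's column order before the union. A minimal PySpark sketch, reusing the A and B frames from the description and assuming both share the same column names:
{code}
# Project A into B's column order first; unionAll matches columns strictly by
# position, so the projection is what keeps values under the right headers.
C = A.select(*B.columns)
result = C.unionAll(B)
result.show()
{code}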
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327476#comment-15327476 ] Alessio commented on SPARK-15904: - If anyone's interested, the dataset I'm working on is freely available from UCI ML Repository (http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). I tried just now running the above K-Means for K=9120, with --driver-memory 4G. The full traceback can be found here (https://ghostbin.com/paste/9pu9k). The code is absolutely simple, I don't think there's nothing wrong with it: sc = SparkContext("local[*]", "Spark K-Means") data = sc.textFile() parsedData = data.map(lambda line: array([float(x) for x in line.split(',')])) parsedDataNOID=parsedData.map(lambda pattern: pattern[1:]) parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK) K_CANDIDATES= initCentroids=scipy.io.loadmat(<.mat file with initial seeds>) datatmp=numpy.genfromtxt(,delimiter=",") for K in K_CANDIDATES: clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1, epsilon=0.0, initialModel = KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1,:])) > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327510#comment-15327510 ] Sean Owen commented on SPARK-15904: --- Yes, that just means "out of memory". The question is whether this is unusual or not. You might try storing the serialized representation in memory, not the 'raw' object form, which is often bigger. You almost certainly need more partitions in the source data, since I expect it's just 1 or 2 partitions according to the block size, but, you probably want the problem to be broken down into smaller chunks rather than process big chunks at once in memory. It's the second arg to textFile. Finally you may get better results with 2.0, or, by using the ML + Dataset APIs. Those are bigger changes though. > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
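A minimal PySpark sketch of the two suggestions above (the input path and the partition count are placeholders, not values taken from this report):
{code}
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "Spark K-Means")

# The second argument to textFile is a minimum number of partitions, so the
# input is split into smaller chunks than the default block-based partitioning.
data = sc.textFile("/path/to/dataset.csv", 48)
parsedData = data.map(lambda line: [float(x) for x in line.split(',')][1:])

# Persist the serialized representation rather than raw objects; it is usually
# considerably more compact in memory.
parsedData.persist(StorageLevel.MEMORY_AND_DISK_SER)
{code}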
[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327476#comment-15327476 ] Alessio edited comment on SPARK-15904 at 6/13/16 2:48 PM: -- If anyone's interested, the dataset I'm working on is freely available from UCI ML Repository (http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). I tried just now running the above K-Means for K=9120, with --driver-memory 4G. The full traceback can be found here (https://ghostbin.com/paste/9pu9k). The code is absolutely simple, I don't think there's something wrong with it: sc = SparkContext("local[*]", "Spark K-Means") data = sc.textFile() parsedData = data.map(lambda line: array([float(x) for x in line.split(',')])) parsedDataNOID=parsedData.map(lambda pattern: pattern[1:]) parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK) K_CANDIDATES= initCentroids=scipy.io.loadmat(<.mat file with initial seeds>) datatmp=numpy.genfromtxt(,delimiter=",") for K in K_CANDIDATES: clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1, epsilon=0.0, initialModel = KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1,:])) was (Author: purple): If anyone's interested, the dataset I'm working on is freely available from UCI ML Repository (http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). I tried just now running the above K-Means for K=9120, with --driver-memory 4G. The full traceback can be found here (https://ghostbin.com/paste/9pu9k). The code is absolutely simple, I don't think there's nothing wrong with it: sc = SparkContext("local[*]", "Spark K-Means") data = sc.textFile() parsedData = data.map(lambda line: array([float(x) for x in line.split(',')])) parsedDataNOID=parsedData.map(lambda pattern: pattern[1:]) parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK) K_CANDIDATES= initCentroids=scipy.io.loadmat(<.mat file with initial seeds>) datatmp=numpy.genfromtxt(,delimiter=",") for K in K_CANDIDATES: clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1, epsilon=0.0, initialModel = KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1,:])) > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. 
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means
[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327542#comment-15327542 ] Alessio commented on SPARK-15904: - With the --driver-memory 4G switch I've tried both. With no luck. At first I changed the storage level to serialized, then I also increased the number of partitions (from 12 - default - to 20). Still "out of memory". I guess I'll wait for 2.0 > High Memory Pressure using MLlib K-means > > > Key: SPARK-15904 > URL: https://issues.apache.org/jira/browse/SPARK-15904 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 > Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB > of RAM. >Reporter: Alessio >Priority: Minor > > Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on > Memory and Disk. > Everything's fine, although at the end of K-Means, after the number of > iterations, the cost function value and the running time there's a nice > "Removing RDD from persistent list" stage. However, during this stage > there's a high memory pressure. Weird, since RDDs are about to be removed. > Full log of this stage: > 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations > 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds. > 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations. > 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is > 49784.87126751288. > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781 > 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from > persistence list > 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780 > I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. > My machine has an i5 hyperthreaded dual-core, thus [*] means 4. > I'm launching this application though spark-submit with --driver-memory 9G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15118) spark couldn't get hive properyties in hive-site.xml
[ https://issues.apache.org/jira/browse/SPARK-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327674#comment-15327674 ] Herman van Hovell commented on SPARK-15118: --- [~eksmile] any update on this? > spark couldn't get hive properyties in hive-site.xml > - > > Key: SPARK-15118 > URL: https://issues.apache.org/jira/browse/SPARK-15118 > Project: Spark > Issue Type: Bug > Components: Block Manager, Deploy >Affects Versions: 1.6.1 > Environment: hadoop-2.7.1.tar.gz; > apache-hive-2.0.0-bin.tar.gz; > spark-1.6.1-bin-hadoop2.6.tgz; > scala-2.11.8.tgz >Reporter: eksmile >Priority: Blocker > > I have three question. > First: > I've already put "hive-site.xml" in $SPARK_HOME/conf, but when I run > spark-sql, it tell me "HiveConf of name *** does not exist", and repeat many > times. > All of these "HiveConf" are in "hive-site.xml", why these warnings appear? > I'm not sure this is a bug or not. > Second: > In the middle of logs as follow, there's a paragraph : "Failed to get > database default, returning NoSuchObjectException", > I don't know is there something worng? > Third: > In the middle of logs, there's a paragraph : " metastore.MetaStoreDirectSql: > Using direct SQL, underlying DB is DERBY", > but, in the end of logs, there's a paragraph : "metastore.MetaStoreDirectSql: > Using direct SQL, underlying DB is MYSQL" > My Hive metastore is MYSQL. Is this something wrong? > spark-env.sh as follow: > export JAVA_HOME=/usr/java/jdk1.8.0_73 > export SCALA_HOME=/home/scala > export SPARK_MASTER_IP=192.168.124.129 > export SPARK_WORKER_MEMORY=1g > export HADOOP_CONF_DIR=/usr/hadoop/etc/hadoop > export HIVE_HOME=/opt/hive > export HIVE_CONF_DIR=/opt/hive/conf > export > SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/hive/lib/mysql-connector-java-5.1.38-bin.jar > export HADOOP_HOME=/usr/hadoop > Thanks for reading > Here're the logs: > [yezt@Master spark]$ bin/spark-sql --master spark://master:7077 > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.metastore.hbase.aggregate.stats.false.positive.probability does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.io.orc.time.counters does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.server2.metrics.enabled does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.am.liveness.connection.timeout.ms does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.server2.thrift.client.connect.retry.limit does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.io.allocator.direct does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.auto.enforce.stats does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.client.consistent.splits does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.server2.tez.session.lifetime does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.timedout.txn.reaper.start does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.metastore.hbase.cache.ttl does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.management.acl does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.daemon.delegation.token.lifetime does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.strict.checks.large.query does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > 
hive.llap.io.allocator.alloc.min does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.server2.thrift.client.user does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.daemon.wait.queue.comparator.class.name does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.daemon.am.liveness.heartbeat.interval.ms does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.object.cache.enabled does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.server2.webui.use.ssl does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name hive.metastore.local > does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.service.metrics.file.location does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.server2.thrift.client.retry.delay.seconds does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.llap.daemon.num.file.cleaner.threads does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name > hive.test.fail.compaction does not exist > 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of
[jira] [Commented] (SPARK-15370) Some correlated subqueries return incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-15370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327683#comment-15327683 ] Luciano Resende commented on SPARK-15370: - [~hvanhovell] You might need to add [~freiss] to contributor group in Spark jira admin console in order to assign the ticket to Fred. If you don't have access to it, maybe [~rxin] might be able to help sort this out. > Some correlated subqueries return incorrect answers > --- > > Key: SPARK-15370 > URL: https://issues.apache.org/jira/browse/SPARK-15370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Frederick Reiss > > The rewrite introduced in SPARK-14785 has the COUNT bug. The rewrite changes > the semantics of some correlated subqueries when there are tuples from the > outer query block that do not join with the subquery. For example: > {noformat} > spark-sql> create table R(a integer) as values (1); > spark-sql> create table S(b integer); > spark-sql> select R.a from R > > where (select count(*) from S where R.a = S.b) = 0; > Time taken: 2.139 seconds > > spark-sql> > (returns zero rows; the answer should be one row of '1') > {noformat} > This problem also affects the SELECT clause: > {noformat} > spark-sql> select R.a, > > (select count(*) from S where R.a = S.b) as cnt > > from R; > 1 NULL > (the answer should be "1 0") > {noformat} > Some subqueries with COUNT aggregates are *not* affected: > {noformat} > spark-sql> select R.a from R > > where (select count(*) from S where R.a = S.b) > 0; > Time taken: 0.609 seconds > spark-sql> > (Correct answer) > spark-sql> select R.a from R > > where (select count(*) + sum(S.b) from S where R.a = S.b) = 0; > Time taken: 0.553 seconds > spark-sql> > (Correct answer) > {noformat} > Other cases can trigger the variant of the COUNT bug for expressions > involving NULL checks: > {noformat} > spark-sql> select R.a from R > > where (select sum(S.b) is null from S where R.a = S.b); > (returns zero rows, should return one row) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
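Not from the ticket itself, but a common manual workaround for the COUNT bug while the rewrite is being fixed is to replace the correlated COUNT with an outer join plus COALESCE, so outer rows with no match get 0 instead of NULL. A sketch against the R and S tables above, run through a SparkSession named {{spark}}:
{code}
# Equivalent of: select R.a, (select count(*) from S where R.a = S.b) from R,
# written without a correlated subquery; unmatched rows become 0 via COALESCE.
spark.sql("""
    SELECT R.a, COALESCE(t.cnt, 0) AS cnt
    FROM R
    LEFT OUTER JOIN (SELECT b, COUNT(*) AS cnt FROM S GROUP BY b) t
      ON R.a = t.b
""").show()
{code}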
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327703#comment-15327703 ] Herman van Hovell commented on SPARK-15822: --- [~robbinspg] You can dump the plan to the console by using calling {{explain(true)}} on a DataFrame or by prepending {{EXPLAIN EXTENDED ...}} to your SQL statement. > segmentation violation in o.a.s.unsafe.types.UTF8String > > > Key: SPARK-15822 > URL: https://issues.apache.org/jira/browse/SPARK-15822 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: linux amd64 > openjdk version "1.8.0_91" > OpenJDK Runtime Environment (build 1.8.0_91-b14) > OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode) >Reporter: Pete Robbins >Assignee: Herman van Hovell >Priority: Blocker > > Executors fail with segmentation violation while running application with > spark.memory.offHeap.enabled true > spark.memory.offHeap.size 512m > Also now reproduced with > spark.memory.offHeap.enabled false > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 > # > # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 4816 C2 > org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I > (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] > {noformat} > We initially saw this on IBM java on PowerPC box but is recreatable on linux > with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the > same code point: > {noformat} > 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) > java.lang.NullPointerException > at > org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) > at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) > at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) > at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > 
at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
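For reference, the two ways of dumping the plan mentioned above look like this from PySpark (the DataFrame and the query are placeholders):
{code}
# Print the parsed, analyzed, optimized and physical plans for a DataFrame.
df.explain(True)

# The same information for a raw SQL statement.
spark.sql("EXPLAIN EXTENDED SELECT * FROM some_table").show(truncate=False)
{code}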
[jira] [Commented] (SPARK-15902) Add a deprecation warning for Python 2.6
[ https://issues.apache.org/jira/browse/SPARK-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327723#comment-15327723 ] Krishna Kalyan commented on SPARK-15902: Hi [~holdenk], I have some questions, where do I add this warning? [here below](https://github.com/apache/spark/blob/master/python/pyspark/context.py) I need to add something like {code} if sys.version < 2.6: warnings.warn("Deprecated in 2.1.0. Use Python 2.7+ instead", DeprecationWarning) {code} Thanks > Add a deprecation warning for Python 2.6 > > > Key: SPARK-15902 > URL: https://issues.apache.org/jira/browse/SPARK-15902 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: holdenk >Priority: Minor > > As we move to Python 2.7+ in Spark 2.1+ it would be good to add a deprecation > warning if we detect we are running in Python 2.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
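A self-contained sketch of what such a check might look like; note that {{sys.version}} is a string, so version tests usually go through {{sys.version_info}}. The placement in context.py, the threshold and the message wording here are assumptions, not the committed change:
{code}
import sys
import warnings

# Warn once at start-up if the running interpreter is older than Python 2.7.
if sys.version_info < (2, 7):
    warnings.warn(
        "Support for Python 2.6 is deprecated and may be removed in a "
        "future release; please use Python 2.7 or later.",
        DeprecationWarning)
{code}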
[jira] [Created] (SPARK-15923) Spark Application rest api returns "no such app: "
Yesha Vora created SPARK-15923: -- Summary: Spark Application rest api returns "no such app: " Key: SPARK-15923 URL: https://issues.apache.org/jira/browse/SPARK-15923 Project: Spark Issue Type: Bug Affects Versions: 1.6.1 Reporter: Yesha Vora Env : secure cluster Scenario: * Run SparkPi application in yarn-client or yarn-cluster mode * After application finishes, check Spark HS rest api to get details like jobs / executor etc. {code} http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code} Rest api return HTTP Code: 404 and prints "HTTP Data: no such app: application_1465778870517_0001" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15923) Spark Application rest api returns "no such app: "
[ https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327750#comment-15327750 ] Sean Owen commented on SPARK-15923: --- [~tgraves] or [~ste...@apache.org] will probably know better, but I'm not sure all of that is the app ID? > Spark Application rest api returns "no such app: " > - > > Key: SPARK-15923 > URL: https://issues.apache.org/jira/browse/SPARK-15923 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora > > Env : secure cluster > Scenario: > * Run SparkPi application in yarn-client or yarn-cluster mode > * After application finishes, check Spark HS rest api to get details like > jobs / executor etc. > {code} > http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code} > > Rest api return HTTP Code: 404 and prints "HTTP Data: no such app: > application_1465778870517_0001" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
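For what it's worth, the history server's REST API accepts either /applications/[app-id]/... or /applications/[app-id]/[attempt-id]/... depending on whether the application has attempt ids, so listing /applications first and then probing both forms shows which one this server recognises. A rough sketch; the host is a placeholder and the requests package is assumed to be available:
{code}
import requests

base = "http://<SHS host>:18080/api/v1"
app = "application_1465778870517_0001"

# Which applications (and attempt ids) does the history server actually know about?
print(requests.get(base + "/applications").text)

# Try the executors endpoint with and without the attempt id.
for url in (base + "/applications/" + app + "/executors",
            base + "/applications/" + app + "/1/executors"):
    resp = requests.get(url)
    print(url, resp.status_code)
{code}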
[jira] [Resolved] (SPARK-15814) Aggregator can return null result
[ https://issues.apache.org/jira/browse/SPARK-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-15814. --- Resolution: Resolved > Aggregator can return null result > - > > Key: SPARK-15814 > URL: https://issues.apache.org/jira/browse/SPARK-15814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15163) Mark experimental algorithms experimental in PySpark
[ https://issues.apache.org/jira/browse/SPARK-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327748#comment-15327748 ] Krishna Kalyan commented on SPARK-15163: Hi [~holdenk], Is this task still up for grabs? Thanks > Mark experimental algorithms experimental in PySpark > > > Key: SPARK-15163 > URL: https://issues.apache.org/jira/browse/SPARK-15163 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > While we are going through them anyway, we might as well mark as experimental the PySpark > algorithms that are marked so in Scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15924) SparkR parser bug with backslash in comments
Xuan Wang created SPARK-15924: - Summary: SparkR parser bug with backslash in comments Key: SPARK-15924 URL: https://issues.apache.org/jira/browse/SPARK-15924 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.6.1 Reporter: Xuan Wang When I run an R cell with the following comments: {code} # p <- p + scale_fill_manual(values = set2[groups]) # # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) # p {code} I get the following error message Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : :16:1: unexpected input 15: # p <- p + scale_x_date(labels = date_format("%m/%d 16: %a")) ^ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15924) SparkR parser bug with backslash in comments
[ https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Wang updated SPARK-15924: -- Description: When I run an R cell with the following comments: {code} # p <- p + scale_fill_manual(values = set2[groups]) # # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) # p {code} I get the following error message {quote} Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : :16:1: unexpected input 15: # p <- p + scale_x_date(labels = date_format("%m/%d 16: %a")) ^ {quote} was: When I run an R cell with the following comments: {code} # p <- p + scale_fill_manual(values = set2[groups]) # # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) # p {code} I get the following error message Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : :16:1: unexpected input 15: # p <- p + scale_x_date(labels = date_format("%m/%d 16: %a")) ^ > SparkR parser bug with backslash in comments > > > Key: SPARK-15924 > URL: https://issues.apache.org/jira/browse/SPARK-15924 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Xuan Wang > > When I run an R cell with the following comments: > {code} > # p <- p + scale_fill_manual(values = set2[groups]) > # # p <- p + scale_fill_brewer(palette = "Set2") + > scale_color_brewer(palette = "Set2") > # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) > # p > {code} > I get the following error message > {quote} > Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : > :16:1: unexpected input > 15: # p <- p + scale_x_date(labels = date_format("%m/%d > 16: %a")) > ^ > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15924) SparkR parser bug with backslash in comments
[ https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Wang updated SPARK-15924: -- Description: When I run an R cell with the following comments: {code} # p <- p + scale_fill_manual(values = set2[groups]) # # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) # p {code} I get the following error message {quote} Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : :16:1: unexpected input 15: # p <- p + scale_x_date(labels = date_format("%m/%d 16: %a")) ^ {quote} After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine. was: When I run an R cell with the following comments: {code} # p <- p + scale_fill_manual(values = set2[groups]) # # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) # p {code} I get the following error message {quote} Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : :16:1: unexpected input 15: # p <- p + scale_x_date(labels = date_format("%m/%d 16: %a")) ^ {quote} > SparkR parser bug with backslash in comments > > > Key: SPARK-15924 > URL: https://issues.apache.org/jira/browse/SPARK-15924 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Xuan Wang > > When I run an R cell with the following comments: > {code} > # p <- p + scale_fill_manual(values = set2[groups]) > # # p <- p + scale_fill_brewer(palette = "Set2") + > scale_color_brewer(palette = "Set2") > # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) > # p > {code} > I get the following error message > {quote} > Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : > :16:1: unexpected input > 15: # p <- p + scale_x_date(labels = date_format("%m/%d > 16: %a")) > ^ > {quote} > After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15924) SparkR parser bug with backslash in comments
[ https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Wang updated SPARK-15924: -- Description: When I run an R cell with the following comments: {code} # p <- p + scale_fill_manual(values = set2[groups]) # # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) # p {code} I get the following error message {quote} :16:1: unexpected input 15: # p <- p + scale_x_date(labels = date_format("%m/%d 16: %a")) ^ {quote} After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine. was: When I run an R cell with the following comments: {code} # p <- p + scale_fill_manual(values = set2[groups]) # # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) # p {code} I get the following error message {quote} Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : :16:1: unexpected input 15: # p <- p + scale_x_date(labels = date_format("%m/%d 16: %a")) ^ {quote} After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine. > SparkR parser bug with backslash in comments > > > Key: SPARK-15924 > URL: https://issues.apache.org/jira/browse/SPARK-15924 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Xuan Wang > > When I run an R cell with the following comments: > {code} > # p <- p + scale_fill_manual(values = set2[groups]) > # # p <- p + scale_fill_brewer(palette = "Set2") + > scale_color_brewer(palette = "Set2") > # p <- p + scale_x_date(labels = date_format("%m/%d\n%a")) > # p > {code} > I get the following error message > {quote} > :16:1: unexpected input > 15: # p <- p + scale_x_date(labels = date_format("%m/%d > 16: %a")) > ^ > {quote} > After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15913) Dispatcher.stopped should be enclosed by synchronized block.
[ https://issues.apache.org/jira/browse/SPARK-15913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-15913. Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > Dispatcher.stopped should be enclosed by synchronized block. > > > Key: SPARK-15913 > URL: https://issues.apache.org/jira/browse/SPARK-15913 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > Dispatcher.stopped is guarded by `this`, but it is used without > synchronization in `postMessage` function. This issue fixes this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15826) PipedRDD to allow configurable char encoding (default: UTF-8)
[ https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated SPARK-15826: Summary: PipedRDD to allow configurable char encoding (default: UTF-8) (was: PipedRDD to strictly use UTF-8 and not rely on default encoding) > PipedRDD to allow configurable char encoding (default: UTF-8) > - > > Key: SPARK-15826 > URL: https://issues.apache.org/jira/browse/SPARK-15826 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Tejas Patil >Priority: Trivial > > Encountered an issue wherein the code works in some cluster but fails on > another one for the same input. After debugging realised that PipedRDD is > picking default char encoding from the JVM which may be different across > different platforms. Making it use UTF-8 encoding just like > `ScriptTransformation` does. > Stack trace: > {noformat} > Caused by: java.nio.charset.MalformedInputException: Input length = 1 > at java.nio.charset.CoderResult.throwException(CoderResult.java:281) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67) > at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) > at > org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868) > at > org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327803#comment-15327803 ] Herman van Hovell commented on SPARK-15345: --- [~m1lan] Just to be sure, is this the actual code you have copy & pasted here? There is a typo in {{conf = SparkConrf()}}, should be {{conf = SparkConf()}}. > SparkSession's conf doesn't take effect when there's already an existing > SparkContext > - > > Key: SPARK-15345 > URL: https://issues.apache.org/jira/browse/SPARK-15345 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Piotr Milanowski >Assignee: Reynold Xin >Priority: Blocker > Fix For: 2.0.0 > > > I am working with branch-2.0, spark is compiled with hive support (-Phive and > -Phvie-thriftserver). > I am trying to access databases using this snippet: > {code} > from pyspark.sql import HiveContext > hc = HiveContext(sc) > hc.sql("show databases").collect() > [Row(result='default')] > {code} > This means that spark doesn't find any databases specified in configuration. > Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark > 1.6, and launching above snippet, I can print out existing databases. > When run in DEBUG mode this is what spark (2.0) prints out: > {code} > 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases > 16/05/16 12:17:47 DEBUG SimpleAnalyzer: > === Result of Batch Resolution === > !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, > string])) null else input[0, string].toString, > StructField(result,StringType,false)), result#2) AS #3] Project > [createexternalrow(if (isnull(result#2)) null else result#2.toString, > StructField(result,StringType,false)) AS #3] > +- LocalRelation [result#2] > > +- LocalRelation [result#2] > > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.Dataset$$anonfun$53) +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: private final > org.apache.spark.sql.types.StructType > org.apache.spark.sql.Dataset$$anonfun$53.structType$1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) > +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects!
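For context, the 2.0-style entry point the issue title refers to: a minimal sketch of building a Hive-enabled session and listing databases. The application name and the config key below are illustrative, not taken from the report:
{code}
from pyspark.sql import SparkSession

# Options passed via config() before getOrCreate() are exactly the settings
# this issue says may be ignored when a SparkContext already exists.
spark = SparkSession.builder \
    .appName("show-databases") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("show databases").show()
{code}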
[jira] [Commented] (SPARK-15666) Join on two tables generated from a same table throwing query analyzer issue
[ https://issues.apache.org/jira/browse/SPARK-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327818#comment-15327818 ] Herman van Hovell commented on SPARK-15666: --- [~mkbond777] Is this also a problem on 2.0? Any chance you could provide a reproducible example? > Join on two tables generated from a same table throwing query analyzer issue > > > Key: SPARK-15666 > URL: https://issues.apache.org/jira/browse/SPARK-15666 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: AWS EMR >Reporter: Manish Kumar >Priority: Blocker > > If two dataframes (named leftdf and rightdf) which are created by performimg > some opeartions on a single dataframe are joined then we are getting some > analyzer issue: > leftdf schema > {noformat} > root > |-- affinity_monitor_copay: string (nullable = true) > |-- affinity_monitor_digital_pull: string (nullable = true) > |-- affinity_monitor_digital_push: string (nullable = true) > |-- affinity_monitor_direct: string (nullable = true) > |-- affinity_monitor_peer: string (nullable = true) > |-- affinity_monitor_peer_interaction: string (nullable = true) > |-- affinity_monitor_personal_f2f: string (nullable = true) > |-- affinity_monitor_personal_remote: string (nullable = true) > |-- affinity_monitor_sample: string (nullable = true) > |-- affinity_monitor_voucher: string (nullable = true) > |-- afltn_id: string (nullable = true) > |-- attribute_2_value: string (nullable = true) > |-- brand: string (nullable = true) > |-- city: string (nullable = true) > |-- cycle_time_id: integer (nullable = true) > |-- full_name: string (nullable = true) > |-- hcp: string (nullable = true) > |-- like17_mg17_metric114_aggregated: double (nullable = true) > |-- like17_mg17_metric118_aggregated: double (nullable = true) > |-- metric_group_sk: integer (nullable = true) > |-- metrics: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- hcp: string (nullable = true) > |||-- brand: string (nullable = true) > |||-- rep: string (nullable = true) > |||-- month: string (nullable = true) > |||-- metric117: string (nullable = true) > |||-- metric114: string (nullable = true) > |||-- metric118: string (nullable = true) > |||-- specialty_1: string (nullable = true) > |||-- full_name: string (nullable = true) > |||-- pri_st: string (nullable = true) > |||-- city: string (nullable = true) > |||-- zip_code: string (nullable = true) > |||-- prsn_id: string (nullable = true) > |||-- afltn_id: string (nullable = true) > |||-- npi_id: string (nullable = true) > |||-- affinity_monitor_sample: string (nullable = true) > |||-- affinity_monitor_personal_f2f: string (nullable = true) > |||-- affinity_monitor_peer: string (nullable = true) > |||-- affinity_monitor_copay: string (nullable = true) > |||-- affinity_monitor_digital_push: string (nullable = true) > |||-- affinity_monitor_voucher: string (nullable = true) > |||-- affinity_monitor_direct: string (nullable = true) > |||-- affinity_monitor_peer_interaction: string (nullable = true) > |||-- affinity_monitor_digital_pull: string (nullable = true) > |||-- affinity_monitor_personal_remote: string (nullable = true) > |||-- attribute_2_value: string (nullable = true) > |||-- metric211: double (nullable = false) > |-- mg17_metric117_3: double (nullable = true) > |-- mg17_metric117_3_actual_metric: double (nullable = true) > |-- mg17_metric117_3_planned_metric: double (nullable = true) > |-- mg17_metric117_D_suggestion_id: integer (nullable = 
true) > |-- mg17_metric117_D_suggestion_text: string (nullable = true) > |-- mg17_metric117_D_suggestion_text_raw: string (nullable = true) > |-- mg17_metric117_exp_score: integer (nullable = true) > |-- mg17_metric117_severity_index: double (nullable = true) > |-- mg17_metric117_test: integer (nullable = true) > |-- mg17_metric211_P_suggestion_id: integer (nullable = true) > |-- mg17_metric211_P_suggestion_text: string (nullable = true) > |-- mg17_metric211_P_suggestion_text_raw: string (nullable = true) > |-- mg17_metric211_aggregated: double (nullable = false) > |-- mg17_metric211_deviationfrompeers_p_value: double (nullable = true) > |-- mg17_metric211_deviationfromtrend_current_mu: double (nullable = true) > |-- mg17_metric211_deviationfromtrend_p_value: double (nullable = true) > |-- mg17_metric211_deviationfromtrend_previous_mu: double (nullable = true) > |-- mg17_metric211_exp_score: integer (nullable = tru