[jira] [Created] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-13 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-15915:
-

 Summary: CacheManager should use canonicalized plan for 
planToCache.
 Key: SPARK-15915
 URL: https://issues.apache.org/jira/browse/SPARK-15915
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


A {{DataFrame}} whose plan overrides {{sameResult}} but does not compare 
canonicalized plans cannot be cached via {{cacheTable}}.

For example:

{code}
val localRelation = Seq(1, 2, 3).toDF()
localRelation.createOrReplaceTempView("localRelation")

spark.catalog.cacheTable("localRelation")
assert(
  localRelation.queryExecution.withCachedData.collect {
case i: InMemoryRelation => i
  }.size == 1)
{code}

and this will fail as:

{noformat}
ArrayBuffer() had size 0 instead of expected size 1
{noformat}

The reason is that when running {{spark.catalog.cacheTable("localRelation")}}, 
{{CacheManager}} caches the plan wrapped in {{SubqueryAlias}}, but when the 
DataFrame {{localRelation}} is planned, {{CacheManager}} looks up the cached 
table with the unwrapped plan, because the plan for the DataFrame 
{{localRelation}} is not wrapped.
Some plans, such as {{LocalRelation}} and {{LogicalRDD}}, override the 
{{sameResult}} method but do not compare canonicalized plans, so 
{{CacheManager}} can't detect that the plans are the same.
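
For illustration only, here is a self-contained toy sketch (made-up names, not 
Spark source) of why comparing canonicalized plans fixes the lookup: the cached 
entry is wrapped in an alias while the DataFrame's plan is not, so only the 
canonical forms match.

{code}
// Toy model of the mismatch: "Plan" stands in for LogicalPlan, "Alias" for SubqueryAlias.
sealed trait Plan { def canonical: Plan = this }
case class Relation(data: Seq[Int]) extends Plan
case class Alias(name: String, child: Plan) extends Plan {
  override def canonical: Plan = child.canonical   // canonicalization drops the alias wrapper
}

// A sameResult built on canonical forms matches the wrapped and unwrapped plans.
def sameResult(a: Plan, b: Plan): Boolean = a.canonical == b.canonical

val cachedEntry   = Alias("localRelation", Relation(Seq(1, 2, 3)))  // what cacheTable stores
val dataFramePlan = Relation(Seq(1, 2, 3))                          // what the DataFrame plans

println(cachedEntry == dataFramePlan)           // false: raw comparison misses the cache
println(sameResult(cachedEntry, dataFramePlan)) // true: canonicalized comparison finds it
{code}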






[jira] [Commented] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326921#comment-15326921
 ] 

Apache Spark commented on SPARK-15915:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13638

> CacheManager should use canonicalized plan for planToCache.
> ---
>
> Key: SPARK-15915
> URL: https://issues.apache.org/jira/browse/SPARK-15915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> A {{DataFrame}} whose plan overrides {{sameResult}} but does not compare 
> canonicalized plans cannot be cached via {{cacheTable}}.
> For example:
> {code}
> val localRelation = Seq(1, 2, 3).toDF()
> localRelation.createOrReplaceTempView("localRelation")
> spark.catalog.cacheTable("localRelation")
> assert(
>   localRelation.queryExecution.withCachedData.collect {
> case i: InMemoryRelation => i
>   }.size == 1)
> {code}
> and this will fail as:
> {noformat}
> ArrayBuffer() had size 0 instead of expected size 1
> {noformat}
> The reason is that when running {{spark.catalog.cacheTable("localRelation")}}, 
> {{CacheManager}} caches the plan wrapped in {{SubqueryAlias}}, but when the 
> DataFrame {{localRelation}} is planned, {{CacheManager}} looks up the cached 
> table with the unwrapped plan, because the plan for the DataFrame 
> {{localRelation}} is not wrapped.
> Some plans, such as {{LocalRelation}} and {{LogicalRDD}}, override the 
> {{sameResult}} method but do not compare canonicalized plans, so 
> {{CacheManager}} can't detect that the plans are the same.






[jira] [Assigned] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15915:


Assignee: (was: Apache Spark)

> CacheManager should use canonicalized plan for planToCache.
> ---
>
> Key: SPARK-15915
> URL: https://issues.apache.org/jira/browse/SPARK-15915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> A {{DataFrame}} whose plan overrides {{sameResult}} but does not compare 
> canonicalized plans cannot be cached via {{cacheTable}}.
> For example:
> {code}
> val localRelation = Seq(1, 2, 3).toDF()
> localRelation.createOrReplaceTempView("localRelation")
> spark.catalog.cacheTable("localRelation")
> assert(
>   localRelation.queryExecution.withCachedData.collect {
> case i: InMemoryRelation => i
>   }.size == 1)
> {code}
> and this will fail as:
> {noformat}
> ArrayBuffer() had size 0 instead of expected size 1
> {noformat}
> The reason is that when running {{spark.catalog.cacheTable("localRelation")}}, 
> {{CacheManager}} caches the plan wrapped in {{SubqueryAlias}}, but when the 
> DataFrame {{localRelation}} is planned, {{CacheManager}} looks up the cached 
> table with the unwrapped plan, because the plan for the DataFrame 
> {{localRelation}} is not wrapped.
> Some plans, such as {{LocalRelation}} and {{LogicalRDD}}, override the 
> {{sameResult}} method but do not compare canonicalized plans, so 
> {{CacheManager}} can't detect that the plans are the same.






[jira] [Assigned] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15915:


Assignee: Apache Spark

> CacheManager should use canonicalized plan for planToCache.
> ---
>
> Key: SPARK-15915
> URL: https://issues.apache.org/jira/browse/SPARK-15915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> A {{DataFrame}} whose plan overrides {{sameResult}} but does not compare 
> canonicalized plans cannot be cached via {{cacheTable}}.
> For example:
> {code}
> val localRelation = Seq(1, 2, 3).toDF()
> localRelation.createOrReplaceTempView("localRelation")
> spark.catalog.cacheTable("localRelation")
> assert(
>   localRelation.queryExecution.withCachedData.collect {
> case i: InMemoryRelation => i
>   }.size == 1)
> {code}
> and this will fail as:
> {noformat}
> ArrayBuffer() had size 0 instead of expected size 1
> {noformat}
> The reason is that when running {{spark.catalog.cacheTable("localRelation")}}, 
> {{CacheManager}} caches the plan wrapped in {{SubqueryAlias}}, but when the 
> DataFrame {{localRelation}} is planned, {{CacheManager}} looks up the cached 
> table with the unwrapped plan, because the plan for the DataFrame 
> {{localRelation}} is not wrapped.
> Some plans, such as {{LocalRelation}} and {{LogicalRDD}}, override the 
> {{sameResult}} method but do not compare canonicalized plans, so 
> {{CacheManager}} can't detect that the plans are the same.






[jira] [Created] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence

2016-06-13 Thread Piotr Czarnas (JIRA)
Piotr Czarnas created SPARK-15916:
-

 Summary: JDBC AND/OR operator push down does not respect lower OR 
operator precedence
 Key: SPARK-15916
 URL: https://issues.apache.org/jira/browse/SPARK-15916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Piotr Czarnas


A table from the SQL Server Northwind database was registered as a JDBC DataFrame.
A query was executed on Spark SQL; the "northwind_dbo_Categories" table is a 
temporary table backed by a JDBC DataFrame over the "[northwind].[dbo].[Categories]" 
SQL Server table.

SQL executed on the Spark SQL context:
SELECT CategoryID FROM northwind_dbo_Categories
WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages'


Spark did a proper predicate pushdown to JDBC; however, the parentheses around 
the two OR conditions were removed. Instead, the following query was sent over JDBC 
to SQL Server:
SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 1) 
OR (CategoryID = 2) AND CategoryName = 'Beverages'


As a result, the last two conditions (around the AND operator) were treated as 
having the highest precedence: (CategoryID = 2) AND CategoryName = 'Beverages'

Finally, SQL Server executed a query equivalent to:
SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 OR 
(CategoryID = 2 AND CategoryName = 'Beverages')
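
To make the precedence problem concrete, here is a self-contained toy sketch 
(not Spark's actual JDBC filter compilation; all names are made up) showing how 
emitting parentheses only around leaf comparisons loses the grouping of the OR, 
while parenthesizing every compound expression preserves it:

{code}
sealed trait Filter
case class Eq(col: String, value: String) extends Filter
case class Or(left: Filter, right: Filter) extends Filter
case class And(left: Filter, right: Filter) extends Filter

// Parenthesizes only the leaves, so AND binds tighter than the intended OR group.
def compileLossy(f: Filter): String = f match {
  case Eq(c, v)  => s"($c = $v)"
  case Or(l, r)  => s"${compileLossy(l)} OR ${compileLossy(r)}"
  case And(l, r) => s"${compileLossy(l)} AND ${compileLossy(r)}"
}

// Parenthesizes every compound expression, so the original precedence is preserved.
def compileSafe(f: Filter): String = f match {
  case Eq(c, v)  => s"$c = $v"
  case Or(l, r)  => s"(${compileSafe(l)} OR ${compileSafe(r)})"
  case And(l, r) => s"(${compileSafe(l)} AND ${compileSafe(r)})"
}

val pushed = And(Or(Eq("CategoryID", "1"), Eq("CategoryID", "2")),
                 Eq("CategoryName", "'Beverages'"))

println(compileLossy(pushed))
// (CategoryID = 1) OR (CategoryID = 2) AND (CategoryName = 'Beverages')   <- wrong grouping
println(compileSafe(pushed))
// ((CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages')     <- intended grouping
{code}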







[jira] [Commented] (SPARK-14503) spark.ml API for FPGrowth

2016-06-13 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326977#comment-15326977
 ] 

Jeff Zhang commented on SPARK-14503:


[~GayathriMurali] [~yuhaoyan] Are you still working on this? If not, I can help 
continue it.

> spark.ml API for FPGrowth
> -
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.






[jira] [Commented] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326979#comment-15326979
 ] 

Sean Owen commented on SPARK-15796:
---

A new parameter like that would just be going back to the old behavior, and I 
think there was a good reason to simplify the settings (see above). 
I agree that it seems like we need more breathing room, so I would argue for 
the 0.6 limit as well now, and some more extensive documentation about what to 
do to NewRatio when increasing this. NewRatio N needs to be large enough so 
that N/(N+1) comfortably exceeds {{spark.memory.fraction}}.
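
To make the arithmetic concrete, using the numbers from the description below:

{noformat}
NewRatio=2 (OpenJDK default): old gen = 2/3 ≈ 0.66 of the heap, below the
                              default spark.memory.fraction of 0.75
NewRatio=3:                   old gen = 3/4 = 0.75, only equal to the default, no headroom
spark.memory.fraction=0.6:    0.6 < 0.66, so the managed memory fits in the old gen
                              under the default NewRatio
{noformat}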

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
> Attachments: baseline.txt, memfrac06.txt, memfrac063.txt, 
> memfrac066.txt
>
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/

[jira] [Commented] (SPARK-14503) spark.ml API for FPGrowth

2016-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326994#comment-15326994
 ] 

yuhao yang commented on SPARK-14503:


Hi Jeff, you're welcome to contribute. 
I'm discussing with some industry users what the optimal interface for FPM would 
be, especially what the output column should contain. I'd appreciate it if you 
could share some thoughts.
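
Purely as a strawman for that discussion (hypothetical names, nothing decided), 
a DataFrame-based FPGrowth could follow the usual Estimator/Model shape, with 
the open question being what {{transform}}'s output column should contain:

{code}
// Strawman interface only - every name here is illustrative.
import org.apache.spark.sql.DataFrame

trait FPGrowthLike {
  def setItemsCol(value: String): this.type     // input column holding arrays of items
  def setMinSupport(value: Double): this.type
  def fit(dataset: DataFrame): FPGrowthModelLike
}

trait FPGrowthModelLike {
  def freqItemsets: DataFrame                   // e.g. columns: items, freq
  def associationRules: DataFrame               // e.g. columns: antecedent, consequent, confidence
  def transform(dataset: DataFrame): DataFrame  // output column contents are the open question above
}
{code}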

> spark.ml API for FPGrowth
> -
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.






[jira] [Updated] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15796:
--
Priority: Blocker  (was: Minor)

Pardon marking this "Blocker", but I think this needs some attention before 
2.0, if in fact the default memory settings for the new memory manager and JVM 
ergonomics don't play well together. It's an easy resolution one way or the 
other -- mostly a question of defaults and docs.

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Blocker
> Attachments: baseline.txt, memfrac06.txt, memfrac063.txt, 
> memfrac066.txt
>
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to

[jira] [Resolved] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15813.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13552
[https://github.com/apache/spark/pull/13552]

> Spark Dyn Allocation Cancel log message misleading
> --
>
> Key: SPARK-15813
> URL: https://issues.apache.org/jira/browse/SPARK-15813
> Project: Spark
>  Issue Type: Bug
>Reporter: Peter Ableda
>Priority: Trivial
> Fix For: 2.0.0
>
>
> The *Driver requested* message is logged before the *Canceling* message but 
> already shows the updated executor number, so the messages are misleading.
> See log snippet:
> {code}
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 619 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
> 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
> 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
> 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
> 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
> 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
> 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
> 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
> 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
> 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
> 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 614 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
> 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 
> bytes)
> {code}
> The easy solution is to update the message to use past tense. This is 
> consistent with the other messages there.
> *Canceled requests for 5 executor container(s).*






[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15813:
--
Assignee: Peter Ableda

> Spark Dyn Allocation Cancel log message misleading
> --
>
> Key: SPARK-15813
> URL: https://issues.apache.org/jira/browse/SPARK-15813
> Project: Spark
>  Issue Type: Bug
>Reporter: Peter Ableda
>Assignee: Peter Ableda
>Priority: Trivial
> Fix For: 2.0.0
>
>
> The *Driver requested* message is logged before the *Canceling* message but 
> already shows the updated executor number, so the messages are misleading.
> See log snippet:
> {code}
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 619 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
> 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
> 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
> 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
> 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
> 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
> 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
> 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
> 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
> 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
> 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 614 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
> 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 
> bytes)
> {code}
> The easy solution is to update the message to use past tense. This is 
> consistent with the other messages there.
> *Canceled requests for 5 executor container(s).*






[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15813:
--
Issue Type: Improvement  (was: Bug)

> Spark Dyn Allocation Cancel log message misleading
> --
>
> Key: SPARK-15813
> URL: https://issues.apache.org/jira/browse/SPARK-15813
> Project: Spark
>  Issue Type: Improvement
>Reporter: Peter Ableda
>Assignee: Peter Ableda
>Priority: Trivial
> Fix For: 2.0.0
>
>
> The *Driver requested* message is logged before the *Canceling* message but 
> already shows the updated executor number, so the messages are misleading.
> See log snippet:
> {code}
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 619 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
> 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
> 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
> 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
> 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
> 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
> 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
> 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
> 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
> 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
> 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 614 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
> 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 
> bytes)
> {code}
> The easy solution is to update the message to use past tense. This is 
> consistent with the other messages there.
> *Canceled requests for 5 executor container(s).*






[jira] [Updated] (SPARK-6320) Adding new query plan strategy to SQLContext

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6320:
-
Assignee: Takuya Ueshin

> Adding new query plan strategy to SQLContext
> 
>
> Key: SPARK-6320
> URL: https://issues.apache.org/jira/browse/SPARK-6320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Youssef Hatem
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hi,
> I would like to add a new strategy to {{SQLContext}}. To do this I created a 
> new class which extends {{Strategy}}. In my new class I need to call 
> {{planLater}} function. However this method is defined in {{SparkPlanner}} 
> (which itself inherits the method from {{QueryPlanner}}).
> To my knowledge the only way to make {{planLater}} function visible to my new 
> strategy is to define my strategy inside another class that extends 
> {{SparkPlanner}} and inherits {{planLater}} as a result, by doing so I will 
> have to extend the {{SQLContext}} such that I can override the {{planner}} 
> field with the new {{Planner}} class I created.
> It seems that this is a design problem because adding a new strategy seems to 
> require extending {{SQLContext}} (unless I am doing it wrong and there is a 
> better way to do it).
> Thanks a lot,
> Youssef
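
As a self-contained toy illustration of the visibility problem described above 
(not Spark source; names are made up): {{planLater}} lives on the planner, so a 
strategy that needs it has to be nested inside a planner subclass, which in turn 
forces overriding the planner on the context.

{code}
abstract class ToyPlanner {
  protected def planLater(logical: String): String = s"<planned later: $logical>"
  trait ToyStrategy { def apply(logical: String): Seq[String] }
}

class MyPlanner extends ToyPlanner {
  // Nested inside a planner subclass solely so that planLater is visible here.
  object MyStrategy extends ToyStrategy {
    def apply(logical: String): Seq[String] = Seq(planLater(logical))
  }
}
{code}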






[jira] [Updated] (SPARK-15788) PySpark IDFModel missing "idf" property

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15788:
--
Assignee: Jeff Zhang

> PySpark IDFModel missing "idf" property
> ---
>
> Key: SPARK-15788
> URL: https://issues.apache.org/jira/browse/SPARK-15788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Assignee: Jeff Zhang
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Scala {{IDFModel}} has a method {{def idf: Vector = idfModel.idf.asML}} - 
> this should be exposed on the Python side as a property






[jira] [Updated] (SPARK-15489) Dataset kryo encoder won't load custom user settings

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15489:
--
Assignee: Amit Sela

> Dataset kryo encoder won't load custom user settings 
> -
>
> Key: SPARK-15489
> URL: https://issues.apache.org/jira/browse/SPARK-15489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Amit Sela
>Assignee: Amit Sela
> Fix For: 2.0.0
>
>
> When setting a custom "spark.kryo.registrator" (or any other configuration 
> for that matter) through the API, this configuration will not propagate to 
> the encoder that uses a KryoSerializer since it instantiates with "new 
> SparkConf()".
> See:  
> https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554
> This could be hacked by providing those configurations as System properties, 
> but this probably should be passed to the encoder and set in the 
> SerializerInstance after creation.
> Example:
> When using Encoders with kryo to encode generically typed Objects in the 
> following manner:
> public static <T> Encoder<T> encoder() {
>   return Encoders.kryo((Class<T>) Object.class);
> }
> I get a decoding exception when trying to decode 
> `java.util.Collections$UnmodifiableCollection`, which probably comes from 
> Guava's `ImmutableList`.
> This happens when running with master = local[1]. Same code had no problems 
> with RDD api.
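
A minimal sketch of the mismatch described above ({{com.example.MyRegistrator}} 
is a made-up class name; illustrative only):

{code}
import org.apache.spark.SparkConf

// Configuration set through the API on one SparkConf instance...
val conf = new SparkConf().set("spark.kryo.registrator", "com.example.MyRegistrator")

// ...is invisible to code that builds its own conf, because new SparkConf() only
// picks up spark.* system properties, not settings made on other instances.
println(new SparkConf().contains("spark.kryo.registrator"))  // false

// The hack mentioned above: publish the setting as a system property instead.
sys.props("spark.kryo.registrator") = "com.example.MyRegistrator"
println(new SparkConf().contains("spark.kryo.registrator"))  // true
{code}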






[jira] [Updated] (SPARK-15743) Prevent saving with all-column partitioning

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15743:
--
Assignee: Dongjoon Hyun

> Prevent saving with all-column partitioning
> ---
>
> Key: SPARK-15743
> URL: https://issues.apache.org/jira/browse/SPARK-15743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> When saving datasets to storage, `partitionBy` provides an easy way to 
> construct the directory structure. However, if a user chooses all columns as 
> partition columns, exceptions occur.
> - ORC: `AnalysisException` on **future read** due to schema inference failure.
> - Parquet: `InvalidSchemaException` on **write execution** due to Parquet 
> limitation.
> The following examples illustrate this.
> **ORC with all column partitioning**
> {code}
> scala> 
> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data")
>   
>   
> scala> spark.read.format("orc").load("/tmp/data").collect()
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /tmp/data. It must be specified manually;
> {code}
> **Parquet with all-column partitioning**
> {code}
> scala> 
> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data")
> [Stage 0:>  (0 + 8) / 
> 8]16/06/02 16:51:17 ERROR Utils: Aborting task
> org.apache.parquet.schema.InvalidSchemaException: A group type can not be 
> empty. Parquet does not support empty group without leaves. Empty group: 
> spark_schema
> ... (lots of error messages)
> {code}
> Although some formats like JSON support all-column partitioning without any 
> problem, it does not seem like a good idea to create lots of empty directories. 
> This issue prevents that by consistently raising `AnalysisException` before 
> saving.
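
For illustration, the kind of guard this implies could look roughly like the 
following (a sketch with made-up names, not the actual patch):

{code}
// Reject the degenerate case up front instead of failing later in the data source.
def validatePartitionColumns(allColumns: Seq[String], partitionColumns: Seq[String]): Unit = {
  if (partitionColumns.nonEmpty && partitionColumns.size == allColumns.size) {
    // In Spark this would surface as an AnalysisException before any write starts.
    throw new IllegalArgumentException("Cannot use all columns for partition columns")
  }
}
{code}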






[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327028#comment-15327028
 ] 

Nick Pentreath commented on SPARK-15790:


Ah thanks - I missed that umbrella. It's mostly the {{ml.feature}} classes, and 
that PR seems to have stalled. I've started a new one to cover the feature 
package.

> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.
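
For reference, the annotation being audited is applied like this (the class name 
and version below are made up for illustration):

{code}
import org.apache.spark.annotation.Since

class ExampleTransformer @Since("2.0.0") (@Since("2.0.0") val uid: String) {
  @Since("2.0.0") def setInputCol(value: String): this.type = this
  @Since("2.0.0") val outputCol: String = "output"
}
{code}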






[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2016-06-13 Thread Murshid Chalaev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327065#comment-15327065
 ] 

Murshid Chalaev commented on SPARK-6628:


Spark 1.6.1 is affected as well. Is there any workaround for this?

> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)






[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327082#comment-15327082
 ] 

yuhao yang commented on SPARK-15904:


Hi [~Purple], what's your k and vector size? Btw, this should not be a 
Major-priority bug.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 10G






[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327090#comment-15327090
 ] 

Alessio commented on SPARK-15904:
-

Hi [~yuhaoyan], the dataset size is 9120 rows and 2125 columns.
This problem appears when K>3000.
What do you suggest as the priority label? I'm sorry if "major" is not appropriate; 
this is my first post on JIRA.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 10G






[jira] [Commented] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence

2016-06-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327106#comment-15327106
 ] 

Hyukjin Kwon commented on SPARK-15916:
--

Indeed. Do you mind if I submit a PR for this?

> JDBC AND/OR operator push down does not respect lower OR operator precedence
> 
>
> Key: SPARK-15916
> URL: https://issues.apache.org/jira/browse/SPARK-15916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Czarnas
>
> A table from the SQL Server Northwind database was registered as a JDBC DataFrame.
> A query was executed on Spark SQL; the "northwind_dbo_Categories" table is a 
> temporary table backed by a JDBC DataFrame over the "[northwind].[dbo].[Categories]" 
> SQL Server table.
> SQL executed on the Spark SQL context:
> SELECT CategoryID FROM northwind_dbo_Categories
> WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages'
> Spark did a proper predicate pushdown to JDBC; however, the parentheses 
> around the two OR conditions were removed. Instead, the following query was 
> sent over JDBC to SQL Server:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 
> 1) OR (CategoryID = 2) AND CategoryName = 'Beverages'
> As a result, the last two conditions (around the AND operator) were treated 
> as having the highest precedence: (CategoryID = 2) AND CategoryName = 
> 'Beverages'
> Finally, SQL Server executed a query equivalent to:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 
> OR (CategoryID = 2 AND CategoryName = 'Beverages')






[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327108#comment-15327108
 ] 

yuhao yang commented on SPARK-15904:


Thanks for reporting it. I'm not sure yet whether the issue is valid. Maybe Type 
-> Improvement, Priority -> Minor as a start.




> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 10G






[jira] [Created] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property

2016-06-13 Thread Jonathan Taws (JIRA)
Jonathan Taws created SPARK-15917:
-

 Summary: Define the number of executors in standalone mode with an 
easy-to-use property
 Key: SPARK-15917
 URL: https://issues.apache.org/jira/browse/SPARK-15917
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Spark Shell, Spark Submit
Affects Versions: 1.6.1
Reporter: Jonathan Taws
Priority: Minor


After stumbling across a few StackOverflow posts around the issue of using a 
fixed number of executors in standalone mode (non-YARN), I was wondering if we 
could not add an easier way to set this parameter than having to resort to some 
calculations based on the number of cores and the memory you have available on 
your worker. 

For example, let's say I have 8 cores and 30GB of memory available.
If no option is passed, one executor will be spawned with 8 cores and 1GB of 
memory allocated.
However, let's say I want to have only *2* executors, and to use 2 cores and 
10GB of memory per executor, I will end up with *3* executors (as the available 
memory will limit the number of executors) instead of the 2 I was hoping for.

Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I want, 
but would it not be easier to add a {{--num-executors}}-like option to 
standalone mode to be able to really fine-tune the configuration? This option 
is already available in YARN mode.
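
To spell out the arithmetic in the example above (standalone mode sizes executors 
per worker by both cores and memory):

{noformat}
worker resources:   8 cores, 30GB
executor request:   2 cores, 10GB
limit by cores:     8 / 2  = 4 executors
limit by memory:    30 / 10 = 3 executors
result:             3 executors (memory is the binding limit), not the 2 that were wanted
with spark.cores.max set to 4 (one way to apply the workaround above): 4 / 2 = 2 executors
{noformat}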

From my understanding, I don't see any other option lying around that can help 
achieve this.  

This seems to be slightly disturbing for newcomers, and standalone mode is 
probably the first thing anyone will use to just try out Spark or test some 
configuration.  






[jira] [Updated] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property

2016-06-13 Thread Jonathan Taws (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Taws updated SPARK-15917:
--
Description: 
After stumbling across a few StackOverflow posts around the issue of using a 
fixed number of executors in standalone mode (non-YARN), I was wondering if we 
could not add an easier way to set this parameter than having to resort to some 
calculations based on the number of cores and the memory you have available on 
your worker. 

For example, let's say I have 8 cores and 30GB of memory available :
 - If no option is passed, one executor will be spawned with 8 cores and 1GB of 
memory allocated.
 - However, if I want to have only *2* executors, and to use 2 cores and 10GB 
of memory per executor, I will end up with *3* executors (as the available 
memory will limit the number of executors) instead of the 2 I was hoping for.

Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I want, 
but would it not be easier to add a {{--num-executors}}-like option to 
standalone mode to be able to really fine-tune the configuration? This option 
is already available in YARN mode.

From my understanding, I don't see any other option lying around that can help 
achieve this.  

This seems to be slightly disturbing for newcomers, and standalone mode is 
probably the first thing anyone will use to just try out Spark or test some 
configuration.  

  was:
After stumbling across a few StackOverflow posts around the issue of using a 
fixed number of executors in standalone mode (non-YARN), I was wondering if we 
could not add an easier way to set this parameter than having to resort to some 
calculations based on the number of cores and the memory you have available on 
your worker. 

For example, let's say I have 8 cores and 30GB of memory available.
If no option is passed, one executor will be spawned with 8 cores and 1GB of 
memory allocated.
However, let's say I want to have only *2* executors, and to use 2 cores and 
10GB of memory per executor, I will end up with *3* executors (as the available 
memory will limit the number of executors) instead of the 2 I was hoping for.

Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I want, 
but would it not be easier to add a {{--num-executors}}-like option to 
standalone mode to be able to really fine-tune the configuration? This option 
is already available in YARN mode.

From my understanding, I don't see any other option lying around that can help 
achieve this.  

This seems to be slightly disturbing for newcomers, and standalone mode is 
probably the first thing anyone will use to just try out Spark or test some 
configuration.  


> Define the number of executors in standalone mode with an easy-to-use property
> --
>
> Key: SPARK-15917
> URL: https://issues.apache.org/jira/browse/SPARK-15917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> After stumbling across a few StackOverflow posts around the issue of using a 
> fixed number of executors in standalone mode (non-YARN), I was wondering if 
> we could not add an easier way to set this parameter than having to resort to 
> some calculations based on the number of cores and the memory you have 
> available on your worker. 
> For example, let's say I have 8 cores and 30GB of memory available :
>  - If no option is passed, one executor will be spawned with 8 cores and 1GB 
> of memory allocated.
>  - However, if I want to have only *2* executors, and to use 2 cores and 10GB 
> of memory per executor, I will end up with *3* executors (as the available 
> memory will limit the number of executors) instead of the 2 I was hoping for.
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I 
> want, but would it not be easier to add a {{--num-executors}}-like option to 
> standalone mode to be able to really fine-tune the configuration? This 
> option is already available in YARN mode.
> From my understanding, I don't see any other option lying around that can 
> help achieve this.  
> This seems to be slightly disturbing for newcomers, and standalone mode is 
> probably the first thing anyone will use to just try out Spark or test some 
> configuration.  






[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Issue Type: Improvement  (was: Bug)

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Priority: Minor  (was: Major)

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence

2016-06-13 Thread Piotr Czarnas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327118#comment-15327118
 ] 

Piotr Czarnas commented on SPARK-15916:
---

Hi,

I hope so. This issue is failing a lot of tests in my project.

Best Regards,
Piotr



> JDBC AND/OR operator push down does not respect lower OR operator precedence
> 
>
> Key: SPARK-15916
> URL: https://issues.apache.org/jira/browse/SPARK-15916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Czarnas
>
> A table from the SQL Server Northwind database was registered as a JDBC 
> DataFrame.
> A query was executed on Spark SQL; the "northwind_dbo_Categories" table is a 
> temporary table backed by a JDBC DataFrame over the SQL Server table 
> "[northwind].[dbo].[Categories]".
> SQL executed on the Spark SQL context:
> SELECT CategoryID FROM northwind_dbo_Categories
> WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages'
> Spark did a proper predicate pushdown to JDBC, however the parentheses around 
> the two OR conditions were removed. Instead, the following query was sent 
> over JDBC to SQL Server:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 
> 1) OR (CategoryID = 2) AND CategoryName = 'Beverages'
> As a result, the last two conditions (around the AND operator) were treated 
> as having the highest precedence: (CategoryID = 2) AND CategoryName = 
> 'Beverages'
> Finally, SQL Server executed a query like this:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 
> OR (CategoryID = 2 AND CategoryName = 'Beverages')
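
For illustration, a minimal sketch (not Spark's actual JDBC code) of compiling pushed-down filters while keeping every compound predicate parenthesised, so the target database cannot re-associate (a OR b) AND c:
{code}
import org.apache.spark.sql.sources.{And, EqualTo, Filter, Or}

// Naive value quoting, for illustration only; real code must handle types and escaping.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value) => Some(s"$attr = '$value'")
  case And(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
  case Or(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) OR ($r)"
  case _ => None // anything unsupported is simply not pushed down
}
{code}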



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327144#comment-15327144
 ] 

Apache Spark commented on SPARK-15916:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/13640

> JDBC AND/OR operator push down does not respect lower OR operator precedence
> 
>
> Key: SPARK-15916
> URL: https://issues.apache.org/jira/browse/SPARK-15916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Czarnas
>
> A table from sql server Northwind database was registered as a JDBC dataframe.
> A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a 
> temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" 
> sql server table:
> SQL executed on Spark sql context:
> SELECT CategoryID FROM northwind_dbo_Categories
> WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages'
> Spark has done a proper predicate pushdown to JDBC, however parenthesis 
> around two OR conditions was removed. Instead the following query was sent 
> over JDBC to SQL Server:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 
> 1) OR (CategoryID = 2) AND CategoryName = 'Beverages'
> As a result, the last two conditions (around the AND operator) were 
> considered as the highest precedence: (CategoryID = 2) AND CategoryName = 
> 'Beverages'
> Finally SQL Server has executed a query like this:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 
> OR (CategoryID = 2 AND CategoryName = 'Beverages')



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15916:


Assignee: (was: Apache Spark)

> JDBC AND/OR operator push down does not respect lower OR operator precedence
> 
>
> Key: SPARK-15916
> URL: https://issues.apache.org/jira/browse/SPARK-15916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Czarnas
>
> A table from sql server Northwind database was registered as a JDBC dataframe.
> A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a 
> temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" 
> sql server table:
> SQL executed on Spark sql context:
> SELECT CategoryID FROM northwind_dbo_Categories
> WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages'
> Spark has done a proper predicate pushdown to JDBC, however parenthesis 
> around two OR conditions was removed. Instead the following query was sent 
> over JDBC to SQL Server:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 
> 1) OR (CategoryID = 2) AND CategoryName = 'Beverages'
> As a result, the last two conditions (around the AND operator) were 
> considered as the highest precedence: (CategoryID = 2) AND CategoryName = 
> 'Beverages'
> Finally SQL Server has executed a query like this:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 
> OR (CategoryID = 2 AND CategoryName = 'Beverages')



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15916) JDBC AND/OR operator push down does not respect lower OR operator precedence

2016-06-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15916:


Assignee: Apache Spark

> JDBC AND/OR operator push down does not respect lower OR operator precedence
> 
>
> Key: SPARK-15916
> URL: https://issues.apache.org/jira/browse/SPARK-15916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Czarnas
>Assignee: Apache Spark
>
> A table from sql server Northwind database was registered as a JDBC dataframe.
> A query was executed on Spark SQL, the "northwind_dbo_Categories" table is a 
> temporary table which is a JDBC dataframe to "[northwind].[dbo].[Categories]" 
> sql server table:
> SQL executed on Spark sql context:
> SELECT CategoryID FROM northwind_dbo_Categories
> WHERE (CategoryID = 1 OR CategoryID = 2) AND CategoryName = 'Beverages'
> Spark has done a proper predicate pushdown to JDBC, however parenthesis 
> around two OR conditions was removed. Instead the following query was sent 
> over JDBC to SQL Server:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE (CategoryID = 
> 1) OR (CategoryID = 2) AND CategoryName = 'Beverages'
> As a result, the last two conditions (around the AND operator) were 
> considered as the highest precedence: (CategoryID = 2) AND CategoryName = 
> 'Beverages'
> Finally SQL Server has executed a query like this:
> SELECT "CategoryID" FROM [northwind].[dbo].[Categories] WHERE CategoryID = 1 
> OR (CategoryID = 2 AND CategoryName = 'Beverages')



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-06-13 Thread Piotr Milanowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Milanowski reopened SPARK-15345:
--

This does not work as expected when using spark-submit. For example, this works 
fine and prints all databases in the Hive storage:
{code}
# file test_db.py
from pyspark.sql import SparkSession
from pyspark import SparkConf

if __name__ == "__main__":
    conf = SparkConf()
    hive_context = (SparkSession.builder.config(conf=conf)
                    .enableHiveSupport().getOrCreate())
    print(hive_context.sql("show databases").collect())
{code}

However, using HiveContext yields only the 'default' database:
{code}
# file test.py
from pyspark.sql import HiveContext
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    hive_context = HiveContext(sc)
    print(hive_context.sql("show databases").collect())

# The result is
# [Row(result='default')]
{code}

Is there something I am still missing? I am using the newest branch-2.0

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0, spark is compiled with hive support (-Phive and 
> -Phvie-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToP

[jira] [Created] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order

2016-06-13 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created SPARK-15918:
-

 Summary: unionAll returns wrong result when two dataframes has 
schema in different order
 Key: SPARK-15918
 URL: https://issues.apache.org/jira/browse/SPARK-15918
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
 Environment: CentOS
Reporter: Prabhu Joseph
 Fix For: 1.6.1


When applying the unionAll operation to DataFrames A and B, which have the same 
schema but with columns in a different order, the result maps column values to 
the wrong columns.

Repro:

{code}

A.show()
+---++---+--+--+-++---+--+---+---+-+
|tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
+---++---+--+--+-++---+--+---+---+-+
+---++---+--+--+-++---+--+---+---+-+

B.show()
+-+---+--+---+---+--+--+--+---+---+--++
|dtype|tag|  time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
+-+---+--+---+---+--+--+--+---+---+--++
|F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
|F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
|F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
|F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
|F|C_FNHXUR716A.CNSTLO|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
|F|C_FNHXUT803Z.CNSTHI|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
|F| C_FNHXUT728.CNSTHI|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
|F| C_FNHXUR806.CNSTHI|1443790800| 13|  2| 0|10| 0| 275|   2015|1.2345| 2015275|
+-+---+--+---+---+--+--+--+---+---+--++

A = A.unionAll(B)
A.show()
+---+---+--+--+--+-++---+--+---+---+-+
|tag|   year_day|   
tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
+---+---+--+--+--+-++---+--+---+---+-+
|  F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
|  F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
|  F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
|  F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
|  F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
|  F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
|  F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
|  F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0|  10|  0|   275|   
2015| 1.2345|2015275.0|
+---+---+--+--+--+-++---+--+---+---+-+
{code}
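
The manual column reordering shown below can also be written generically (a sketch in Scala, assuming both DataFrames share the same column names):
{code}
import org.apache.spark.sql.functions.col

// unionAll in 1.6 resolves columns by position, not by name, so align the
// column order of A to that of B before the union.
val aligned = A.select(B.columns.map(col): _*)
val result  = aligned.unionAll(B)
{code}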

On changing the schema of A according to B and doing unionAll works fine

{code}

C = A.select("dtype","tag","time","tm_hour","tm_mday","tm_min","tm_mon","tm_sec","tm_yday","tm_year","value","year_day")

A = C.unionAll(B)
A.show()

+-+---+--+---+---+--+--+--+---+---+--++
|dtype|tag|  
time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
+-+---+--+---+---+--+--+--+---+---+--++
|F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|
275|   2015|1.2345| 2015275|
|F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|
275|   2015|1.2345| 2015275|
|F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0|
275|   2015|1.2345| 2015275|
|F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0|
275|   2015|1.2345| 2015275|
|F|C_FNHXUR716A.CNSTLO|1443790800| 13|  2| 0|10| 0|
275|   2015|1.2345| 2015275|
|F|C_FNHXUT803Z.CNSTHI|1443790800| 13|  2| 0|10| 0|
275|   2015|1.2345| 2015275|
|F| C_FNHXUT728.CNSTHI|1443790800| 13|  2| 0|10| 0|
275|   2015|1.2345| 2015275|
|F| C_FNHXUR806.CNSTHI|1443790800| 13|  2| 0|

[jira] [Created] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread Aamir Abbas (JIRA)
Aamir Abbas created SPARK-15919:
---

 Summary: DStream "saveAsTextFile" doesn't update the prefix after 
each checkpoint
 Key: SPARK-15919
 URL: https://issues.apache.org/jira/browse/SPARK-15919
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.6.1
 Environment: Amazon EMR
Reporter: Aamir Abbas


I have a Spark streaming job that reads a data stream and saves it as a text 
file after a predefined time interval, using the following call:

stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");

The function getOutputPath() generates a new path every time the function is 
called, depending on the current system time.
However, the output path prefix remains the same for all the batches, which 
effectively means that function is not called again for the next batch of the 
stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8546) PMML export for Naive Bayes

2016-06-13 Thread Radoslaw Gasiorek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327167#comment-15327167
 ] 

Radoslaw Gasiorek commented on SPARK-8546:
--

Hi there, [~josephkb],
We would like to use models built with MLlib to classify data outside Spark, 
i.e. without a Spark context available. We would like to export the models 
built in Spark into PMML format, which would then be read by a standalone Java 
application without a Spark context (but with the MLlib jar).
The Java application would load the model from the PMML file and use it to 
'predict', or rather 'classify', the new data we get.
This feature would let us proceed without big architectural and operational 
changes; without it we might need to make a SparkContext available to the 
standalone application, which would be a bigger operational and architectural 
overhead.

We might need to use plain Java serialization for the proof of concept anyway, 
but surely not for a productionized product.

Can we prioritize this feature as well as 
https://issues.apache.org/jira/browse/SPARK-8542 and 
https://issues.apache.org/jira/browse/SPARK-8543?
What would be the LOE and ETA for these?
Thanks in advance for responses and feedback.
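
For context, MLlib models that already mix in {{PMMLExportable}} (e.g. k-means and the linear models) can be exported today; naive Bayes is exactly what is still missing. A minimal sketch ({{trainingData}} is an assumed {{RDD[Vector]}}):
{code}
import org.apache.spark.mllib.clustering.KMeans

// Models that implement PMMLExportable can be written straight to a PMML file;
// NaiveBayesModel does not yet, which is what this ticket asks for.
val model = KMeans.train(trainingData, 10, 20) // trainingData: RDD[Vector], assumed
model.toPMML("/tmp/kmeans.pmml")               // SparkContext/OutputStream variants also exist
{code}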

> PMML export for Naive Bayes
> ---
>
> Key: SPARK-8546
> URL: https://issues.apache.org/jira/browse/SPARK-8546
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>Priority: Minor
>
> The naive Bayes section of PMML standard can be found at 
> http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to 
> generate PMML for both binomial and multinomial naive Bayes models using 
> JPMML (maybe [~vfed] can help).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15920) Using map on DataFrame

2016-06-13 Thread Piotr Milanowski (JIRA)
Piotr Milanowski created SPARK-15920:


 Summary: Using map on DataFrame
 Key: SPARK-15920
 URL: https://issues.apache.org/jira/browse/SPARK-15920
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
 Environment: branch-2.0
Reporter: Piotr Milanowski


In Spark 1.6 there was a method {{DataFrame.map}} acting as an alias for 
{{DataFrame.rdd.map}}. In Spark 2.0 this functionality no longer exists.

Is there a preferred way of doing a map on a DataFrame without explicitly 
calling {{DataFrame.rdd.map}}? Maybe this functionality should be kept, just 
for backward compatibility purposes?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15293) 'collect_list' function undefined

2016-06-13 Thread Piotr Milanowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Milanowski closed SPARK-15293.


Works fine, thanks.

> 'collect_list' function undefined
> -
>
> Key: SPARK-15293
> URL: https://issues.apache.org/jira/browse/SPARK-15293
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> When using pyspark.sql.functions.collect_list function in sql queries, an 
> error occurs - Undefined function collect_list
> Example:
> {code}
> >>> from pyspark.sql import Row
> >>> #The same with SQLContext
> >>> from pyspark.sql import HiveContext
> >>> from pyspark.sql.functions import collect_list
> >>> sql = HiveContext(sc)
> >>> rows = [Row(age=20, job='Programmer', name='Alice'), Row(age=21, 
> >>> job='Programmer', name='Bob'), Row(age=30, job='Hacker', name='Fred'), 
> >>> Row(age=29, job='PM', name='Tom'), Row(age=50, job='CEO', name='Daisy')]
> >>> df = sql.createDataFrame(rows)
> >>> df.groupby(df.job).agg(df.job, collect_list(df.age))
> Traceback (most recent call last):
>   File "/mnt/mfs/spark-2.0/python/pyspark/sql/utils.py", line 57, in deco
> return f(*a, **kw)
>   File "/mnt/mfs/spark-2.0/python/lib/py4j-0.9.2-src.zip/py4j/protocol.py", 
> line 310, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o193.agg.
> : org.apache.spark.sql.AnalysisException: Undefined function: 'collect_list'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.;
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:719)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:781)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$38.apply(Analyzer.scala:907)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$38.apply(Analyzer.scala:907)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:906)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:894)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$

[jira] [Created] (SPARK-15921) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-13 Thread Rajkumar Singh (JIRA)
Rajkumar Singh created SPARK-15921:
--

 Summary: Spark unable to read partitioned table in avro format and 
column name in upper case
 Key: SPARK-15921
 URL: https://issues.apache.org/jira/browse/SPARK-15921
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.6.0
 Environment: Centos 6.6
Spark 1.6
Reporter: Rajkumar Singh


Reproduce:
{code}
[root@sandbox ~]# cat file1.csv 
rks,2016
[root@sandbox ~]# cat file2.csv 
raj,2015

hive> CREATE TABLE `sample_table`(
>   `name` string)
> PARTITIONED BY ( 
>   `year` int)
> ROW FORMAT DELIMITED 
>   FIELDS TERMINATED BY ',' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.mapred.TextInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table'
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1465816403')
> ;
load data local inpath '/root/file2.csv' overwrite into table sample_table 
partition(year='2015');
load data local inpath '/root/file1.csv' overwrite into table sample_table 
partition(year='2016');

hive> CREATE TABLE sample_table_uppercase
> PARTITIONeD BY ( YEAR INT)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES (
>'avro.schema.literal'='{
>   "namespace": "com.rishav.avro",
>"name": "student_marks",
>"type": "record",
>   "fields": [ { "name":"NANME","type":"string"}]
> }');

INSERT OVERWRITE TABLE  sample_table_uppercase partition(Year) select name,year 
from sample_table;

hive> select * from sample_table_uppercase;
OK
raj 2015
rks 2016

now using spark-shell
scala>val tbl = sqlContext.table("default.sample_table_uppercase");
scala>tbl.show
+++
|name|year|
+++
|null|2015|
|null|2016|
+++
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15921) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-13 Thread Rajkumar Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajkumar Singh updated SPARK-15921:
---
Description: 
Spark returns null values if a field name is uppercase in a Hive Avro 
partitioned table.
Reproduce:
{code}
[root@sandbox ~]# cat file1.csv 
rks,2016
[root@sandbox ~]# cat file2.csv 
raj,2015

hive> CREATE TABLE `sample_table`(
>   `name` string)
> PARTITIONED BY ( 
>   `year` int)
> ROW FORMAT DELIMITED 
>   FIELDS TERMINATED BY ',' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.mapred.TextInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table'
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1465816403')
> ;
load data local inpath '/root/file2.csv' overwrite into table sample_table 
partition(year='2015');
load data local inpath '/root/file1.csv' overwrite into table sample_table 
partition(year='2016');

hive> CREATE TABLE sample_table_uppercase
> PARTITIONeD BY ( YEAR INT)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES (
>'avro.schema.literal'='{
>   "namespace": "com.rishav.avro",
>"name": "student_marks",
>"type": "record",
>   "fields": [ { "name":"NANME","type":"string"}]
> }');

INSERT OVERWRITE TABLE  sample_table_uppercase partition(Year) select name,year 
from sample_table;

hive> select * from sample_table_uppercase;
OK
raj 2015
rks 2016

now using spark-shell
scala>val tbl = sqlContext.table("default.sample_table_uppercase");
scala>tbl.show
+++
|name|year|
+++
|null|2015|
|null|2016|
+++
{code}

  was:
Reproduce:
{code}
[root@sandbox ~]# cat file1.csv 
rks,2016
[root@sandbox ~]# cat file2.csv 
raj,2015

hive> CREATE TABLE `sample_table`(
>   `name` string)
> PARTITIONED BY ( 
>   `year` int)
> ROW FORMAT DELIMITED 
>   FIELDS TERMINATED BY ',' 
> STORED AS INPUTFORMAT 
>   'org.apache.hadoop.mapred.TextInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table'
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1465816403')
> ;
load data local inpath '/root/file2.csv' overwrite into table sample_table 
partition(year='2015');
load data local inpath '/root/file1.csv' overwrite into table sample_table 
partition(year='2016');

hive> CREATE TABLE sample_table_uppercase
> PARTITIONeD BY ( YEAR INT)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES (
>'avro.schema.literal'='{
>   "namespace": "com.rishav.avro",
>"name": "student_marks",
>"type": "record",
>   "fields": [ { "name":"NANME","type":"string"}]
> }');

INSERT OVERWRITE TABLE  sample_table_uppercase partition(Year) select name,year 
from sample_table;

hive> select * from sample_table_uppercase;
OK
raj 2015
rks 2016

now using spark-shell
scala>val tbl = sqlContext.table("default.sample_table_uppercase");
scala>tbl.show
+++
|name|year|
+++
|null|2015|
|null|2016|
+++
{code}


> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15921
> URL: https://issues.apache.org/jira/browse/SPARK-15921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Centos 6.6
> Spark 1.6
>Reporter: Rajkumar Singh
>
> Spark returns null values if a field name is uppercase in a Hive Avro 
> partitioned table.
> Reproduce:
> {code}
> [root@sandbox ~]# cat file1.csv 
> rks,2016
> [root@sandbox ~]# cat file2.csv 
> raj,2015
> hive> CREATE TABLE `sample_table`(
> >   `name` string)
> > PARTITIONED BY ( 
> >   `year` int)
> > ROW FORMAT DELIMITED 
> >   FIELDS TERMINATED BY ',' 
> > STORED AS INPUTFORMAT 
> >   'org.apache.hadoop.mapred.TextInputFormat' 
> > OUTPUTFORMAT 
> >   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> > LOCATION
> >   'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse

[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327193#comment-15327193
 ] 

Nick Pentreath commented on SPARK-15790:


Yes, I've just looked at things in the concrete classes - params & methods 
defined in the traits etc are not annotated.

> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10258) Add @Since annotation to ml.feature

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327197#comment-15327197
 ] 

Apache Spark commented on SPARK-10258:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/13641

> Add @Since annotation to ml.feature
> ---
>
> Key: SPARK-10258
> URL: https://issues.apache.org/jira/browse/SPARK-10258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Martin Brown
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2016-06-13 Thread Teng Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327201#comment-15327201
 ] 

Teng Qiu commented on SPARK-6628:
-

This is caused by a missing interface implementation in 
HiveHBaseTableOutputFormat (or HiveAccumuloTableOutputFormat). I created this 
issue in the Hive project: https://issues.apache.org/jira/browse/HIVE-13170 and 
made this PR for the Hive-Accumulo connector (AccumuloStorageHandler): 
https://github.com/apache/hive/pull/66/files

You can make similar changes for hive-hbase as well.

> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15920) Using map on DataFrame

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15920.
---
  Resolution: Not A Problem
Target Version/s:   (was: 2.0.0)

Don't set Target please, and this question should go to user@

> Using map on DataFrame
> --
>
> Key: SPARK-15920
> URL: https://issues.apache.org/jira/browse/SPARK-15920
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: branch-2.0
>Reporter: Piotr Milanowski
>
> In Spark 1.6 there was a method {{DataFrame.map}} as an alias to 
> {{DataFrame.rdd.map}}. In spark 2.0 this functionality no longer exists.
> Is there a preferred way of doing map on a DataFrame without explicitly 
> calling {{DataFrame.rdd.map}}? Maybe this functionality should be kept, just 
> for backward compatibility purpose?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8546) PMML export for Naive Bayes

2016-06-13 Thread Villu Ruusmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327205#comment-15327205
 ] 

Villu Ruusmann commented on SPARK-8546:
---

Hi [~rgasiorek] - would it be an option to re-build your models in Spark ML 
instead of MLlib? I have been working on Spark ML pipelines-to-PMML converter 
called JPMML-SparkML (https://github.com/jpmml/jpmml-sparkml), which could 
fully address your use case then. JPMML-SparkML supports all tree-based models 
and the majority of non-NLP domain transformations. It would be possible to add 
support for the `classification.NaiveBayesModel` model type in a day or two if 
needed.

> PMML export for Naive Bayes
> ---
>
> Key: SPARK-8546
> URL: https://issues.apache.org/jira/browse/SPARK-8546
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>Priority: Minor
>
> The naive Bayes section of PMML standard can be found at 
> http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to 
> generate PMML for both binomial and multinomial naive Bayes models using 
> JPMML (maybe [~vfed] can help).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread binde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327209#comment-15327209
 ] 

binde commented on SPARK-15919:
---

This is not a bug; getOutputPath() is only invoked once, when the job starts.

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream, and saves it as a text 
> file after a predefined time interval. In the function 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time the function is 
> called, depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means that function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread Aamir Abbas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327212#comment-15327212
 ] 

Aamir Abbas commented on SPARK-15919:
-

I need to save the output of each batch in a different place. This is available 
for a regular Spark job, and should be available for streaming data as well. 
Should I add this as a feature request?

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream, and saves it as a text 
> file after a predefined time interval. In the function 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time the function is 
> called, depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means that function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327220#comment-15327220
 ] 

Nick Pentreath commented on SPARK-15904:


Could you explain why you're using K>3000 when your dataset has dimension ~2000?

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2016-06-13 Thread Murshid Chalaev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327224#comment-15327224
 ] 

Murshid Chalaev commented on SPARK-6628:


Thank you

> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15919.
---
Resolution: Not A Problem

No, this is simple to accomplish in Spark already. You need to use foreachRDD 
to get each RDD together with its batch time, and use that time in your call to 
saveAsTextFile on the RDD.
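
A minimal sketch of that suggestion in the Scala API (the report uses the Java API; {{getOutputPath}} is the reporter's own helper, here assumed to take the batch time):
{code}
// Derive the output path from the batch time inside foreachRDD, so every batch
// gets a fresh prefix instead of the one computed once when the DStream graph is set up.
stream.repartition(1).foreachRDD { (rdd, time) =>
  rdd.saveAsTextFile(getOutputPath(time.milliseconds))
}
{code}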

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream, and saves it as a text 
> file after a predefined time interval. In the function 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time the function is 
> called, depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means that function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread Aamir Abbas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327229#comment-15327229
 ] 

Aamir Abbas commented on SPARK-15919:
-

foreachRDD is fine if you want to save individual RDDs separately. I need to do 
this for the entire batch of the stream. Could you please share a link to the 
relevant documentation that would help me save the entire batch of the stream 
like this?

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream, and saves it as a text 
> file after a predefined time interval. In the function 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time the function is 
> called, depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means that function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327234#comment-15327234
 ] 

Alessio commented on SPARK-15904:
-

My dataset has 9000+ patterns, each of which has 2000+ attributes. Thus it's 
perfectly legitimate to search for K > 3000, as long as K is (of course) smaller 
than or equal to the number of patterns (9120).

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12623) map key_values to values

2016-06-13 Thread Elazar Gershuni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327236#comment-15327236
 ] 

Elazar Gershuni commented on SPARK-12623:
-

At the very least, it should have a "won't fix" status, rather than "resolved".

How can I suggest this change to Spark 2.0?

> map key_values to values
> 
>
> Key: SPARK-12623
> URL: https://issues.apache.org/jira/browse/SPARK-12623
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Elazar Gershuni
>Priority: Minor
>  Labels: easyfix, features, performance
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Why doesn't the argument to mapValues() take a key as an agument? 
> Alternatively, can we have a "mapKeyValuesToValues" that does?
> Use case: I want to write a simpler analyzer that takes the argument to 
> map(), and analyze it to see whether it (trivially) doesn't change the key, 
> e.g. 
> g = lambda kv: (kv[0], f(kv[0], kv[1]))
> rdd.map(g)
> Problem is, if I find that it is the case, I can't call mapValues() with that 
> function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues 
> receives only `v` as an argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15746) SchemaUtils.checkColumnType with VectorUDT prints instance details in error message

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327237#comment-15327237
 ] 

Nick Pentreath commented on SPARK-15746:


I think you can go ahead now - I also vote for the {{case object VectorUDT}} 
approach.
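
As a rough, self-contained illustration of why the case-object route prints more 
cleanly (made-up names, not Spark's actual internals):

{code}
// A plain class instance falls back to Object.toString, which is what leaks
// "...VectorUDT@3bfc3ba7" into the error message.
class MyUDT

// A companion case object gets a clean, stable toString.
case object MyUDT extends MyUDT

object Demo extends App {
  println(new MyUDT)  // e.g. MyUDT@3bfc3ba7
  println(MyUDT)      // MyUDT
}
{code}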

> SchemaUtils.checkColumnType with VectorUDT prints instance details in error 
> message
> ---
>
> Key: SPARK-15746
> URL: https://issues.apache.org/jira/browse/SPARK-15746
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently, many feature transformers in {{ml}} use 
> {{SchemaUtils.checkColumnType(schema, ..., new VectorUDT)}} to check the 
> column type is a ({{ml.linalg}}) vector.
> The resulting error message contains "instance" info for the {{VectorUDT}}, 
> i.e. something like this:
> {code}
> java.lang.IllegalArgumentException: requirement failed: Column features must 
> be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually 
> StringType.
> {code}
> A solution would either be to amend {{SchemaUtils.checkColumnType}} to print 
> the error message using {{getClass.getName}}, or to create a {{private[spark] 
> case object VectorUDT extends VectorUDT}} for convenience, since it is used 
> so often (and incidentally this would make it easier to put {{VectorUDT}} 
> into lists of data types e.g. schema validation, UDAFs etc).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread Aamir Abbas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aamir Abbas reopened SPARK-15919:
-

This is an issue, as I do not actually need the current timestamp appended to 
the output path. I need the output path itself to be generated afresh for each 
batch, i.e. a genuinely new output path every time.

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream, and saves it as a text 
> file after a predefined time interval. In the function 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time the function is 
> called, depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means that function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12623) map key_values to values

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327248#comment-15327248
 ] 

Sean Owen commented on SPARK-12623:
---

The Status can only be "Resolved". You're referring to the Resolution, which is 
Not A Problem. I think that's accurate for the original issue here, even if in 
practice the exact value doesn't matter a lot. 

If you mean exposing preservesPartitioning on map, yeah I think that's a 
legitimate change to consider and you can make another JIRA.
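
For reference, a hedged sketch (spark-shell style, with an illustrative key-aware 
function {{f}}) of the workaround that exists today: since the keys are unchanged, 
{{mapPartitions}} with {{preservesPartitioning = true}} keeps the partitioner that 
a plain {{map}} would drop.

{code}
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).groupByKey()  // has a partitioner
val f = (k: String, vs: Iterable[Int]) => vs.sum

// map drops the partitioner even though the keys are untouched:
pairs.map { case (k, v) => (k, f(k, v)) }.partitioner             // None

// mapPartitions lets the caller assert that keys (and partitioning) are preserved:
pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, f(k, v)) },
  preservesPartitioning = true
).partitioner                                                     // Some(HashPartitioner(...))
{code}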

> map key_values to values
> 
>
> Key: SPARK-12623
> URL: https://issues.apache.org/jira/browse/SPARK-12623
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Elazar Gershuni
>Priority: Minor
>  Labels: easyfix, features, performance
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Why doesn't the function passed to mapValues() take the key as an argument? 
> Alternatively, can we have a "mapKeyValuesToValues" that does?
> Use case: I want to write a simpler analyzer that takes the argument to 
> map(), and analyze it to see whether it (trivially) doesn't change the key, 
> e.g. 
> g = lambda kv: (kv[0], f(kv[0], kv[1]))
> rdd.map(g)
> Problem is, if I find that it is the case, I can't call mapValues() with that 
> function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues 
> receives only `v` as an argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-15919.
-

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream, and saves it as a text 
> file after a predefined time interval. In the function 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time the function is 
> called, depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means that function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15919.
---
Resolution: Not A Problem

Look at the implementation of DStream.saveAsTextFiles -- about all it does is 
call foreachRDD as I described. You can make this do whatever you like to name 
the file in your own code, but, you have to do something like this to achieve 
what you want. This JIRA should not be reopened.
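
For anyone hitting the same thing, a hedged sketch of the {{foreachRDD}} approach 
described above ({{lines}} stands for the reporter's DStream and 
{{getOutputPath()}} for their own path-generating function; the path layout is 
illustrative):

{code}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Roughly what saveAsTextFiles does internally, but with the prefix re-evaluated
// on every batch instead of once at setup time.
def saveWithFreshPrefix(lines: DStream[String], getOutputPath: () => String): Unit = {
  lines.repartition(1).foreachRDD { (rdd, time: Time) =>
    rdd.saveAsTextFile(getOutputPath() + "-" + time.milliseconds)
  }
}
{code}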

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream, and saves it as a text 
> file after a predefined time interval. In the function 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time the function is 
> called, depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means that function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327255#comment-15327255
 ] 

Sean Owen commented on SPARK-15904:
---

Yeah it's coherent, though typically k << number of points. 
It would help to know more about how you're running, what slows down, what 
-verbose:gc says during this time, etc. It may be a problem with memory 
settings rather than some particular problem with this value of k.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after it reports the number of 
iterations, the cost function value and the running time, there's a nice 
"Removing RDD  from persistence list" stage. However, during this stage 
there's high memory pressure, which is weird, since the RDDs are about to be 
removed. Full log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application through spark-submit with --driver-memory 9G

  was:
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 10G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15921) Spark unable to read partitioned table in avro format and column name in upper case

2016-06-13 Thread Rajkumar Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajkumar Singh closed SPARK-15921.
--
Resolution: Fixed

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15921
> URL: https://issues.apache.org/jira/browse/SPARK-15921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Centos 6.6
> Spark 1.6
>Reporter: Rajkumar Singh
>
> Spark returns null values if the field name is uppercase in a Hive Avro 
> partitioned table.
> Reproduce:
> {code}
> [root@sandbox ~]# cat file1.csv 
> rks,2016
> [root@sandbox ~]# cat file2.csv 
> raj,2015
> hive> CREATE TABLE `sample_table`(
> >   `name` string)
> > PARTITIONED BY ( 
> >   `year` int)
> > ROW FORMAT DELIMITED 
> >   FIELDS TERMINATED BY ',' 
> > STORED AS INPUTFORMAT 
> >   'org.apache.hadoop.mapred.TextInputFormat' 
> > OUTPUTFORMAT 
> >   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> > LOCATION
> >   'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_table'
> > TBLPROPERTIES (
> >   'transient_lastDdlTime'='1465816403')
> > ;
> load data local inpath '/root/file2.csv' overwrite into table sample_table 
> partition(year='2015');
> load data local inpath '/root/file1.csv' overwrite into table sample_table 
> partition(year='2016');
> hive> CREATE TABLE sample_table_uppercase
> > PARTITIONeD BY ( YEAR INT)
> > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> > STORED AS INPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> > OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> > TBLPROPERTIES (
> >'avro.schema.literal'='{
> >   "namespace": "com.rishav.avro",
> >"name": "student_marks",
> >"type": "record",
> >   "fields": [ { "name":"NANME","type":"string"}]
> > }');
> INSERT OVERWRITE TABLE  sample_table_uppercase partition(Year) select 
> name,year from sample_table;
> hive> select * from sample_table_uppercase;
> OK
> raj   2015
> rks   2016
> now using spark-shell
> scala>val tbl = sqlContext.table("default.sample_table_uppercase");
> scala>tbl.show
> +++
> |name|year|
> +++
> |null|2015|
> |null|2016|
> +++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272
 ] 

Alessio commented on SPARK-15904:
-

Dear Sean,
I must certainly agree with you on k << number of points.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 12:41 PM:
---

Dear Sean,
I must certainly agree with you on k << number of points.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8546) PMML export for Naive Bayes

2016-06-13 Thread Radoslaw Gasiorek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327167#comment-15327167
 ] 

Radoslaw Gasiorek edited comment on SPARK-8546 at 6/13/16 12:43 PM:


hi there, [~josephkb], [~apachespark]
We would like to use MLlib-built models to classify outside Spark, i.e. without 
a Spark context available. We would like to export the models built in Spark 
into PMML format, which would then be read by a standalone Java application 
without a Spark context (but with the MLlib jar).
The Java application would load the model from the PMML file and use it to 
'predict', or rather 'classify', the new data we get.
This feature would enable us to proceed without big architectural and 
operational changes; without it we might need to make the SparkContext 
available to the standalone application, which would be a bigger operational 
and architectural overhead.

We might need to use plain Java serialization for the proof of concept anyway, 
but surely not for the productionized product.

Can we prioritize this feature as well as 
https://issues.apache.org/jira/browse/SPARK-8542 and 
https://issues.apache.org/jira/browse/SPARK-8543 ?
What would be the LOE and ETA for these?
Thanks in advance for responses and feedback.
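
For context, this is the workflow we are after, sketched in spark-shell style 
with a model type that already mixes in PMMLExportable today; paths and data are 
illustrative, and the missing piece for us is the same method on NaiveBayesModel.

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Models that mix in PMMLExportable (e.g. KMeansModel) can already be exported:
val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0)))
val model = KMeans.train(data, 2, 10)
model.toPMML("/tmp/kmeans.pmml")  // no equivalent on NaiveBayesModel yet
{code}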


was (Author: rgasiorek):
hi there, [~josephkb]
We would like to use Mllib built models to classify outside spark therefore 
without Spark context available. We would like to export the models built in 
spark into PMML format, that then would be read by a stand alone java 
application without spark context (but with Mllib jar). 
The java application would load the model from the PMML file and would use the 
model to 'predict'  or rather 'classify' the new data we get. 
This feature would enable us to proceed without big architectural and 
operational changes, without this feature we might need get the the 
sparkContext available to the standalone application that would be bigger 
operational and architectural overhead.

We might need to use the plain java serialization for the proof of concept 
anyways, but surely not for produtionized product.

Can we prioritize this feature as well as 
https://issues.apache.org/jira/browse/SPARK-8542 and 
https://issues.apache.org/jira/browse/SPARK-8543 ?
What would be LOE and EAT for these?
thanks guys in advance for responses, and feedback.

> PMML export for Naive Bayes
> ---
>
> Key: SPARK-8546
> URL: https://issues.apache.org/jira/browse/SPARK-8546
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>Priority: Minor
>
> The naive Bayes section of PMML standard can be found at 
> http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to 
> generate PMML for both binomial and multinomial naive Bayes models using 
> JPMML (maybe [~vfed] can help).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 12:44 PM:
---

Dear Sean,
I must certainly agree with you on k << number of points.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 12:45 PM:
---

Dear Sean,
I must certainly agree with you on k << number of points.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327369#comment-15327369
 ] 

Sean Owen commented on SPARK-15904:
---

-verbose:gc is a JVM option and should write to stderr. You'd definitely see 
it; it's pretty verbose.
But are you saying things are running out of memory, or just referring to the 
RDDs being unpersisted? The latter is not necessarily a sign of memory 
shortage. What does memory pressure mean here?
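
(For the record, a hedged example of how GC logging could be switched on for a 
run launched this way; the application file name is made up.)

{noformat}
spark-submit --master "local[*]" --driver-memory 9G \
  --driver-java-options "-verbose:gc" \
  kmeans_app.py
{noformat}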

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327397#comment-15327397
 ] 

Alessio commented on SPARK-15904:
-

Dear [~srowen], 
at the beginning I noticed that "Cleaning RDD” phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to 
normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and 
ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
“Out-of-memory” errors from either Java, Python or Spark.
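
For reference, a minimal sketch of where the "Removing RDD ... from persistence 
list" lines come from ({{data}} is an assumed {{RDD[Vector]}}; KMeans unpersists 
its internal RDDs the same way at the end of a run):

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.storage.StorageLevel

data.persist(StorageLevel.MEMORY_AND_DISK)   // "Memory and Disk", as in the report
val model = KMeans.train(data, 9120, 20)     // k = 9120, 20 max iterations
data.unpersist(blocking = true)              // logs "Removing RDD <id> from persistence list"
{code}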

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327397#comment-15327397
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 1:49 PM:
--

Dear [~srowen], 
at the beginning I noticed that "Cleaning RDD” phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to 
normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and 
ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
“Out-of-memory” errors from either Java, Python or Spark.

Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). 
K-means just finished another run for K=6000. See the memory stat, all of these 
peaks under the Last 24 Hours sections are from Spark, after every K-Means run.
After a couple of minutes, here's the screenshot 
(http://postimg.org/image/qc7re8clt/). The memory pressure indicator is going 
down, but Swap size is 10GB. If I wait a few more minutes, everything will be 
back to normal.


was (Author: purple):
Dear [~srowen], 
at the beginning I noticed that "Cleaning RDD” phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to 
normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and 
ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
“Out-of-memory” errors from either Java, Python or Spark.

Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). 
K-means just finished another run for K=6000. See the memory stat, all of these 
peaks under the Last 24 Hours sections are from Spark, after every K-Means run.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327405#comment-15327405
 ] 

Sean Owen commented on SPARK-15904:
---

Hm, but that only means Spark used a lot of memory, and you gave it permission 
to use a lot of memory -- too much, if you're swapping. That sounds like the 
problem to me. It's happily consuming memory you've told it is there, but it's 
really not. Swapping makes things go very slowly of course. 

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327397#comment-15327397
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 1:48 PM:
--

Dear [~srowen], 
at the beginning I noticed that "Cleaning RDD” phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to 
normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and 
ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
“Out-of-memory” errors from either Java, Python or Spark.

Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). 
K-means just finished another run for K=6000. See the memory stat, all of these 
peaks under the Last 24 Hours sections are from Spark, after every K-Means run.


was (Author: purple):
Dear [~srowen], 
at the beginning I noticed that "Cleaning RDD” phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to 
normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and 
ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
“Out-of-memory” errors from either Java, Python or Spark.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327411#comment-15327411
 ] 

Alessio commented on SPARK-15904:
-

This is absolutely weird to me. I gave Spark 9GB, and if I monitor the memory 
stats during the K-Means execution I can see that Spark/Java has 9GB (nice) and 
no swap whatsoever. After K-Means has reached convergence, during this last 
cleaning stage, everything goes wild.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327430#comment-15327430
 ] 

Sean Owen commented on SPARK-15904:
---

How much RAM does your machine have? 10GB heap means much more than 10GB 
physical memory in the JVM. Not to mention what the OS needs and all other apps 
that are running. If 9GB works OK, this pretty much demonstrates Spark is fine, 
and you overcommitting physical RAM is the problem.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327411#comment-15327411
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 1:55 PM:
--

This is absolutely weird to me. I gave Spark 9GB, and if I monitor the memory 
stats during the K-Means execution I can see that Spark/Java has 9GB (nice) and 
no swap whatsoever. After K-Means has reached convergence, during this last 
cleaning stage, everything goes wild. Also, for the sake of scalability, RDDs 
are persisted on memory *and disk*. So I can't really understand this pressure 
blowup.


was (Author: purple):
This is absolutely weird to me. I gave Spark 9GB and during the K-Means 
execution, if I monitor the memory stat I can see that Spark/Java has 9GB 
(nice) and no Swap whatsoever. After K-means has reached convergence, during 
this last, cleaning stage everything goes wild.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327438#comment-15327438
 ] 

Alessio commented on SPARK-15904:
-

My machine has 16GB of RAM. I also tried closing all the other apps, leaving 
just the Terminal with Spark running. Still no luck.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15904.
---
Resolution: Not A Problem

Memory and disk still means it's also persisting in memory. I think you'll see 
the physical memory used by the JVM is much more than 10GB. Because it works 
fine with _less_ RAM, this really has to be the issue. You should never be 
swapping.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327443#comment-15327443
 ] 

Alessio commented on SPARK-15904:
-

Correct. Memory and Disk gives priority to memory...but my dataset is 400MB, so 
it shouldn't be a problem. If I give Spark less RAM (I tried with 4GB and 8GB), 
Java throws an out-of-memory error for K>3000.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327449#comment-15327449
 ] 

Sean Owen commented on SPARK-15904:
---

Your 400MB data set isn't the only thing in memory or using memory. OK, that's 
new information, but you're also just saying that large k needs more memory. At 
the moment it's not clear whether the usage is unreasonably high, or whether it's 
due to Spark or your code. What ran out of memory?

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-13 Thread Charlie Evans (JIRA)
Charlie Evans created SPARK-15922:
-

 Summary: BlockMatrix to IndexedRowMatrix throws an error
 Key: SPARK-15922
 URL: https://issues.apache.org/jira/browse/SPARK-15922
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Charlie Evans


import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg._

val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new 
DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
val rdd = sc.parallelize(rows)
val matrix = new IndexedRowMatrix(rdd, 3, 3)
val bmat = matrix.toBlockMatrix

val imat = bmat.toIndexedRowMatrix
imat.rows.collect // this throws an error - Caused by: 
java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
same length!




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-13 Thread Charlie Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Evans updated SPARK-15922:
--
Description: 
{code}
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg._

val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new 
DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
val rdd = sc.parallelize(rows)
val matrix = new IndexedRowMatrix(rdd, 3, 3)
val bmat = matrix.toBlockMatrix

val imat = bmat.toIndexedRowMatrix
imat.rows.collect // this throws an error - Caused by: 
java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
same length!


  was:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg._

val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new 
DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
val rdd = sc.parallelize(rows)
val matrix = new IndexedRowMatrix(rdd, 3, 3)
val bmat = matrix.toBlockMatrix

val imat = bmat.toIndexedRowMatrix
imat.rows.collect // this throws an error - Caused by: 
java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
same length!



> BlockMatrix to IndexedRowMatrix throws an error
> ---
>
> Key: SPARK-15922
> URL: https://issues.apache.org/jira/browse/SPARK-15922
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> {code}
> import org.apache.spark.mllib.linalg.distributed._
> import org.apache.spark.mllib.linalg._
> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, 
> new DenseVector(Array(1,2,3))):: IndexedRow(2L, new 
> DenseVector(Array(1,2,3))):: Nil
> val rdd = sc.parallelize(rows)
> val matrix = new IndexedRowMatrix(rdd, 3, 3)
> val bmat = matrix.toBlockMatrix
> val imat = bmat.toIndexedRowMatrix
> imat.rows.collect // this throws an error - Caused by: 
> java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
> same length!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-13 Thread Charlie Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Evans updated SPARK-15922:
--
Description: 
{code}
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg._

val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new 
DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
val rdd = sc.parallelize(rows)
val matrix = new IndexedRowMatrix(rdd, 3, 3)
val bmat = matrix.toBlockMatrix

val imat = bmat.toIndexedRowMatrix
imat.rows.collect // this throws an error - Caused by: 
java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
same length!
{code}

  was:
{code}
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg._

val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new 
DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
val rdd = sc.parallelize(rows)
val matrix = new IndexedRowMatrix(rdd, 3, 3)
val bmat = matrix.toBlockMatrix

val imat = bmat.toIndexedRowMatrix
imat.rows.collect // this throws an error - Caused by: 
java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
same length!



> BlockMatrix to IndexedRowMatrix throws an error
> ---
>
> Key: SPARK-15922
> URL: https://issues.apache.org/jira/browse/SPARK-15922
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> {code}
> import org.apache.spark.mllib.linalg.distributed._
> import org.apache.spark.mllib.linalg._
> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, 
> new DenseVector(Array(1,2,3))):: IndexedRow(2L, new 
> DenseVector(Array(1,2,3))):: Nil
> val rdd = sc.parallelize(rows)
> val matrix = new IndexedRowMatrix(rdd, 3, 3)
> val bmat = matrix.toBlockMatrix
> val imat = bmat.toIndexedRowMatrix
> imat.rows.collect // this throws an error - Caused by: 
> java.lang.IllegalArgumentException: requirement failed: Vectors must be the 
> same length!
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order

2016-06-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15918:
--
Fix Version/s: (was: 1.6.1)

Don't set fix version; 1.6.1 wouldn't make sense anyway.

> unionAll returns wrong result when two dataframes has schema in different 
> order
> ---
>
> Key: SPARK-15918
> URL: https://issues.apache.org/jira/browse/SPARK-15918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: CentOS
>Reporter: Prabhu Joseph
>
> On applying the unionAll operation between dataframes A and B, which both have 
> the same schema but in a different column order, the result has its column-value 
> mapping changed.
> Repro:
> {code}
> A.show()
> +---++---+--+--+-++---+--+---+---+-+
> |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---++---+--+--+-++---+--+---+---+-+
> +---++---+--+--+-++---+--+---+---+-+
> B.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUR716A.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT803Z.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT728.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUR806.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> +-+---+--+---+---+--+--+--+---+---+--++
> A = A.unionAll(B)
> A.show()
> +---+---+--+--+--+-++---+--+---+---+-+
> |tag|   year_day|   
> tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---+---+--+--+--+-++---+--+---+---+-+
> |  F|C_FNHXUT701Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUDP713.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT718.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT703Z.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUR716A.CNSTLO|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F|C_FNHXUT803Z.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUT728.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> |  F| C_FNHXUR806.CNSTHI|1443790800|13| 2|0|  10|  0|   275|  
>  2015| 1.2345|2015275.0|
> +---+---+--+--+--+-++---+--+---+---+-+
> {code}
> Reordering the columns of A to match B's schema and then doing unionAll works fine:
> {code}
> C = 
> A.select("dtype","tag","time","tm_hour","tm_mday","tm_min",”tm_mon”,"tm_sec","tm_yday","tm_year","value","year_day")
> A = C.unionAll(B)
> A.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUDP713.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F| C_FNHXUT718.CNSTHI|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275|
> |F|C_FNHXUT703Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 20152
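
For reference, a minimal Scala sketch of the workaround described in the report 
above. It is not from the report ({{dfA}} and {{dfB}} are hypothetical stand-ins 
for A and B); the point is that {{unionAll}} in 1.6 matches columns by position, 
so one side's columns should be reordered to the other's schema first:

{code}
import org.apache.spark.sql.functions.col

// Reorder dfA's columns to match dfB's schema, then union by position.
val aligned = dfA.select(dfB.columns.map(col): _*)
val unioned = aligned.unionAll(dfB)
{code}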

[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327476#comment-15327476
 ] 

Alessio commented on SPARK-15904:
-

If anyone's interested, the dataset I'm working on is freely available from UCI 
ML Repository 
(http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities).

I tried just now running the above K-Means for K=9120, with --driver-memory 4G. 
The full traceback can be found here (https://ghostbin.com/paste/9pu9k).

The code is absolutely simple; I don't think there's anything wrong with it:

sc = SparkContext("local[*]", "Spark K-Means")
data = sc.textFile()
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))
parsedDataNOID=parsedData.map(lambda pattern: pattern[1:])
parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK)

K_CANDIDATES=

initCentroids=scipy.io.loadmat(<.mat file with initial seeds>)
datatmp=numpy.genfromtxt(,delimiter=",")

for K in K_CANDIDATES:
 clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1, 
epsilon=0.0, initialModel = 
KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1,:]))

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327510#comment-15327510
 ] 

Sean Owen commented on SPARK-15904:
---

Yes, that just means "out of memory". The question is whether this is unusual 
or not. You might try storing the serialized representation in memory, not the 
'raw' object form, which is often bigger. You almost certainly need more 
partitions in the source data: I expect it's just 1 or 2 partitions according 
to the block size, but you probably want the problem broken down into smaller 
chunks rather than processing big chunks at once in memory. It's the second 
arg to textFile.

Finally you may get better results with 2.0, or, by using the ML + Dataset 
APIs. Those are bigger changes though.
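
For reference, a minimal Scala sketch of the storage-level and partitioning 
suggestions above (the path and the partition count are placeholders, and the 
original code in this thread is PySpark):

{code}
import org.apache.spark.storage.StorageLevel

// Ask for more input partitions up front: the second argument to textFile
// is the minimum number of partitions.
val data = sc.textFile("/path/to/dataset.csv", 48)
val parsed = data.map(_.split(',').map(_.toDouble))

// Store the serialized representation in memory rather than raw objects,
// spilling to disk when it does not fit.
parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)
{code}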

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327476#comment-15327476
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 2:48 PM:
--

If anyone's interested, the dataset I'm working on is freely available from UCI 
ML Repository 
(http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities).

I tried just now running the above K-Means for K=9120, with --driver-memory 4G. 
The full traceback can be found here (https://ghostbin.com/paste/9pu9k).

The code is absolutely simple; I don't think there's anything wrong with it:

sc = SparkContext("local[*]", "Spark K-Means")
data = sc.textFile()
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))
parsedDataNOID=parsedData.map(lambda pattern: pattern[1:])
parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK)

K_CANDIDATES=

initCentroids=scipy.io.loadmat(<.mat file with initial seeds>)
datatmp=numpy.genfromtxt(,delimiter=",")

for K in K_CANDIDATES:
 clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1, 
epsilon=0.0, initialModel = 
KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1,:]))


was (Author: purple):
If anyone's interested, the dataset I'm working on is freely available from UCI 
ML Repository 
(http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities).

I tried just now running the above K-Means for K=9120, with --driver-memory 4G. 
The full traceback can be found here (https://ghostbin.com/paste/9pu9k).

The code is absolutely simple, I don't think there's nothing wrong with it:

sc = SparkContext("local[*]", "Spark K-Means")
data = sc.textFile()
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))
parsedDataNOID=parsedData.map(lambda pattern: pattern[1:])
parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK)

K_CANDIDATES=

initCentroids=scipy.io.loadmat(<.mat file with initial seeds>)
datatmp=numpy.genfromtxt(,delimiter=",")

for K in K_CANDIDATES:
 clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1, 
epsilon=0.0, initialModel = 
KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1,:]))

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327542#comment-15327542
 ] 

Alessio commented on SPARK-15904:
-

With the --driver-memory 4G switch I've tried both, with no luck. First I 
changed the storage level to serialized, then I also increased the number of 
partitions (from the default of 12 to 20). Still "out of memory". I guess I'll 
wait for 2.0.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15118) spark couldn't get hive properties in hive-site.xml

2016-06-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327674#comment-15327674
 ] 

Herman van Hovell commented on SPARK-15118:
---

[~eksmile] any update on this?

> spark couldn't get hive properties in hive-site.xml 
> -
>
> Key: SPARK-15118
> URL: https://issues.apache.org/jira/browse/SPARK-15118
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Deploy
>Affects Versions: 1.6.1
> Environment: hadoop-2.7.1.tar.gz;
> apache-hive-2.0.0-bin.tar.gz; 
> spark-1.6.1-bin-hadoop2.6.tgz; 
> scala-2.11.8.tgz
>Reporter: eksmile
>Priority: Blocker
>
> I have three questions.
> First:
> I've already put "hive-site.xml" in $SPARK_HOME/conf, but when I run 
> spark-sql, it tells me "HiveConf of name *** does not exist", repeated many 
> times.
> All of these "HiveConf" properties are in "hive-site.xml", so why do these 
> warnings appear?
> I'm not sure whether this is a bug or not.
> Second:
> In the middle of the logs below, there's a paragraph: "Failed to get 
> database default, returning NoSuchObjectException".
> I don't know whether something is wrong there.
> Third:
> In the middle of the logs, there's a paragraph: "metastore.MetaStoreDirectSql: 
> Using direct SQL, underlying DB is DERBY", 
> but at the end of the logs there's a paragraph: "metastore.MetaStoreDirectSql: 
> Using direct SQL, underlying DB is MYSQL".
> My Hive metastore is MYSQL. Is something wrong here?
> spark-env.sh as follow: 
> export JAVA_HOME=/usr/java/jdk1.8.0_73
> export SCALA_HOME=/home/scala
> export SPARK_MASTER_IP=192.168.124.129
> export SPARK_WORKER_MEMORY=1g
> export HADOOP_CONF_DIR=/usr/hadoop/etc/hadoop
> export HIVE_HOME=/opt/hive
> export HIVE_CONF_DIR=/opt/hive/conf
> export 
> SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/hive/lib/mysql-connector-java-5.1.38-bin.jar
> export HADOOP_HOME=/usr/hadoop
> Thanks for reading 
> Here're the logs:
> [yezt@Master spark]$ bin/spark-sql --master spark://master:7077   
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.metastore.hbase.aggregate.stats.false.positive.probability does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.io.orc.time.counters does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.server2.metrics.enabled does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.am.liveness.connection.timeout.ms does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.server2.thrift.client.connect.retry.limit does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.io.allocator.direct does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.auto.enforce.stats does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.client.consistent.splits does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.server2.tez.session.lifetime does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.timedout.txn.reaper.start does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.metastore.hbase.cache.ttl does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.management.acl does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.daemon.delegation.token.lifetime does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.strict.checks.large.query does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.io.allocator.alloc.min does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.server2.thrift.client.user does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.daemon.wait.queue.comparator.class.name does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.daemon.am.liveness.heartbeat.interval.ms does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.object.cache.enabled does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.server2.webui.use.ssl does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name hive.metastore.local 
> does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.service.metrics.file.location does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.server2.thrift.client.retry.delay.seconds does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.llap.daemon.num.file.cleaner.threads does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of name 
> hive.test.fail.compaction does not exist
> 16/05/04 16:17:16 WARN conf.HiveConf: HiveConf of 

[jira] [Commented] (SPARK-15370) Some correlated subqueries return incorrect answers

2016-06-13 Thread Luciano Resende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327683#comment-15327683
 ] 

Luciano Resende commented on SPARK-15370:
-

[~hvanhovell] You might need to add [~freiss] to the contributor group in the 
Spark JIRA admin console in order to assign the ticket to Fred. If you don't have 
access to it, maybe [~rxin] can help sort this out.

> Some correlated subqueries return incorrect answers
> ---
>
> Key: SPARK-15370
> URL: https://issues.apache.org/jira/browse/SPARK-15370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Frederick Reiss
>
> The rewrite introduced in SPARK-14785 has the COUNT bug. The rewrite changes 
> the semantics of some correlated subqueries when there are tuples from the 
> outer query block that do not join with the subquery. For example:
> {noformat}
> spark-sql> create table R(a integer) as values (1);
> spark-sql> create table S(b integer);
> spark-sql> select R.a from R 
>  > where (select count(*) from S where R.a = S.b) = 0;
> Time taken: 2.139 seconds 
>   
> spark-sql> 
> (returns zero rows; the answer should be one row of '1')
> {noformat}
> This problem also affects the SELECT clause:
> {noformat}
> spark-sql> select R.a, 
>  > (select count(*) from S where R.a = S.b) as cnt 
>  > from R;
> 1 NULL
> (the answer should be "1 0")
> {noformat}
> Some subqueries with COUNT aggregates are *not* affected:
> {noformat}
> spark-sql> select R.a from R 
>  > where (select count(*) from S where R.a = S.b) > 0;
> Time taken: 0.609 seconds
> spark-sql>
> (Correct answer)
> spark-sql> select R.a from R 
>  > where (select count(*) + sum(S.b) from S where R.a = S.b) = 0;
> Time taken: 0.553 seconds
> spark-sql> 
> (Correct answer)
> {noformat}
> Other cases can trigger the variant of the COUNT bug for expressions 
> involving NULL checks:
> {noformat}
> spark-sql> select R.a from R 
>  > where (select sum(S.b) is null from S where R.a = S.b);
> (returns zero rows, should return one row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327703#comment-15327703
 ] 

Herman van Hovell commented on SPARK-15822:
---

[~robbinspg] You can dump the plan to the console by calling 
{{explain(true)}} on a DataFrame or by prepending {{EXPLAIN EXTENDED ...}} to 
your SQL statement.
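
For example, a minimal sketch (assuming a SparkSession named {{spark}}, a 
DataFrame {{df}}, and a registered table {{t}}):

{code}
// Print the parsed, analyzed, optimized and physical plans for a DataFrame.
df.explain(true)

// The same idea for a SQL statement.
spark.sql("EXPLAIN EXTENDED SELECT * FROM t").show(false)
{code}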

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with a segmentation violation while running an application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is reproducible on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15902) Add a deprecation warning for Python 2.6

2016-06-13 Thread Krishna Kalyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327723#comment-15327723
 ] 

Krishna Kalyan commented on SPARK-15902:


Hi [~holdenk],
I have some questions. Where do I add this warning? In context.py 
(https://github.com/apache/spark/blob/master/python/pyspark/context.py)?
I need to add something like:
{code}
if sys.version_info < (2, 7):
    warnings.warn("Deprecated in 2.1.0. Use Python 2.7+ instead",
                  DeprecationWarning)
{code}
Thanks

> Add a deprecation warning for Python 2.6
> 
>
> Key: SPARK-15902
> URL: https://issues.apache.org/jira/browse/SPARK-15902
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> As we move to Python 2.7+ in Spark 2.1+ it would be good to add a deprecation 
> warning if we detect we are running in Python 2.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-06-13 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-15923:
--

 Summary: Spark Application rest api returns "no such app: "
 Key: SPARK-15923
 URL: https://issues.apache.org/jira/browse/SPARK-15923
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Yesha Vora


Env : secure cluster

Scenario:

* Run SparkPi application in yarn-client or yarn-cluster mode
* After the application finishes, check the Spark History Server REST API to get 
details like jobs / executors etc. 

{code}
http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
 

The REST API returns HTTP code 404 and prints "HTTP Data: no such app: 
application_1465778870517_0001"





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327750#comment-15327750
 ] 

Sean Owen commented on SPARK-15923:
---

[~tgraves] or [~ste...@apache.org] will probably know better, but I'm not sure 
all of that is the app ID?

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After the application finishes, check the Spark History Server REST API to get 
> details like jobs / executors etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> The REST API returns HTTP code 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15814) Aggregator can return null result

2016-06-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-15814.
---
Resolution: Resolved

> Aggregator can return null result
> -
>
> Key: SPARK-15814
> URL: https://issues.apache.org/jira/browse/SPARK-15814
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15163) Mark experimental algorithms experimental in PySpark

2016-06-13 Thread Krishna Kalyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327748#comment-15327748
 ] 

Krishna Kalyan commented on SPARK-15163:


Hi [~holdenk],
Is this task still up for grabs?

Thanks

> Mark experimental algorithms experimental in PySpark
> 
>
> Key: SPARK-15163
> URL: https://issues.apache.org/jira/browse/SPARK-15163
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> While we are going through them anyway, we might as well mark as experimental 
> the PySpark algorithms that are marked so in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Xuan Wang (JIRA)
Xuan Wang created SPARK-15924:
-

 Summary: SparkR parser bug with backslash in comments
 Key: SPARK-15924
 URL: https://issues.apache.org/jira/browse/SPARK-15924
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.1
Reporter: Xuan Wang


When I run an R cell with the following comments:
{code} 
#   p <- p + scale_fill_manual(values = set2[groups])
#   # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette 
= "Set2")
#   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
#   p
{code}

I get the following error message

Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
  :16:1: unexpected input
15: #   p <- p + scale_x_date(labels = date_format("%m/%d
16: %a"))
^




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Xuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Wang updated SPARK-15924:
--
Description: 
When I run an R cell with the following comments:
{code} 
#   p <- p + scale_fill_manual(values = set2[groups])
#   # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette 
= "Set2")
#   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
#   p
{code}

I get the following error message

{quote}
Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
  :16:1: unexpected input
15: #   p <- p + scale_x_date(labels = date_format("%m/%d
16: %a"))
^
{quote}

  was:
When I run an R cell with the following comments:
{code} 
#   p <- p + scale_fill_manual(values = set2[groups])
#   # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette 
= "Set2")
#   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
#   p
{code}

I get the following error message

Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
  :16:1: unexpected input
15: #   p <- p + scale_x_date(labels = date_format("%m/%d
16: %a"))
^



> SparkR parser bug with backslash in comments
> 
>
> Key: SPARK-15924
> URL: https://issues.apache.org/jira/browse/SPARK-15924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Xuan Wang
>
> When I run an R cell with the following comments:
> {code} 
> #   p <- p + scale_fill_manual(values = set2[groups])
> #   # p <- p + scale_fill_brewer(palette = "Set2") + 
> scale_color_brewer(palette = "Set2")
> #   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
> #   p
> {code}
> I get the following error message
> {quote}
> Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
>   :16:1: unexpected input
> 15: #   p <- p + scale_x_date(labels = date_format("%m/%d
> 16: %a"))
> ^
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Xuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Wang updated SPARK-15924:
--
Description: 
When I run an R cell with the following comments:
{code} 
#   p <- p + scale_fill_manual(values = set2[groups])
#   # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette 
= "Set2")
#   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
#   p
{code}

I get the following error message

{quote}
Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
  :16:1: unexpected input
15: #   p <- p + scale_x_date(labels = date_format("%m/%d
16: %a"))
^
{quote}

After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.


  was:
When I run an R cell with the following comments:
{code} 
#   p <- p + scale_fill_manual(values = set2[groups])
#   # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette 
= "Set2")
#   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
#   p
{code}

I get the following error message

{quote}
Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
  :16:1: unexpected input
15: #   p <- p + scale_x_date(labels = date_format("%m/%d
16: %a"))
^
{quote}


> SparkR parser bug with backslash in comments
> 
>
> Key: SPARK-15924
> URL: https://issues.apache.org/jira/browse/SPARK-15924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Xuan Wang
>
> When I run an R cell with the following comments:
> {code} 
> #   p <- p + scale_fill_manual(values = set2[groups])
> #   # p <- p + scale_fill_brewer(palette = "Set2") + 
> scale_color_brewer(palette = "Set2")
> #   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
> #   p
> {code}
> I get the following error message
> {quote}
> Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
>   :16:1: unexpected input
> 15: #   p <- p + scale_x_date(labels = date_format("%m/%d
> 16: %a"))
> ^
> {quote}
> After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15924) SparkR parser bug with backslash in comments

2016-06-13 Thread Xuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Wang updated SPARK-15924:
--
Description: 
When I run an R cell with the following comments:
{code} 
#   p <- p + scale_fill_manual(values = set2[groups])
#   # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette 
= "Set2")
#   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
#   p
{code}

I get the following error message

{quote}
  :16:1: unexpected input
15: #   p <- p + scale_x_date(labels = date_format("%m/%d
16: %a"))
^
{quote}

After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.


  was:
When I run an R cell with the following comments:
{code} 
#   p <- p + scale_fill_manual(values = set2[groups])
#   # p <- p + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette 
= "Set2")
#   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
#   p
{code}

I get the following error message

{quote}
Error in parse(text = DATABRICKS_CURRENT_TEMP_CMD__) : 
  :16:1: unexpected input
15: #   p <- p + scale_x_date(labels = date_format("%m/%d
16: %a"))
^
{quote}

After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.



> SparkR parser bug with backslash in comments
> 
>
> Key: SPARK-15924
> URL: https://issues.apache.org/jira/browse/SPARK-15924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Xuan Wang
>
> When I run an R cell with the following comments:
> {code} 
> #   p <- p + scale_fill_manual(values = set2[groups])
> #   # p <- p + scale_fill_brewer(palette = "Set2") + 
> scale_color_brewer(palette = "Set2")
> #   p <- p + scale_x_date(labels = date_format("%m/%d\n%a"))
> #   p
> {code}
> I get the following error message
> {quote}
>   :16:1: unexpected input
> 15: #   p <- p + scale_x_date(labels = date_format("%m/%d
> 16: %a"))
> ^
> {quote}
> After I remove the backslash in "date_format("%m/%d\n%a"))", it works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15913) Dispatcher.stopped should be enclosed by synchronized block.

2016-06-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-15913.

   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Dispatcher.stopped should be enclosed by synchronized block.
> 
>
> Key: SPARK-15913
> URL: https://issues.apache.org/jira/browse/SPARK-15913
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> Dispatcher.stopped is guarded by `this`, but it is used without 
> synchronization in the `postMessage` function. This issue fixes that.
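
For context, a minimal sketch of the pattern the description refers to (this is 
illustrative only, not the actual Spark Dispatcher code): the {{stopped}} flag is 
written and read only inside blocks synchronized on {{this}}.

{code}
class Dispatcher {
  // Guarded by `this`.
  private var stopped = false

  def stop(): Unit = synchronized {
    stopped = true
  }

  def postMessage(message: String): Unit = {
    // Read the flag under the same lock that guards its writes.
    val dropped = synchronized { stopped }
    if (dropped) println(s"Dropping message after stop: $message")
    else println(s"Delivering: $message")
  }
}
{code}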



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15826) PipedRDD to allow configurable char encoding (default: UTF-8)

2016-06-13 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15826:

Summary: PipedRDD to allow configurable char encoding (default: UTF-8)  
(was: PipedRDD to strictly use UTF-8 and not rely on default encoding)

> PipedRDD to allow configurable char encoding (default: UTF-8)
> -
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Trivial
>
> Encountered an issue wherein the same code works on one cluster but fails on 
> another for the same input. After debugging, I realised that PipedRDD picks up 
> the default char encoding from the JVM, which may differ across platforms. This 
> change makes it use UTF-8 encoding, just like `ScriptTransformation` does.
> Stack trace:
> {noformat}
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
>   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
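
For illustration, a minimal sketch (not the Spark source; the command and file 
path are made up) of reading a child process's output with an explicit charset 
instead of the platform default, which is the kind of change described above:

{code}
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

val proc = new ProcessBuilder("cat", "/tmp/input.txt").start()
// Decode the child's stdout as UTF-8 rather than the JVM default charset.
val reader = new BufferedReader(
  new InputStreamReader(proc.getInputStream, StandardCharsets.UTF_8))
Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
reader.close()
{code}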



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-06-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327803#comment-15327803
 ] 

Herman van Hovell commented on SPARK-15345:
---

[~m1lan] Just to be sure, is this the actual code you copied and pasted here? 
There is a typo: {{conf = SparkConrf()}} should be {{conf = SparkConf()}}.

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark 
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
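
On the ticket's title itself, here is a minimal sketch of the reported behaviour, assuming a local master and an illustrative configuration key (none of this is taken from the reporter's environment): when a SparkContext already exists, options set on SparkSession.builder may not take effect on affected builds.

{code}
// Sketch under assumptions: a SparkContext exists before the SparkSession is
// requested. On affected builds the extra builder configuration may be ignored
// because getOrCreate reuses the pre-existing context.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("pre-existing-context"))

val spark = SparkSession.builder()
  .enableHiveSupport()                                        // needs a -Phive build
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // illustrative key; may not take effect here
  .getOrCreate()

spark.sql("show databases").show()
{code}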

[jira] [Commented] (SPARK-15666) Join on two tables generated from a same table throwing query analyzer issue

2016-06-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327818#comment-15327818
 ] 

Herman van Hovell commented on SPARK-15666:
---

[~mkbond777] Is this also a problem on 2.0? Any chance you could provide a 
reproducible example?
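
In the spirit of that request, here is a hypothetical minimal sketch (all column names, data, and aggregations are assumptions, not taken from the reporter's job) of the general query shape described below: two DataFrames derived from the same parent, then joined on shared columns.

{code}
// Hypothetical sketch of the reported query shape: derive two DataFrames from
// one parent DataFrame and join them. Column names here are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[2]").appName("SPARK-15666-sketch").getOrCreate()
import spark.implicits._

val base = Seq(("hcp1", "brandA", 1.0), ("hcp2", "brandB", 2.0), ("hcp1", "brandA", 3.0))
  .toDF("hcp", "brand", "metric117")

val leftdf  = base.groupBy($"hcp", $"brand").agg(sum($"metric117").as("metric117_aggregated"))
val rightdf = base.select($"hcp", $"brand", $"metric117")

// The reporter describes an analyzer error on 1.6.1 for joins of this general
// shape; this sketch is not confirmed to reproduce it.
leftdf.join(rightdf, Seq("hcp", "brand")).show()
{code}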

> Join on two tables generated from a same table throwing query analyzer issue
> 
>
> Key: SPARK-15666
> URL: https://issues.apache.org/jira/browse/SPARK-15666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: AWS EMR
>Reporter: Manish Kumar
>Priority: Blocker
>
> If two dataframes (named leftdf and rightdf), which are created by performing 
> some operations on a single dataframe, are joined, then we get an 
> analyzer issue:
> leftdf schema
> {noformat}
> root
>  |-- affinity_monitor_copay: string (nullable = true)
>  |-- affinity_monitor_digital_pull: string (nullable = true)
>  |-- affinity_monitor_digital_push: string (nullable = true)
>  |-- affinity_monitor_direct: string (nullable = true)
>  |-- affinity_monitor_peer: string (nullable = true)
>  |-- affinity_monitor_peer_interaction: string (nullable = true)
>  |-- affinity_monitor_personal_f2f: string (nullable = true)
>  |-- affinity_monitor_personal_remote: string (nullable = true)
>  |-- affinity_monitor_sample: string (nullable = true)
>  |-- affinity_monitor_voucher: string (nullable = true)
>  |-- afltn_id: string (nullable = true)
>  |-- attribute_2_value: string (nullable = true)
>  |-- brand: string (nullable = true)
>  |-- city: string (nullable = true)
>  |-- cycle_time_id: integer (nullable = true)
>  |-- full_name: string (nullable = true)
>  |-- hcp: string (nullable = true)
>  |-- like17_mg17_metric114_aggregated: double (nullable = true)
>  |-- like17_mg17_metric118_aggregated: double (nullable = true)
>  |-- metric_group_sk: integer (nullable = true)
>  |-- metrics: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- hcp: string (nullable = true)
>  |||-- brand: string (nullable = true)
>  |||-- rep: string (nullable = true)
>  |||-- month: string (nullable = true)
>  |||-- metric117: string (nullable = true)
>  |||-- metric114: string (nullable = true)
>  |||-- metric118: string (nullable = true)
>  |||-- specialty_1: string (nullable = true)
>  |||-- full_name: string (nullable = true)
>  |||-- pri_st: string (nullable = true)
>  |||-- city: string (nullable = true)
>  |||-- zip_code: string (nullable = true)
>  |||-- prsn_id: string (nullable = true)
>  |||-- afltn_id: string (nullable = true)
>  |||-- npi_id: string (nullable = true)
>  |||-- affinity_monitor_sample: string (nullable = true)
>  |||-- affinity_monitor_personal_f2f: string (nullable = true)
>  |||-- affinity_monitor_peer: string (nullable = true)
>  |||-- affinity_monitor_copay: string (nullable = true)
>  |||-- affinity_monitor_digital_push: string (nullable = true)
>  |||-- affinity_monitor_voucher: string (nullable = true)
>  |||-- affinity_monitor_direct: string (nullable = true)
>  |||-- affinity_monitor_peer_interaction: string (nullable = true)
>  |||-- affinity_monitor_digital_pull: string (nullable = true)
>  |||-- affinity_monitor_personal_remote: string (nullable = true)
>  |||-- attribute_2_value: string (nullable = true)
>  |||-- metric211: double (nullable = false)
>  |-- mg17_metric117_3: double (nullable = true)
>  |-- mg17_metric117_3_actual_metric: double (nullable = true)
>  |-- mg17_metric117_3_planned_metric: double (nullable = true)
>  |-- mg17_metric117_D_suggestion_id: integer (nullable = true)
>  |-- mg17_metric117_D_suggestion_text: string (nullable = true)
>  |-- mg17_metric117_D_suggestion_text_raw: string (nullable = true)
>  |-- mg17_metric117_exp_score: integer (nullable = true)
>  |-- mg17_metric117_severity_index: double (nullable = true)
>  |-- mg17_metric117_test: integer (nullable = true)
>  |-- mg17_metric211_P_suggestion_id: integer (nullable = true)
>  |-- mg17_metric211_P_suggestion_text: string (nullable = true)
>  |-- mg17_metric211_P_suggestion_text_raw: string (nullable = true)
>  |-- mg17_metric211_aggregated: double (nullable = false)
>  |-- mg17_metric211_deviationfrompeers_p_value: double (nullable = true)
>  |-- mg17_metric211_deviationfromtrend_current_mu: double (nullable = true)
>  |-- mg17_metric211_deviationfromtrend_p_value: double (nullable = true)
>  |-- mg17_metric211_deviationfromtrend_previous_mu: double (nullable = true)
>  |-- mg17_metric211_exp_score: integer (nullable = tru
