[jira] [Resolved] (SPARK-11273) ArrayData and MapData shouldn't be public in types package

2015-10-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11273.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> ArrayData and MapData shouldn't be public in types package
> --
>
> Key: SPARK-11273
> URL: https://issues.apache.org/jira/browse/SPARK-11273
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Sem Mulder (JIRA)
Sem Mulder created SPARK-11276:
--

 Summary: SizeEstimator prevents class unloading
 Key: SPARK-11276
 URL: https://issues.apache.org/jira/browse/SPARK-11276
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 1.5.1
Reporter: Sem Mulder


The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class 
objects as keys, which creates strong references to those Class objects.
If the classes are dynamically created, this prevents the corresponding 
ClassLoader from being GCed, leading to PermGen exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.
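As an illustration of the weak-keyed cache idea, here is a minimal sketch (not 
the proposed patch itself; the ClassInfo value type and size computation are 
placeholders):

{code}
import java.util.Collections
import java.util.WeakHashMap

// Illustrative stand-in for SizeEstimator's per-class metadata.
case class ClassInfo(shellSize: Long)

// WeakHashMap holds its keys by weak reference, so a Class[_] key (and hence
// its ClassLoader) can still be garbage-collected once nothing else uses it.
val classInfos: java.util.Map[Class[_], ClassInfo] =
  Collections.synchronizedMap(new WeakHashMap[Class[_], ClassInfo]())

def classInfoFor(cls: Class[_]): ClassInfo = {
  val cached = classInfos.get(cls)
  if (cached != null) cached
  else {
    val info = ClassInfo(shellSize = 16L) // placeholder for the real estimate
    classInfos.put(cls, info)
    info
  }
}
{code}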







[jira] [Updated] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Sem Mulder (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sem Mulder updated SPARK-11276:
---
Description: 
The SizeEstimator keeps a cache of ClassInfos but this cache uses Class objects 
as keys which results in strong references to these class objects.
If these classes are dynamically created this prevents the corresponding 
ClassLoader from being GCed. Leading to PermGen exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.


  was:
The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class 
objects as keys which results in strong references to these class objects.
If these classes are dynamically created this prevents the corresponding 
ClassLoader from being GCed. Leading to PermGen exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.



> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>
> The SizeEstimator keeps a cache of ClassInfos but this cache uses Class 
> objects as keys which results in strong references to these class objects.
> If these classes are dynamically created this prevents the corresponding 
> ClassLoader from being GCed. Leading to PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.






[jira] [Updated] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Sem Mulder (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sem Mulder updated SPARK-11276:
---
Description: 
The SizeEstimator keeps a cache of ClassInfos but this cache uses Class objects 
as keys.
Which results in strong references to these class objects. If these classes are 
dynamically created
this prevents the corresponding ClassLoader from being GCed. Leading to PermGen 
exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.


  was:
The SizeEstimator keeps a cache of ClassInfos but this cache uses Class objects 
as keys which results in strong references to these class objects.
If these classes are dynamically created this prevents the corresponding 
ClassLoader from being GCed. Leading to PermGen exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.



> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>
> The SizeEstimator keeps a cache of ClassInfos but this cache uses Class 
> objects as keys.
> Which results in strong references to these class objects. If these classes 
> are dynamically created
> this prevents the corresponding ClassLoader from being GCed. Leading to 
> PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.






[jira] [Updated] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Sem Mulder (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sem Mulder updated SPARK-11276:
---
Description: 
The SizeEstimator keeps a cache of ClassInfos but this cache uses Class objects 
as keys.
Which results in strong references to the Class objects. If these classes are 
dynamically created
this prevents the corresponding ClassLoader from being GCed. Leading to PermGen 
exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.


  was:
The SizeEstimator keeps a cache of ClassInfos but this cache uses Class objects 
as keys.
Which results in strong references to these class objects. If these classes are 
dynamically created
this prevents the corresponding ClassLoader from being GCed. Leading to PermGen 
exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.



> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>
> The SizeEstimator keeps a cache of ClassInfos but this cache uses Class 
> objects as keys.
> Which results in strong references to the Class objects. If these classes are 
> dynamically created
> this prevents the corresponding ClassLoader from being GCed. Leading to 
> PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.






[jira] [Updated] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Sem Mulder (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sem Mulder updated SPARK-11276:
---
Description: 
The SizeEstimator keeps a cache of ClassInfos but this cache uses Class objects 
as keys.
Which results in strong references to the Class objects. If these classes are 
dynamically created
this prevents the corresponding ClassLoader from being GCed. Leading to PermGen 
exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.

I will prepare a pull request ASAP.

  was:
The SizeEstimator keeps a cache of ClassInfos but this cache uses Class objects 
as keys.
Which results in strong references to the Class objects. If these classes are 
dynamically created
this prevents the corresponding ClassLoader from being GCed. Leading to PermGen 
exhaustion.

An easy fix would be to use a WeakRef for the keys. A proposed fix can be found 
here:
[https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
We are currently running this in production and it seems to resolve the issue.



> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>
> The SizeEstimator keeps a cache of ClassInfos but this cache uses Class 
> objects as keys.
> Which results in strong references to the Class objects. If these classes are 
> dynamically created
> this prevents the corresponding ClassLoader from being GCed. Leading to 
> PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.
> I will prepare a pull request ASAP.






[jira] [Commented] (SPARK-11234) What's cooking classification

2015-10-23 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970596#comment-14970596
 ] 

Xusen Yin commented on SPARK-11234:
---

[~mengxr] I added the cooking classification code here: 
https://gist.github.com/yinxusen/ad4372b8c0af5ae54a4a

Here is what I found:

1. Currently, JSON files with multi-line records are hard to handle; I had to 
load the data with JsonInputFormat from the json-pxf-ext package.

2. StringIndexer is easy to use, but it is hard to go beyond the existing 
transformers. For example, in the code, when I want to add together all vectors 
that belong to the same id, I have to write an aggregate function.

3. ParamGridBuilder accepts discrete parameter candidates, but I have to guess 
values such as Array(1.0, 0.1, 0.01). I don't know which value is suitable or 
how to fill in the array to get a better result. How about accepting a range of 
real numbers, e.g. [0.0001, 1], so that ParamGridBuilder can generate 
candidates for me? (See the sketch after this list.)

4. The evaluator forces me to select a single metric, but sometimes I want to 
see all the evaluation results, say F1, precision-recall, AUC, etc.

5. ML transformers get stuck when given Int-typed columns. It's strange that we 
have to convert all Int values to Double beforehand. I think smart automatic 
casting would be helpful.
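For context on point 3 above, here is a minimal sketch of today's discrete-grid 
API (the estimator and candidate values are illustrative, not taken from the 
gist):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()

// Today the caller must enumerate candidates explicitly; the values below are
// guesses, which is exactly the pain point described in point 3.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(1.0, 0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()
{code}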

> What's cooking classification
> -
>
> Key: SPARK-11234
> URL: https://issues.apache.org/jira/browse/SPARK-11234
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>
> I add the subtask to post the work on this dataset:  
> https://www.kaggle.com/c/whats-cooking






[jira] [Assigned] (SPARK-11271) MapStatus too large for driver

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11271:


Assignee: Apache Spark

> MapStatus too large for driver
> --
>
> Key: SPARK-11271
> URL: https://issues.apache.org/jira/browse/SPARK-11271
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Kent Yao
>Assignee: Apache Spark
>
> When I run a Spark job that contains a very large number of tasks (in my case 
> 200k map tasks * 200k reduce tasks), the driver hits an OOM caused mainly by 
> the MapStatus objects: the RoaringBitmap used to mark which blocks are empty 
> seems to use too much memory.
> I tried using org.apache.spark.util.collection.BitSet instead of 
> RoaringBitmap, and it saves about 20% of that memory.
> For the 200K-task job: 
> RoaringBitmap uses 3 Long[1024] and 1 Short[3392] 
> = 3*64*1024 + 16*3392 = 250880 (bits) 
> BitSet uses 1 Long[3125] = 3125*64 = 200000 (bits) 
> Memory saved = (250880 - 200000) / 250880 ≈ 20%
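A quick sanity check of the arithmetic above (plain Scala, independent of 
Spark):

{code}
// Bits used by the RoaringBitmap layout quoted above: 3 Long[1024] + 1 Short[3392].
val roaringBits = 3 * 64 * 1024 + 16 * 3392        // 250880
// Bits used by a flat BitSet backed by Long[3125].
val bitSetBits  = 3125 * 64                        // 200000
val saved = (roaringBits - bitSetBits).toDouble / roaringBits
println(f"Memory saved ≈ ${saved * 100}%.1f%%")    // prints about 20.3%
{code}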






[jira] [Assigned] (SPARK-11271) MapStatus too large for driver

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11271:


Assignee: (was: Apache Spark)

> MapStatus too large for driver
> --
>
> Key: SPARK-11271
> URL: https://issues.apache.org/jira/browse/SPARK-11271
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Kent Yao
>
> When I run a Spark job that contains a very large number of tasks (in my case 
> 200k map tasks * 200k reduce tasks), the driver hits an OOM caused mainly by 
> the MapStatus objects: the RoaringBitmap used to mark which blocks are empty 
> seems to use too much memory.
> I tried using org.apache.spark.util.collection.BitSet instead of 
> RoaringBitmap, and it saves about 20% of that memory.
> For the 200K-task job: 
> RoaringBitmap uses 3 Long[1024] and 1 Short[3392] 
> = 3*64*1024 + 16*3392 = 250880 (bits) 
> BitSet uses 1 Long[3125] = 3125*64 = 200000 (bits) 
> Memory saved = (250880 - 200000) / 250880 ≈ 20%






[jira] [Commented] (SPARK-11271) MapStatus too large for driver

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970626#comment-14970626
 ] 

Apache Spark commented on SPARK-11271:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9243

> MapStatus too large for driver
> --
>
> Key: SPARK-11271
> URL: https://issues.apache.org/jira/browse/SPARK-11271
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Kent Yao
>
> When I run a Spark job that contains a very large number of tasks (in my case 
> 200k map tasks * 200k reduce tasks), the driver hits an OOM caused mainly by 
> the MapStatus objects: the RoaringBitmap used to mark which blocks are empty 
> seems to use too much memory.
> I tried using org.apache.spark.util.collection.BitSet instead of 
> RoaringBitmap, and it saves about 20% of that memory.
> For the 200K-task job: 
> RoaringBitmap uses 3 Long[1024] and 1 Short[3392] 
> = 3*64*1024 + 16*3392 = 250880 (bits) 
> BitSet uses 1 Long[3125] = 3125*64 = 200000 (bits) 
> Memory saved = (250880 - 200000) / 250880 ≈ 20%






[jira] [Commented] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970629#comment-14970629
 ] 

Apache Spark commented on SPARK-11276:
--

User 'SemMulder' has created a pull request for this issue:
https://github.com/apache/spark/pull/9244

> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>
> The SizeEstimator keeps a cache of ClassInfos but this cache uses Class 
> objects as keys.
> Which results in strong references to the Class objects. If these classes are 
> dynamically created
> this prevents the corresponding ClassLoader from being GCed. Leading to 
> PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.
> I will prepare a pull request ASAP.






[jira] [Assigned] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11276:


Assignee: (was: Apache Spark)

> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>
> The SizeEstimator keeps a cache of ClassInfos but this cache uses Class 
> objects as keys.
> Which results in strong references to the Class objects. If these classes are 
> dynamically created
> this prevents the corresponding ClassLoader from being GCed. Leading to 
> PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.
> I will prepare a pull request ASAP.






[jira] [Assigned] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11276:


Assignee: Apache Spark

> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>Assignee: Apache Spark
>
> The SizeEstimator keeps a cache of ClassInfos but this cache uses Class 
> objects as keys.
> Which results in strong references to the Class objects. If these classes are 
> dynamically created
> this prevents the corresponding ClassLoader from being GCed. Leading to 
> PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.
> I will prepare a pull request ASAP.






[jira] [Assigned] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10947:


Assignee: Apache Spark

> With schema inference from JSON into a Dataframe, add option to infer all 
> primitive object types as strings
> ---
>
> Key: SPARK-10947
> URL: https://issues.apache.org/jira/browse/SPARK-10947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ewan Leith
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when a schema is inferred from a JSON file using 
> sqlContext.read.json, the primitive object types are inferred as string, 
> long, boolean, etc.
> However, if the inferred type is too specific (JSON obviously does not 
> enforce types itself), this causes issues with merging dataframe schemas.
> Instead, we would like an option in the JSON inferField function to treat all 
> primitive objects as strings.
> We'll create and submit a pull request for this for review.
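For illustration, such an option could be exposed through the DataFrameReader 
options, along these lines (the option name "primitivesAsString" and the file 
path are assumptions for the sketch, not the final API):

{code}
// Read JSON but keep every primitive field as StringType, so that schemas
// inferred from differently typed files can be merged without conflicts.
val df = sqlContext.read
  .option("primitivesAsString", "true")   // hypothetical option name
  .json("examples/src/main/resources/people.json")

df.printSchema()   // all leaf fields reported as string
{code}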






[jira] [Commented] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970631#comment-14970631
 ] 

Apache Spark commented on SPARK-10947:
--

User 'stephend-realitymine' has created a pull request for this issue:
https://github.com/apache/spark/pull/9245

> With schema inference from JSON into a Dataframe, add option to infer all 
> primitive object types as strings
> ---
>
> Key: SPARK-10947
> URL: https://issues.apache.org/jira/browse/SPARK-10947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ewan Leith
>Priority: Minor
>
> Currently, when a schema is inferred from a JSON file using 
> sqlContext.read.json, the primitive object types are inferred as string, 
> long, boolean, etc.
> However, if the inferred type is too specific (JSON obviously does not 
> enforce types itself), this causes issues with merging dataframe schemas.
> Instead, we would like an option in the JSON inferField function to treat all 
> primitive objects as strings.
> We'll create and submit a pull request for this for review.






[jira] [Assigned] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10947:


Assignee: (was: Apache Spark)

> With schema inference from JSON into a Dataframe, add option to infer all 
> primitive object types as strings
> ---
>
> Key: SPARK-10947
> URL: https://issues.apache.org/jira/browse/SPARK-10947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ewan Leith
>Priority: Minor
>
> Currently, when a schema is inferred from a JSON file using 
> sqlContext.read.json, the primitive object types are inferred as string, 
> long, boolean, etc.
> However, if the inferred type is too specific (JSON obviously does not 
> enforce types itself), this causes issues with merging dataframe schemas.
> Instead, we would like an option in the JSON inferField function to treat all 
> primitive objects as strings.
> We'll create and submit a pull request for this for review.






[jira] [Created] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Jia Li (JIRA)
Jia Li created SPARK-11277:
--

 Summary: sort_array throws exception scala.MatchError
 Key: SPARK-11277
 URL: https://issues.apache.org/jira/browse/SPARK-11277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
 Environment: Linux
Reporter: Jia Li
Priority: Minor


I was trying out the sort_array function and hit this exception. 

I looked into the Spark source code and found that the root cause is that 
sort_array does not check for an array of NULLs. It's not meaningful to sort an 
array consisting entirely of NULLs anyway.
I already have a fix for this issue and I'm going to create a pull request for 
it. 
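For illustration only, a minimal type guard in the spirit of such a fix might 
look like this (a hypothetical sketch using public Spark SQL types, not the 
actual patch):

{code}
import org.apache.spark.sql.types.{ArrayType, DataType, NullType}

// Refuse to build a comparator for an array whose element type is NullType,
// since every element is NULL and there is nothing meaningful to compare.
def sortableArrayType(dt: DataType): Boolean = dt match {
  case ArrayType(NullType, _) => false
  case ArrayType(_, _)        => true
  case _                      => false
}
{code}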

scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
scala.MatchError: ArrayType(NullType,true) (of class 
org.apache.spark.sql.types.ArrayType)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:75)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:85)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:89)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:89)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:93)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   

[jira] [Updated] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Jia Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Li updated SPARK-11277:
---
Description: 
I was trying out the sort_array function then hit this exception. 

I looked into the spark source code. I found the root cause is that sort_array 
does not check for an array of NULLs. It's not meaningful to sort an array of 
entirely NULLs anyway.
I already have a fix for this issue and I'm going to create a pull request for 
it. 

scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
scala.MatchError: ArrayType(NullType,true) (of class 
org.apache.spark.sql.types.ArrayType)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)


  was:
I was trying out the sort_array function then hit this exception. 

I looked into the spark source code. I found the root cause is that sort_array 
does not check for an array of NULLs. It's not meaningful to sort an array of 
entirely NULLs anyway.
I already have a fix for this issue and I'm going to create a pull request for 
it. 

scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
scala.MatchError: ArrayType(NullType,true) (of class 
org.apache.spark.sql.types.ArrayType)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
at 
org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

[jira] [Commented] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Jia Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970654#comment-14970654
 ] 

Jia Li commented on SPARK-11277:


I already have a fix for this issue and I'm going to create a pull request for 
it. 

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
>Priority: Minor
>
> I was trying out the sort_array function then hit this exception. 
> I looked into the spark source code. I found the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array of entirely NULLs anyway.
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   






[jira] [Commented] (SPARK-5210) Support log rolling in EventLogger

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970670#comment-14970670
 ] 

Apache Spark commented on SPARK-5210:
-

User 'XuTingjun' has created a pull request for this issue:
https://github.com/apache/spark/pull/9246

> Support log rolling in EventLogger
> --
>
> Key: SPARK-5210
> URL: https://issues.apache.org/jira/browse/SPARK-5210
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Reporter: Josh Rosen
>
> For long-running Spark applications (e.g. running for days / weeks), the 
> Spark event log may grow to be very large.
> As a result, it would be useful if EventLoggingListener supported log file 
> rolling / rotation.  Adding this feature will involve changes to the 
> HistoryServer in order to be able to load event logs from a sequence of files 
> instead of a single file.






[jira] [Created] (SPARK-11278) PageRank fails with unified memory manager

2015-10-23 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-11278:


 Summary: PageRank fails with unified memory manager
 Key: SPARK-11278
 URL: https://issues.apache.org/jira/browse/SPARK-11278
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2, 1.6.0
Reporter: Nishkam Ravi


PageRank (6 nodes, 32GB input) runs very slowly and eventually fails with 
ExecutorLostFailure. Traced it back to the 'unified memory manager' commit from 
Oct 13th. Took a quick look at the code and couldn't see the problem (changes 
look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to spot the 
problem quickly. Can be reproduced by running PageRank on a large enough input 
dataset if needed. Sorry for not being of much help here.






[jira] [Created] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11279:
--

 Summary: Add DataFrame#toDF in PySpark
 Key: SPARK-11279
 URL: https://issues.apache.org/jira/browse/SPARK-11279
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Jeff Zhang
Priority: Minor









[jira] [Assigned] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11277:


Assignee: Apache Spark

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
>Assignee: Apache Spark
>Priority: Minor
>
> I was trying out the sort_array function then hit this exception. 
> I looked into the spark source code. I found the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array of entirely NULLs anyway.
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   






[jira] [Commented] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970694#comment-14970694
 ] 

Apache Spark commented on SPARK-11277:
--

User 'jliwork' has created a pull request for this issue:
https://github.com/apache/spark/pull/9247

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
>Priority: Minor
>
> I was trying out the sort_array function then hit this exception. 
> I looked into the spark source code. I found the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array of entirely NULLs anyway.
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   






[jira] [Assigned] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11277:


Assignee: (was: Apache Spark)

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
>Priority: Minor
>
> I was trying out the sort_array function then hit this exception. 
> I looked into the spark source code. I found the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array of entirely NULLs anyway.
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   






[jira] [Assigned] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11279:


Assignee: (was: Apache Spark)

> Add DataFrame#toDF in PySpark
> -
>
> Key: SPARK-11279
> URL: https://issues.apache.org/jira/browse/SPARK-11279
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>







[jira] [Assigned] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11279:


Assignee: Apache Spark

> Add DataFrame#toDF in PySpark
> -
>
> Key: SPARK-11279
> URL: https://issues.apache.org/jira/browse/SPARK-11279
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970705#comment-14970705
 ] 

Apache Spark commented on SPARK-11279:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9248

> Add DataFrame#toDF in PySpark
> -
>
> Key: SPARK-11279
> URL: https://issues.apache.org/jira/browse/SPARK-11279
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>







[jira] [Created] (SPARK-11280) Mesos cluster deployment using only one node

2015-10-23 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-11280:
-

 Summary: Mesos cluster deployment using only one node
 Key: SPARK-11280
 URL: https://issues.apache.org/jira/browse/SPARK-11280
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.5.1, 1.6.0
Reporter: Iulian Dragos


I submit the SparkPi example in Mesos cluster mode, and I notice that all tasks 
fail except the ones that run on the same node as the driver. The others fail 
with

{code}
sh: 1: 
/tmp/mesos/slaves/1521e408-d8fe-416d-898b-3801e73a8293-S0/frameworks/1521e408-d8fe-416d-898b-3801e73a8293-0003/executors/driver-20151023113121-0006/runs/2abefd29-7386-4d81-a025-9d794780db23/spark-1.5.0-bin-hadoop2.6/bin/spark-class:
 not found
{code}

The path exists only on the machine that launched the driver, and the sandbox 
of the executor where this task died is completely empty.

I launch the task like this:

{code}
 $ spark-submit --deploy-mode cluster --master mesos://sagitarius.local:7077 
--conf 
spark.executor.uri="ftp://sagitarius.local/ftp/spark-1.5.0-bin-hadoop2.6.tgz"; 
--conf spark.mesos.coarse=true --class org.apache.spark.examples.SparkPi 
ftp://sagitarius.local/ftp/spark-examples-1.5.0-hadoop2.6.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request to launch an 
application in mesos://sagitarius.local:7077.
15/10/23 11:31:21 INFO RestSubmissionClient: Submission successfully created as 
driver-20151023113121-0006. Polling submission state...
15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request for the 
status of submission driver-20151023113121-0006 in 
mesos://sagitarius.local:7077.
15/10/23 11:31:21 INFO RestSubmissionClient: State of driver 
driver-20151023113121-0006 is now QUEUED.
15/10/23 11:31:21 INFO RestSubmissionClient: Server responded with 
CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151023113121-0006",
  "success" : true
}
{code}

I can see the driver in the Dispatcher UI and the job succeeds eventually, but 
running only on the node where the driver was launched (see attachment).








[jira] [Updated] (SPARK-11280) Mesos cluster deployment using only one node

2015-10-23 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos updated SPARK-11280:
--
Attachment: Screen Shot 2015-10-23 at 11.37.43.png

> Mesos cluster deployment using only one node
> 
>
> Key: SPARK-11280
> URL: https://issues.apache.org/jira/browse/SPARK-11280
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Iulian Dragos
> Attachments: Screen Shot 2015-10-23 at 11.37.43.png
>
>
> I submit the SparkPi example in Mesos cluster mode, and I notice that all 
> tasks fail except the ones that run on the same node as the driver. The 
> others fail with
> {code}
> sh: 1: 
> /tmp/mesos/slaves/1521e408-d8fe-416d-898b-3801e73a8293-S0/frameworks/1521e408-d8fe-416d-898b-3801e73a8293-0003/executors/driver-20151023113121-0006/runs/2abefd29-7386-4d81-a025-9d794780db23/spark-1.5.0-bin-hadoop2.6/bin/spark-class:
>  not found
> {code}
> The path exists only on the machine that launched the driver, and the sandbox 
> of the executor where this task died is completely empty.
> I launch the task like this:
> {code}
>  $ spark-submit --deploy-mode cluster --master mesos://sagitarius.local:7077 
> --conf 
> spark.executor.uri="ftp://sagitarius.local/ftp/spark-1.5.0-bin-hadoop2.6.tgz"; 
> --conf spark.mesos.coarse=true --class org.apache.spark.examples.SparkPi 
> ftp://sagitarius.local/ftp/spark-examples-1.5.0-hadoop2.6.0.jar
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://sagitarius.local:7077.
> 15/10/23 11:31:21 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151023113121-0006. Polling submission state...
> 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151023113121-0006 in 
> mesos://sagitarius.local:7077.
> 15/10/23 11:31:21 INFO RestSubmissionClient: State of driver 
> driver-20151023113121-0006 is now QUEUED.
> 15/10/23 11:31:21 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151023113121-0006",
>   "success" : true
> }
> {code}
> I can see the driver in the Dispatcher UI and the job succeeds eventually, 
> but running only on the node where the driver was launched (see attachment).






[jira] [Updated] (SPARK-11259) Params.validateParams() should be called automatically

2015-10-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11259:

Description: 
Params.validateParams() is currently not called automatically. For example, the 
following code snippet does not throw an exception, which is not the expected behavior.
{code}
val df = sqlContext.createDataFrame(
  Seq(
(1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
(2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
(3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
(4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
).toDF("id", "features", "label")

val scaler = new MinMaxScaler()
 .setInputCol("features")
 .setOutputCol("features_scaled")
 .setMin(10)
 .setMax(0)
val pipeline = new Pipeline().setStages(Array(scaler))
pipeline.fit(df)
{code}
validateParams() should be called by 
PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to 
put it in transformSchema(). 
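A minimal sketch of the proposed hook (hypothetical trait and method names, not the actual Spark patch; only StructType is the real API):

{code}
// Hypothetical sketch only: the real change would go into PipelineStage.transformSchema.
// The point is that parameter validation runs before any schema work, so an invalid
// setting such as min > max fails at pipeline.fit() time instead of being silently accepted.
import org.apache.spark.sql.types.StructType

trait ValidatingStageSketch {
  def validateParams(): Unit                                   // e.g. require(min < max, ...)
  protected def deriveSchema(schema: StructType): StructType   // stage-specific schema logic

  final def transformSchema(schema: StructType): StructType = {
    validateParams()            // proposed: validate first
    deriveSchema(schema)
  }
}
{code}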

  was:
Params.validateParams() not be called automatically currently. Such as the 
following code snippet will not throw exception which is not as expected.
{code}
val df = sqlContext.createDataFrame(
  Seq(
(1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
(2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
(3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
(4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
).toDF("id", "features", "label")

val scaler = new MinMaxScaler()
 .setInputCol("features")
 .setOutputCol("features_scaled")
 .setMin(10)
 .setMax(0)
val pipeline = new Pipeline().setStages(Array(scaler))
pipeline.fit(df)
{code}
validateParams() should be called by 
PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to 
put it in transformSchema().


> Params.validateParams() should be called automatically
> --
>
> Key: SPARK-11259
> URL: https://issues.apache.org/jira/browse/SPARK-11259
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> Params.validateParams() is currently not called automatically. For example, 
> the following code snippet does not throw an exception, which is not the expected behavior.
> {code}
> val df = sqlContext.createDataFrame(
>   Seq(
> (1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
> (2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
> (3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
> (4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
> ).toDF("id", "features", "label")
> val scaler = new MinMaxScaler()
>  .setInputCol("features")
>  .setOutputCol("features_scaled")
>  .setMin(10)
>  .setMax(0)
> val pipeline = new Pipeline().setStages(Array(scaler))
> pipeline.fit(df)
> {code}
> validateParams() should be called by 
> PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to 
> put it in transformSchema(). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-11229:
---

"Fixed" implies there was a change attached to this JIRA that resolved the 
issue, and we don't have that here. If it were probably resolved by another 
JIRA, "duplicate" would be appropriate. Otherwise, *shrug* doesn't really 
matter but "cannot reproduce" is maybe most accurate. 

> NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
> -
>
> Key: SPARK-11229
> URL: https://issues.apache.org/jira/browse/SPARK-11229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>Reporter: Romi Kuntsman
>
> Steps to reproduce:
> 1. set spark.shuffle.memoryFraction=0
> 2. load dataframe from parquet file
> 3. see it's read correctly by calling dataframe.show()
> 4. call dataframe.count()
> Expected behaviour:
> get count of rows in dataframe
> OR, if memoryFraction=0 is an invalid setting, get notified about it
> Actual behaviour:
> CatalystReadSupport doesn't read the schema (even though there is one) and 
> then there's a NullPointerException.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterat

[jira] [Resolved] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11229.
---
   Resolution: Cannot Reproduce
Fix Version/s: (was: 1.6.0)

> NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
> -
>
> Key: SPARK-11229
> URL: https://issues.apache.org/jira/browse/SPARK-11229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>Reporter: Romi Kuntsman
>
> Steps to reproduce:
> 1. set spark.shuffle.memoryFraction=0
> 2. load dataframe from parquet file
> 3. see it's read correctly by calling dataframe.show()
> 4. call dataframe.count()
> Expected behaviour:
> get count of rows in dataframe
> OR, if memoryFraction=0 is an invalid setting, get notified about it
> Actual behaviour:
> CatalystReadSupport doesn't read the schema (even though there is one) and 
> then there's a NullPointerException.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregat

[jira] [Updated] (SPARK-7021) JUnit output for Python tests

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7021:
-
Assignee: Gabor Liptak

> JUnit output for Python tests
> -
>
> Key: SPARK-7021
> URL: https://issues.apache.org/jira/browse/SPARK-7021
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Brennon York
>Assignee: Gabor Liptak
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>
> Currently Python returns its test output in its own format. It would be 
> preferable if the Python test runner could output its test results in JUnit 
> format, to better match the rest of the Jenkins test output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11262:
--
Target Version/s:   (was: 1.6.0)
   Fix Version/s: (was: 1.5.1)

[~avulanov] don't set Fix/Target version please

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Multi-layer perceptron requires more rigorous tests and refactoring of the layer 
> interfaces to accommodate development of new features.
> 1) Implement unit tests for gradient and loss (a gradient-check sketch follows below)
> 2) Refactor the internal layer interface to extract the "loss function"
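A hypothetical sketch of the kind of gradient check item 1) calls for: compare an analytic gradient against central finite differences (plain Scala, all names are illustrative, not the MLP internals):

{code}
// Sketch only: verify an analytic gradient of a loss function numerically.
// A unit test would assert that the two gradients agree within a small tolerance.
object GradientCheckSketch {
  def numericalGradient(loss: Array[Double] => Double,
                        x: Array[Double],
                        eps: Double = 1e-6): Array[Double] =
    x.indices.map { i =>
      val xPlus  = x.clone(); xPlus(i)  += eps
      val xMinus = x.clone(); xMinus(i) -= eps
      (loss(xPlus) - loss(xMinus)) / (2 * eps)   // central difference
    }.toArray

  def main(args: Array[String]): Unit = {
    // toy example: squared error loss L(w) = 0.5 * ||w||^2, analytic gradient = w
    val loss = (w: Array[Double]) => 0.5 * w.map(v => v * v).sum
    val w = Array(0.3, -1.2, 2.0)
    val numeric = numericalGradient(loss, w)
    val maxDiff = w.zip(numeric).map { case (a, n) => math.abs(a - n) }.max
    assert(maxDiff < 1e-4, s"gradient check failed, max diff = $maxDiff")
  }
}
{code}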



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11270) Add improved equality testing for TopicAndPartition from the Kafka Streaming API

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11270:
--
Target Version/s:   (was: 1.5.1)
   Fix Version/s: (was: 1.5.1)

[~manygrams] have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  Among 
other things, don't set Target/Fix version.

> Add improved equality testing for TopicAndPartition from the Kafka Streaming 
> API
> 
>
> Key: SPARK-11270
> URL: https://issues.apache.org/jira/browse/SPARK-11270
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Affects Versions: 1.5.1
>Reporter: Nick Evans
>Priority: Minor
>
> Hey, sorry, new to contributing to Spark! Let me know if I'm doing anything 
> wrong.
> This issue relates to equality testing of TopicAndPartition objects: the proposed 
> change lets you test that the topics and partitions of two such objects are equal, 
> as opposed to checking that the two objects are the same instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11267) NettyRpcEnv and sparkDriver services report the same port in the logs

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11267:
--
Component/s: Spark Core

> NettyRpcEnv and sparkDriver services report the same port in the logs
> -
>
> Key: SPARK-11267
> URL: https://issues.apache.org/jira/browse/SPARK-11267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Environment: the version built from today's sources - Spark version 
> 1.6.0-SNAPSHOT
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When starting {{./bin/spark-shell --conf spark.driver.port=}} Spark 
> reports two services - NettyRpcEnv and sparkDriver - using the same {{}} 
> port:
> {code}
> 15/10/22 23:09:32 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
> 15/10/22 23:09:32 INFO SparkContext: Spark configuration:
> spark.app.name=Spark shell
> spark.driver.port=
> spark.home=/Users/jacek/dev/oss/spark
> spark.jars=
> spark.logConf=true
> spark.master=local[*]
> spark.repl.class.uri=http://192.168.1.4:52645
> spark.submit.deployMode=client
> ...
> 15/10/22 23:09:33 INFO Utils: Successfully started service 'NettyRpcEnv' on 
> port .
> ...
> 15/10/22 23:09:33 INFO Utils: Successfully started service 'sparkDriver' on 
> port .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970777#comment-14970777
 ] 

Apache Spark commented on SPARK-10947:
--

User 'stephend-realitymine' has created a pull request for this issue:
https://github.com/apache/spark/pull/9249

> With schema inference from JSON into a Dataframe, add option to infer all 
> primitive object types as strings
> ---
>
> Key: SPARK-10947
> URL: https://issues.apache.org/jira/browse/SPARK-10947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ewan Leith
>Priority: Minor
>
> Currently, when a schema is inferred from a JSON file using 
> sqlContext.read.json, the primitive object types are inferred as string, 
> long, boolean, etc.
> However, if the inferred type is too specific (JSON obviously does not 
> enforce types itself), this causes issues with merging dataframe schemas.
> Instead, we would like an option in the JSON inferField function to treat all 
> primitive objects as strings.
> We'll create and submit a pull request for this for review.
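A hypothetical usage sketch of such an option (the option name and path below are assumptions, not the final API):

{code}
// Sketch only: read JSON with every primitive value inferred as StringType, so that
// schemas inferred from different files merge without type conflicts.
// Assumes a spark-shell session with sqlContext in scope.
val events = sqlContext.read
  .option("primitivesAsString", "true")   // assumed option name
  .json("hdfs:///data/events/*.json")     // assumed path
events.printSchema()                      // all primitive fields would show up as string
{code}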



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970783#comment-14970783
 ] 

Sean Owen commented on SPARK-11016:
---

NB: the resolution here may be to simply remove usage of roaringbitmaps: 
https://github.com/apache/spark/pull/9243

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a job requires a more 
> recent version (>0.5.0) of RoaringBitmap; 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0.
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-23 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970786#comment-14970786
 ] 

patcharee commented on SPARK-11087:
---

[~zzhan] I found the predicate generated in the executor log for the case that uses 
the DataFrame API (not hiveContext.sql). Sorry for my mistake, and thanks for your help!
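For reference, a hypothetical sketch of a DataFrame-style query of the kind referred to above (column names taken from the table schema in the description; not the exact code used):

{code}
// Sketch only: with filterPushdown enabled, this DataFrame form produced a
// pushed-down predicate in the executor logs in the reporter's setup.
import org.apache.spark.sql.functions.col

hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
hiveContext.table("4D")
  .filter(col("zone") === 2 && col("x") === 320 && col("y") === 117)
  .select("u", "v")
  .show()
{code}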

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external Hive table stored as a partitioned ORC file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
> But in the log file, with debug-level logging on, no ORC pushdown predicate 
> was generated. 
> Unfortunately my table was not sorted when I inserted the data, but I still 
> expected the ORC pushdown predicate to be generated because of the where clause.
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name              data_type   comment
>
> date                    int
> hh                      int
> x                       int
> y                       int
> height                  float
> u                       float
> v                       float
> w                       float
> ph                      float
> phb                     float
> t                       float
> p                       float
> pb                      float
> qvapor                  float
> qgraup                  float
> qnice                   float
> qnrain                  float
> tke_pbl                 float
> el_pbl                  float
> qcloud                  float
>
> # Partition Information
> # col_name              data_type   comment
>
> zone                    int
> z                       int
> year                    int
> month                   int
>
> # Detailed Table Information
> Database:               default
> Owner:                  patcharee
> CreateTime:             Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:             EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL                TRUE
>   comment                 this table is imported from rwf_data/*/wrf/*
>   last_modified_by        patcharee
>   last_modified_time      1439806692
>   orc.compress            ZLIB
>   transient_lastDdlTime   1439806692
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>   serialization.format    1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> =

[jira] [Closed] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-23 Thread patcharee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

patcharee closed SPARK-11087.
-
Resolution: Not A Problem

The predicate is indeed generated and can be found in the executor log

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external Hive table stored as a partitioned ORC file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
> But in the log file, with debug-level logging on, no ORC pushdown predicate 
> was generated. 
> Unfortunately my table was not sorted when I inserted the data, but I still 
> expected the ORC pushdown predicate to be generated because of the where clause.
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name              data_type   comment
>
> date                    int
> hh                      int
> x                       int
> y                       int
> height                  float
> u                       float
> v                       float
> w                       float
> ph                      float
> phb                     float
> t                       float
> p                       float
> pb                      float
> qvapor                  float
> qgraup                  float
> qnice                   float
> qnrain                  float
> tke_pbl                 float
> el_pbl                  float
> qcloud                  float
>
> # Partition Information
> # col_name              data_type   comment
>
> zone                    int
> z                       int
> year                    int
> month                   int
>
> # Detailed Table Information
> Database:               default
> Owner:                  patcharee
> CreateTime:             Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:             EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL                TRUE
>   comment                 this table is imported from rwf_data/*/wrf/*
>   last_modified_by        patcharee
>   last_modified_time      1439806692
>   orc.compress            ZLIB
>   transient_lastDdlTime   1439806692
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>   serialization.format    1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> 
> Data was inserted into this table by another spark job>
> df.write.format("org.apache.

[jira] [Commented] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic

2015-10-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970801#comment-14970801
 ] 

Yanbo Liang commented on SPARK-9265:


I'm working on it.

> Dataframe.limit joined with another dataframe can be non-deterministic
> --
>
> Key: SPARK-9265
> URL: https://issues.apache.org/jira/browse/SPARK-9265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Priority: Critical
>
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = 
> recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), 
> $"a.suiteName" === $"b.suiteName")
>   
> (1 to 10).foreach { i => 
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows.
> {code}
> +----------------+---------+
> |       suiteName|failCount|
> +----------------+---------+
> |org.apache.spark|       85|
> |org.apache.spark|       26|
> |org.apache.spark|       26|
> |org.apache.spark|       17|
> |org.apache.spark|       17|
> |org.apache.spark|       15|
> |org.apache.spark|       13|
> |org.apache.spark|       13|
> |org.apache.spark|       11|
> |org.apache.spark|        9|
> +----------------+---------+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}
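A hypothetical workaround sketch (this is not the fix being worked on): materialize the limited DataFrame once before the self-join so every evaluation sees the same 10 rows. It assumes the same spark-shell session as the snippet above.

{code}
// Sketch only: cache and force evaluation of the limited result, then join against
// the materialized copy instead of re-evaluating the non-deterministic limit.
val topRecentFailuresStable = topRecentFailures.cache()
topRecentFailuresStable.count()   // force materialization once

val motStable = topRecentFailuresStable.as("a")
  .join(recentFailures.as("b"), $"a.suiteName" === $"b.suiteName")

(1 to 10).foreach { i =>
  println(s"$i: " + motStable.count())   // counts should now be stable
}
{code}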



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10857) SQL injection bug in JdbcDialect.getTableExistsQuery()

2015-10-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970836#comment-14970836
 ] 

Sean Owen commented on SPARK-10857:
---

Rick, you're saying that this code path only comes up when the parser is 
certainly dealing with a table name, as in DDL statements, and not just in 
parsing "SELECT * from (table)"? (You probably know the code best here, given 
you've studied it at close range.)

> SQL injection bug in JdbcDialect.getTableExistsQuery()
> --
>
> Key: SPARK-10857
> URL: https://issues.apache.org/jira/browse/SPARK-10857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Rick Hillegas
>Priority: Minor
>
> All of the implementations of this method involve constructing a query by 
> concatenating boilerplate text with a user-supplied name. This looks like a 
> SQL injection bug to me.
> A better solution would be to call java.sql.DatabaseMetaData.getTables() to 
> implement this method, using the catalog and schema which are available from 
> Connection.getCatalog() and Connection.getSchema(). This would not work on 
> Java 6 because Connection.getSchema() was introduced in Java 7. However, the 
> solution would work for more modern JVMs. Limiting the vulnerability to 
> obsolete JVMs would at least be an improvement over the current situation. 
> Java 6 has been end-of-lifed and is not an appropriate platform for users who 
> are concerned about security.
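A hypothetical sketch of the metadata-based check described above (helper name assumed; not Spark's actual implementation):

{code}
// Sketch only: ask the driver's metadata for the table instead of splicing the
// user-supplied name into a SQL string. Uses a null schema pattern to stay
// compatible with drivers that predate Connection.getSchema() (Java 7+).
import java.sql.Connection

object JdbcMetadataSketch {
  def tableExists(conn: Connection, table: String): Boolean = {
    val rs = conn.getMetaData.getTables(conn.getCatalog, null, table, Array("TABLE"))
    try rs.next() finally rs.close()
  }
}
{code}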



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures

2015-10-23 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970839#comment-14970839
 ] 

Maciej Szymkiewicz commented on SPARK-11167:


spark-csv has a much simpler job to do, and everything it does is already 
covered by basic R behavior. The tightest type here would most likely be Any, 
which is neither allowed nor useful.

I think the best solution in this case could be a warning when a data frame 
contains complex types and the user doesn't provide a schema, and maybe some 
tool that could replace debug.TypeCheck. Can anyone explain why it 'no longer 
applies in the new "Tungsten" world'? 

https://github.com/apache/spark/pull/8043


> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous elements, type inference incorrectly assigns 
> the type of the first encountered element as the type of the whole structure. This 
> problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a"))
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ##  [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on 
> DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 
> 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-10-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970870#comment-14970870
 ] 

Nick Pentreath commented on SPARK-7008:
---

Is this now going into 1.6 (as per SPARK-10324)? If so, is there a PR? I cannot 
find one related to it.

> An implementation of Factorization Machine (LibFM)
> --
>
> Key: SPARK-7008
> URL: https://issues.apache.org/jira/browse/SPARK-7008
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>  Labels: features
> Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
> QQ20150421-2.png
>
>
> An implementation of Factorization Machines based on Scala and Spark MLlib.
> FM is a machine learning algorithm for multi-linear regression and is widely 
> used for recommendation; it has performed well in recent years' recommendation 
> competitions.
> Ref:
> http://libfm.org/
> http://doi.acm.org/10.1145/2168752.2168771
> http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
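For reference, the standard second-order FM model from the libFM references above, in LaTeX (k is the factorization dimension):

{code}
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
\qquad \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}
{code}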



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-10-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970887#comment-14970887
 ] 

Yanbo Liang commented on SPARK-6724:


[~MeethuMathew] I will take over this task and send a PR; you are welcome to 
comment on it.

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API
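A hypothetical sketch of what the import/export API could look like, following the existing MLlib Saveable/Loader convention (FPGrowthModel.save/load are the proposed additions, the path is assumed, and a spark-shell session with sc in scope is presumed):

{code}
// Sketch only: the save/load calls below are the proposed API, not yet implemented.
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}

val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "d")))

val model = new FPGrowth().setMinSupport(0.5).run(transactions)

model.save(sc, "hdfs:///models/fpgrowth")                         // proposed
val sameModel = FPGrowthModel.load(sc, "hdfs:///models/fpgrowth") // proposed
{code}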



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-10-23 Thread Jim Haughwout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970915#comment-14970915
 ] 

Jim Haughwout commented on SPARK-6270:
--

[~tdas]: Can the team update this issue to reflect that this _also_ affects 
Versions 1.3.1, 1.4.0, 1.4.1, 1.5.0, and 1.5.1?

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
>Reporter: Tathagata Das
>Priority: Critical
>
> If event logging is enabled, the Spark Standalone Master tries to recreate the 
> web UI of a completed Spark application from its event logs. However, if this 
> event log is huge (e.g. for a Spark Streaming application), the master hangs in 
> its attempt to read and recreate the web UI. This hang makes the whole standalone 
> cluster unusable. 
> The workaround is to disable event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-10-23 Thread Russell Pierce (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970917#comment-14970917
 ] 

Russell Pierce commented on SPARK-9325:
---

You're right, Spark had been producing an error because the df$col in question 
was a TINYINT stored in Parquet, not that the command itself didn't work.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9325) Support `collect` on DataFrame columns

2015-10-23 Thread Russell Pierce (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970917#comment-14970917
 ] 

Russell Pierce edited comment on SPARK-9325 at 10/23/15 12:53 PM:
--

You're right, Spark had been producing an error because the df$col in question 
was a TINYINT stored in Parquet, not that the command itself didn't work; that 
problem seems to have been addressed in another Issue 
(https://issues.apache.org/jira/browse/SPARK-3575).


was (Author: rpierce):
You're right, Spark had been producing an error because the df$col in question 
was a TINYINT stored in Parquet, not that the command itself didn't work.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11281) Issue with creating and collecting DataFrame using environments

2015-10-23 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-11281:
--

 Summary: Issue with creating and collecting DataFrame using 
environments 
 Key: SPARK-11281
 URL: https://issues.apache.org/jira/browse/SPARK-11281
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.0
 Environment: R 3.2.2, Spark build from master  
487d409e71767c76399217a07af8de1bb0da7aa8
Reporter: Maciej Szymkiewicz


It is not possible to access a Map field created from an environment. Assuming a 
local data frame is created as follows:

{code}
ldf <- data.frame(row.names=1:2)
ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3)))
str(ldf)
## 'data.frame':2 obs. of  1 variable:
##  $ x:List of 2
##   ..$ : 
##   ..$ : 

get("a", ldf$x[[1]])
## [1] 1

get("c", ldf$x[[2]])
## [1] 3
{code}

It is possible to create a Spark data frame:

{code}
sdf <- createDataFrame(sqlContext, ldf)
printSchema(sdf)

## root
##  |-- x: array (nullable = true)
##  ||-- element: map (containsNull = true)
##  |||-- key: string
##  |||-- value: double (valueContainsNull = true)
{code}

but it throws:

{code}
java.lang.IllegalArgumentException: Invalid array type e
{code}

on collect / head. 

The problem seems to be specific to environments and cannot be reproduced when the 
Map comes, for example, from a Cassandra table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA
Maciej Bryński created SPARK-11282:
--

 Summary: Very strange broadcast join behaviour
 Key: SPARK-11282
 URL: https://issues.apache.org/jira/browse/SPARK-11282
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.1
Reporter: Maciej Bryński
Priority: Critical


Hi,
I found some very strange broadcast join behaviour.

Following this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using the hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that whether this feature works depends on executor memory:
in my case the broadcast join works only up to 31G of executor memory.

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.
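For reference, a minimal Scala sketch of the broadcast hint in question (the attached reproduction script itself is PySpark and is not reproduced here; table sizes and names below are made up, and a spark-shell session with sqlContext in scope is assumed):

{code}
// Sketch only: left outer join of a large table against a small broadcast table,
// mirroring the id/val/id2/val2 schema shown above.
import org.apache.spark.sql.functions.broadcast

val big   = sqlContext.range(0, 1000000L).selectExpr("id", "id as val")
val small = sqlContext.range(0, 100L).selectExpr("id as id2", "id as val2")

val joined = big.join(broadcast(small), big("id") === small("id2"), "left_outer")
joined.filter("id = 5").show()   // expected: one row with id2 = 5, val2 = 5
{code}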



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures

2015-10-23 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970922#comment-14970922
 ] 

Maciej Szymkiewicz commented on SPARK-11167:


Related problem: https://issues.apache.org/jira/browse/SPARK-11281


> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous elements, type inference incorrectly assigns 
> the type of the first encountered element as the type of the whole structure. This 
> problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a"))
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ##  [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on 
> DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 
> 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Description: 
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.

  was:
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.


> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
>   spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=5, val2=5)]
>   spark$ ~/spark/bin/spark-submit --executor-memory 32G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Attachment: SPARK-11282.py

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Description: 
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:


spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.

  was:
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.


> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
>   
>   spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=5, val2=5)]
>   spark$ ~/spark/bin/spark-submit --executor-memory 32G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11282.
---
Resolution: Duplicate

[~maver1ck] this could use a better title, and there is no code attached. I 
also strongly suspect it duplicates 
https://issues.apache.org/jira/browse/SPARK-10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
>   spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=5, val2=5)]
>   spark$ ~/spark/bin/spark-submit --executor-memory 32G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Description: 
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

{code}
spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]
{code}

Please find example code attached.

  was:
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:


spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.


> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11167) Incorrect type resolution on heterogeneous data structures

2015-10-23 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-11167:
---
Comment: was deleted

(was: Related problem: https://issues.apache.org/jira/browse/SPARK-11281
)

> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous values, the type of the first encountered 
> element is incorrectly assigned as the type of the whole structure. This problem 
> affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a")
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ##  [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on 
> DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 
> 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970931#comment-14970931
 ] 

Maciej Bryński commented on SPARK-11282:


We had race condition here.
I was attaching file when you answered.

I'll try solution of 10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970930#comment-14970930
 ] 

Steve Loughran commented on SPARK-11265:


I can trigger a failure in a unit test now: once you get past Hive failing to 
load (a classpath issue), the {{get()}} operation fails
{code}
 obtain Tokens For HiveMetastore *** FAILED ***  
java.lang.IllegalArgumentException: wrong number of arguments
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokenForHiveMetastoreInner(YarnSparkHadoopUtil.scala:203)
  at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtilSuite$$anonfun$22.apply(YarnSparkHadoopUtilSuite.scala:254)
  at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtilSuite$$anonfun$22.apply(YarnSparkHadoopUtilSuite.scala:249)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
{code}
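
For reference, a minimal sketch of the reflective factory-style call the description 
points at, assuming Hive now exposes a static {{get(HiveConf)}} factory method (the 
class names come from the issue text; the exact factory signature is an assumption):

{code}
// Hedged sketch only: call the factory method reflectively instead of the
// now-private constructor. Class names are taken from the issue description;
// whether the factory is get(HiveConf) is an assumption.
val hiveConfClass = Class.forName("org.apache.hadoop.hive.conf.HiveConf")
val hiveClass = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive")
val hiveConf = hiveConfClass.newInstance().asInstanceOf[Object]
// null receiver because the factory method is static; exactly one argument
val hive = hiveClass.getMethod("get", hiveConfClass).invoke(null, hiveConf)
{code}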

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails. This appears to be because the 
> constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970931#comment-14970931
 ] 

Maciej Bryński edited comment on SPARK-11282 at 10/23/15 1:07 PM:
--

We had race condition here.
I was attaching file when you answered.

You're probably right.
I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914


was (Author: maver1ck):
We had race condition here.
I was attaching file when you answered.

Uue're probably right.
I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970931#comment-14970931
 ] 

Maciej Bryński edited comment on SPARK-11282 at 10/23/15 1:07 PM:
--

We had race condition here.
I was attaching file when you answered.

Uue're probably right.
I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914


was (Author: maver1ck):
We had race condition here.
I was attaching file when you answered.

I'll try solution of 10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970952#comment-14970952
 ] 

Maciej Bryński commented on SPARK-11282:


UPDATE:
Looks like -XX:-UseCompressedOops solves the problem.
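
A minimal sketch of one way to apply the flag, assuming it is passed through the 
standard extraJavaOptions properties so that the driver and executors agree on the 
compressed-oops setting (driver options are normally supplied at submit time rather 
than set programmatically):

{code}
import org.apache.spark.SparkConf

// Hedged sketch: force -XX:-UseCompressedOops on executors (and the driver)
// via the standard extraJavaOptions configuration properties.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops")
  .set("spark.driver.extraJavaOptions", "-XX:-UseCompressedOops")
{code}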

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-10-23 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-6270:
--
Affects Version/s: 1.5.1

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If the event logging is enabled, the Spark Standalone Master tries to 
> recreate the web UI of a completed Spark application from its event logs. 
> However if this event log is huge (e.g. for a Spark Streaming application), 
> then the master hangs in its attempt to read and recreate the web ui. This 
> hang causes the whole standalone cluster to be unusable. 
> Workaround is to disable the event logging.
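
A minimal sketch of the workaround mentioned above, assuming the standard 
spark.eventLog.enabled property is what controls event logging for the application:

{code}
import org.apache.spark.SparkConf

// Hedged sketch: disable event logging per application so the standalone master
// does not try to replay a huge event log when the application finishes.
val conf = new SparkConf().set("spark.eventLog.enabled", "false")
{code}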



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11283) List column gets additional level of nesting when converted to Spark DataFrame

2015-10-23 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-11283:
--

 Summary: List column gets additional level of nesting when 
converted to Spark DataFrame
 Key: SPARK-11283
 URL: https://issues.apache.org/jira/browse/SPARK-11283
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.0
 Environment: R 3.2.2, Spark build from master 
487d409e71767c76399217a07af8de1bb0da7aa8
Reporter: Maciej Szymkiewicz


When an input data frame contains a list column, there is an additional level of 
nesting in the resulting Spark DataFrame and, as a result, the collected data is no 
longer identical to the input:

{code}
ldf <- data.frame(row.names=1:2)
ldf$x <- list(list(1), list(2))
sdf <- createDataFrame(sqlContext, ldf)

printSchema(sdf)
## root
##  |-- x: array (nullable = true)
##  ||-- element: array (containsNull = true)
##  |||-- element: double (containsNull = true)

identical(ldf, collect(sdf))
## [1] FALSE
{code}

Comparing structure:

Local df

{code}
unclass(ldf)
## $x
## $x[[1]]
## $x[[1]][[1]]
## [1] 1
##
## $x[[2]]
## $x[[2]][[1]]
## [1] 2
##
## attr(,"row.names")
## [1] 1 2
{code}

Collected

{code}
unclass(collect(sdf))
## $x
## $x[[1]]
## $x[[1]][[1]]
## $x[[1]][[1]][[1]]
## [1] 1
## 
## $x[[2]]
## $x[[2]][[1]]
## $x[[2]][[1]][[1]]
## [1] 2
##
## attr(,"row.names")
## [1] 1 2
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

2015-10-23 Thread Glyton Camilleri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971004#comment-14971004
 ] 

Glyton Camilleri commented on SPARK-6847:
-

Hi,
we managed to get rid of the overflow issues by setting checkpoints on more streams 
than we thought we needed to, in addition to implementing a small change following 
your suggestion; before the fix, the setup was similar to what you describe:

{code}
val dStream1 = // create kafka stream and do some preprocessing
val dStream2 = dStream1.updateStateByKey { func }.checkpoint(timeWindow * 2)
val dStream3 = dStream2.map { ... }

// (1) perform some side-effect on the state
if (certainConditionsAreMet) dStream2.foreachRDD { 
  _.foreachPartition { ... }
}

// (2) publish final results to a set of Kafka topics
dStream3.transform { ... }.foreachRDD {
  _.foreachPartition { ... }
}
{code}

There were two things we did:
a) set different checkpoints for {{dStream2}} and {{dStream3}}, whereas before 
we were only setting the checkpoint for {{dStream2}}
b) changed (1) above such that when {{!certainConditionsAreMet}}, we just 
consume the stream like you describe in your suggestion

I honestly think that b) was the more influential change in removing the 
StackOverflowError, but we decided to leave the checkpoint settings from a) in 
place anyway.
Apologies for the late follow-up, but we needed to make sure the issue had 
actually been resolved.
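
A hedged, runnable sketch of the reworked setup described in a) and b), grafted onto 
the repro code from the issue description (the port, the checkpoint intervals and the 
condition are placeholders, not the real job):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val certainConditionsAreMet = false  // placeholder for the real condition

val sparkConf = new SparkConf().setAppName("test")
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("checkpoint")

val source = ssc.socketTextStream("localhost", 9999)  // placeholder port
val updated = source.map((1, _)).updateStateByKey(
  (newValues: Seq[String], oldState: Option[String]) => newValues.headOption.orElse(oldState))
updated.checkpoint(Seconds(20))   // (a) checkpoint the stateful stream explicitly
val mapped = updated.map(_._2)
mapped.checkpoint(Seconds(10))    // (a) ...and the derived stream as well

// (b) always consume the stateful stream; only run the side-effect when needed
updated.foreachRDD { rdd =>
  if (certainConditionsAreMet) rdd.foreachPartition(_ => ()) else rdd.foreach(_ => ())
}

mapped.foreachRDD((rdd, t) => println(s"$t: ${rdd.collect().length}"))

ssc.start()
ssc.awaitTermination()
{code}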

> Stack overflow on updateStateByKey which followed by a dstream with 
> checkpoint set
> --
>
> Key: SPARK-6847
> URL: https://issues.apache.org/jira/browse/SPARK-6847
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Jack Hu
>  Labels: StackOverflowError, Streaming
>
> The issue happens with the following sample code: uses {{updateStateByKey}} 
> followed by a {{map}} with checkpoint interval 10 seconds
> {code}
> val sparkConf = new SparkConf().setAppName("test")
> val streamingContext = new StreamingContext(sparkConf, Seconds(10))
> streamingContext.checkpoint("""checkpoint""")
> val source = streamingContext.socketTextStream("localhost", )
> val updatedResult = source.map(
> (1,_)).updateStateByKey(
> (newlist : Seq[String], oldstate : Option[String]) => 
> newlist.headOption.orElse(oldstate))
> updatedResult.map(_._2)
> .checkpoint(Seconds(10))
> .foreachRDD((rdd, t) => {
>   println("Deep: " + rdd.toDebugString.split("\n").length)
>   println(t.toString() + ": " + rdd.collect.length)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> From the output, we can see that the dependency will be increasing time over 
> time, the {{updateStateByKey}} never get check-pointed,  and finally, the 
> stack overflow will happen. 
> Note:
> * The rdd in {{updatedResult.map(_._2)}} get check-pointed in this case, but 
> not the {{updateStateByKey}} 
> * If remove the {{checkpoint(Seconds(10))}} from the map result ( 
> {{updatedResult.map(_._2)}} ), the stack overflow will not happen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11284) ALS produces predictions as floats and should be double

2015-10-23 Thread Dominik Dahlem (JIRA)
Dominik Dahlem created SPARK-11284:
--

 Summary: ALS produces predictions as floats and should be double
 Key: SPARK-11284
 URL: https://issues.apache.org/jira/browse/SPARK-11284
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.1
 Environment: All
Reporter: Dominik Dahlem


Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated using 
the RegressionEvaluator because of a type mismatch between the model 
transformation and the evaluation APIs. One can work around this by casting the 
prediction column to double before passing it into the evaluator. However, 
this does not work with pipelines and cross validation.

Code and traceback below:

{code}
als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
itemCol='movieID', ratingCol='rating')
model = als.fit(training)
predictions = model.transform(validation)
evaluator = RegressionEvaluator(predictionCol='prediction', 
labelCol='rating')
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
'rmse'})
{code}

Traceback:
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
'rmse'})
  File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
 line 63, in evaluate
  File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
 line 94, in _evaluate
  File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
 line 813, in __call__
  File 
"/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py", 
line 42, in deco
raise IllegalArgumentException(s.split(': ', 1)[1])
pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
prediction must be of type DoubleType but was actually FloatType.
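
A minimal sketch of the cast workaround in Scala (the report uses pyspark.ml, but the 
DataFrame API has the same shape; the column names and the {{predictions}} DataFrame 
are assumed to match the snippet above):

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hedged sketch: cast the float prediction column to double before evaluation.
def rmseOf(predictions: DataFrame): Double = {
  val cast = predictions.withColumn("prediction", col("prediction").cast("double"))
  new RegressionEvaluator()
    .setPredictionCol("prediction")
    .setLabelCol("rating")
    .setMetricName("rmse")
    .evaluate(cast)
}
{code}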




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

2015-10-23 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971025#comment-14971025
 ] 

Frank Rosner commented on SPARK-11258:
--

Actually, I am pretty confused now. Thinking about it, having a for loop and a 
map should not access every element more than once. However, it still seems more 
complex than required to me. Let me try to reproduce the fact that we could not 
load the data with the old function but could with the new one. Maybe the .toArray 
method is a problem with memory, as it first recreates the whole shebang and then 
copies it to another array?
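
For illustration, a hedged sketch of the single-pass transpose described in the 
Solution section quoted below (the real {{SQLUtils.dfToCols}} signature may differ; 
this only demonstrates the idea of filling all columns in one pass over the rows):

{code}
import org.apache.spark.sql.{DataFrame, Row}

// Hedged sketch: build the column-wise representation with a single pass
// over the collected rows instead of one pass per column.
def dfToColsOnePass(df: DataFrame): Array[Array[Any]] = {
  val localRows: Array[Row] = df.collect()
  val numCols = df.schema.length
  val cols = Array.fill(numCols)(new Array[Any](localRows.length))
  var i = 0
  while (i < localRows.length) {
    val row = localRows(i)
    var j = 0
    while (j < numCols) {
      cols(j)(i) = row(j)  // place each value directly into its column
      j += 1
    }
    i += 1
  }
  cols
}
{code}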

> Remove quadratic runtime complexity for converting a Spark DataFrame into an 
> R data.frame
> -
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run 
> time complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns).
> h4. Problem
> The {{dfToCols}} method is transposing the row-wise representation of the 
> Spark DataFrame (array of rows) into a column wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, yielding to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row wise representation to the column wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes average 2267 ms to complete. My implementation takes only 554 ms 
> on average. This effect gets even bigger, the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11016:


Assignee: Apache Spark

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>Assignee: Apache Spark
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971027#comment-14971027
 ] 

Apache Spark commented on SPARK-11016:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9243

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11016:


Assignee: (was: Apache Spark)

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971042#comment-14971042
 ] 

Apache Spark commented on SPARK-10562:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9251

> .partitionBy() creates the metastore partition columns in all lowercase, but 
> persists the data path as MixedCase resulting in an error when the data is 
> later attempted to query.
> -
>
> Key: SPARK-10562
> URL: https://issues.apache.org/jira/browse/SPARK-10562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jason Pohl
>Assignee: Wenchen Fan
> Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable() it creates the 
> partition-by columns in all lowercase in the metastore. However, it writes 
> the data to the filesystem using mixed case.
> This causes an error when running a select against the table.
> --
> from pyspark.sql import Row
> # Create a data frame with mixed case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year (which is a 
> mixedCase name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> --The metastore is showing a partition column name of all lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> %sql
> --Now try to run a query against this table
> select * from chelsea_goals
> Error in SQL statement: UncheckedExecutionException: 
> java.lang.RuntimeException: Partition column year not found in schema 
> StructType(StructField(Goals,LongType,true), 
> StructField(Name,StringType,true), StructField(Year,LongType,true))
> # Now lets try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>  Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> %sql select * from chelsea_goals2;
> --Now everything works



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11284) ALS produces predictions as floats and should be double

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11284:


Assignee: (was: Apache Spark)

> ALS produces predictions as floats and should be double
> ---
>
> Key: SPARK-11284
> URL: https://issues.apache.org/jira/browse/SPARK-11284
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: All
>Reporter: Dominik Dahlem
>  Labels: ml, recommender
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Using pyspark.ml and DataFrames, The ALS recommender cannot be evaluated 
> using the RegressionEvaluator, because of a type mis-match between the model 
> transformation and the evaluation APIs. One can work around this by casting 
> the prediction column into double before passing it into the evaluator. 
> However, this does not work with pipelines and cross validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', 
> labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, 
> {evaluator.metricName: 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11284) ALS produces predictions as floats and should be double

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11284:


Assignee: Apache Spark

> ALS produces predictions as floats and should be double
> ---
>
> Key: SPARK-11284
> URL: https://issues.apache.org/jira/browse/SPARK-11284
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: All
>Reporter: Dominik Dahlem
>Assignee: Apache Spark
>  Labels: ml, recommender
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Using pyspark.ml and DataFrames, The ALS recommender cannot be evaluated 
> using the RegressionEvaluator, because of a type mis-match between the model 
> transformation and the evaluation APIs. One can work around this by casting 
> the prediction column into double before passing it into the evaluator. 
> However, this does not work with pipelines and cross validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', 
> labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, 
> {evaluator.metricName: 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11261) Provide a more flexible alternative to Jdbc RDD

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11261.
---
Resolution: Won't Fix

> Provide a more flexible alternative to Jdbc RDD
> ---
>
> Key: SPARK-11261
> URL: https://issues.apache.org/jira/browse/SPARK-11261
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Richard Marscher
>
> The existing JdbcRDD only covers a limited number of use cases by requiring 
> the semantics of your query to operate on upper and lower bound predicates 
> like: "select title, author from books where ? <= id and id <= ?"
> However, there are many use cases that cannot use such a method and/or are 
> much more inefficient doing so.
> For example, we have a MySQL table partitioned on a partition key. We don't 
> have range values to lookup but rather want to get all entries matching a 
> predicate and have Spark run 1 query in a partition against each logical 
> partition of our MySQL table. For example: "select * from devices where 
> partition_id = ? and app_id = 'abcd'".
> Another use case, looking up against a distinct set of identifiers that don't 
> fall within an ordering. "select * from users where user_id in 
> (?,?,?,?,?,?,?)". The number of identifiers may be quite large and/or dynamic.
> Solution:
> Instead of addressing each use case differently with new RDD types, provide 
> an alternate, general RDD that gives the user direct control over how the 
> query is partitioned in Spark and filling in the placeholders.
> The user should be able to control which placeholder values are available on 
> each partition of the RDD and also how they are inserted into the 
> PreparedStatement. Ideally it can support dynamic placeholder values like 
> inserting a set of values for an IN clause or similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11284) ALS produces predictions as floats and should be double

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971055#comment-14971055
 ] 

Apache Spark commented on SPARK-11284:
--

User 'dahlem' has created a pull request for this issue:
https://github.com/apache/spark/pull/9252

> ALS produces predictions as floats and should be double
> ---
>
> Key: SPARK-11284
> URL: https://issues.apache.org/jira/browse/SPARK-11284
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: All
>Reporter: Dominik Dahlem
>  Labels: ml, recommender
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Using pyspark.ml and DataFrames, The ALS recommender cannot be evaluated 
> using the RegressionEvaluator, because of a type mis-match between the model 
> transformation and the evaluation APIs. One can work around this by casting 
> the prediction column into double before passing it into the evaluator. 
> However, this does not work with pipelines and cross validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', 
> labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, 
> {evaluator.metricName: 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
>   File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10857) SQL injection bug in JdbcDialect.getTableExistsQuery()

2015-10-23 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971101#comment-14971101
 ] 

Rick Hillegas commented on SPARK-10857:
---

Hi Sean,

That is my understanding after examining the code with IntelliJ. If we intend 
to use this method for some other purpose in the future, then, to avoid 
confusion, I would recommend renaming it. "getTableExistsQuery()" is a sensible 
question to ask about a base table or view. But it strikes my ear as an odd 
question to ask about an arbitrary query expression.

Thanks,
-Rick


> SQL injection bug in JdbcDialect.getTableExistsQuery()
> --
>
> Key: SPARK-10857
> URL: https://issues.apache.org/jira/browse/SPARK-10857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Rick Hillegas
>Priority: Minor
>
> All of the implementations of this method involve constructing a query by 
> concatenating boilerplate text with a user-supplied name. This looks like a 
> SQL injection bug to me.
> A better solution would be to call java.sql.DatabaseMetaData.getTables() to 
> implement this method, using the catalog and schema which are available from 
> Connection.getCatalog() and Connection.getSchema(). This would not work on 
> Java 6 because Connection.getSchema() was introduced in Java 7. However, the 
> solution would work for more modern JVMs. Limiting the vulnerability to 
> obsolete JVMs would at least be an improvement over the current situation. 
> Java 6 has been end-of-lifed and is not an appropriate platform for users who 
> are concerned about security.
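
A minimal sketch of the metadata-based check described above (the method name and 
boolean return convention here are assumptions; only the JDBC calls are standard):

{code}
import java.sql.Connection

// Hedged sketch: use DatabaseMetaData.getTables instead of concatenating the
// user-supplied name into a SQL string. Requires Java 7+ for Connection.getSchema.
def tableExists(conn: Connection, table: String): Boolean = {
  val rs = conn.getMetaData.getTables(
    conn.getCatalog, conn.getSchema, table, Array("TABLE", "VIEW"))
  try rs.next() finally rs.close()
}
{code}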



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11285) Infinite TaskCommitDenied loop

2015-10-23 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-11285:
-

 Summary: Infinite TaskCommitDenied loop
 Key: SPARK-11285
 URL: https://issues.apache.org/jira/browse/SPARK-11285
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Ryan Williams


I've seen several apps enter this failing state in the last couple of days. 
I've gathered all the documentation I can about two of them, 
[application_1444948191538_0051|https://www.dropbox.com/sh/ku9btpsbwrizx9y/AAAXIY0VhMqFabJBCtTVYxtma?dl=0]
 and 
[application_1444948191538_0116|https://www.dropbox.com/home/spark/application_1444948191538_0116].
 Both were run on Spark 1.5.0 in yarn-client mode with dynamic allocation of 
executors.

In application_1444948191538_0051, partitions 5808 and 6109 in stage-attempt 
1.0 failed 7948 and 7921 times, respectively, before I killed the app. In both 
cases, the first two attempts failed due to {{ExecutorLostFailure}}'s, and the 
remaining ~7900 attempts all failed due to {{TaskCommitDenied}}'s, over ~6hrs 
at a rate of about once per ~4s. See the last several thousand lines of 
[application_1444948191538_0051/driver|https://www.dropbox.com/s/f3zghuzuxobyzem/driver?dl=0].

In application_1444948191538_0116, partition 10593 in stage-attempt 6.0 failed 
its first attempt due to an ExecutorLostFailure, and then a subsequent 219 
attempts in ~22mins due to {{TaskCommitDenied}}'s before I killed the app. 
Again, [the driver logs|https://www.dropbox.com/s/ay1398p017qp712/driver?dl=0] 
enumerate each attempt.

I'm guessing that the OutputCommitCoordinator is getting stuck due to early 
failed attempts?

I'm trying to re-run some of these jobs on a 1.5.1 release and will let you 
know if I repro it there as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11258) Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory

2015-10-23 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-11258:
-
Summary: Converting a Spark DataFrame into an R data.frame is slow / 
requires a lot of memory  (was: Remove quadratic runtime complexity for 
converting a Spark DataFrame into an R data.frame)

> Converting a Spark DataFrame into an R data.frame is slow / requires a lot of 
> memory
> 
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run 
> time complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns).
> h4. Problem
> The {{dfToCols}} method is transposing the row-wise representation of the 
> Spark DataFrame (array of rows) into a column wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, yielding to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row-wise representation to the column-wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes 2267 ms on average to complete. My implementation takes only 554 ms 
> on average. This effect gets even bigger the more columns you have.
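
For illustration, a minimal Scala sketch of the one-pass transpose described above versus the per-column passes it replaces; the object and method names are illustrative, not the actual {{SQLUtils.dfToCols}} code:

{code}
import org.apache.spark.sql.{DataFrame, Row}

object DfToColsSketch {
  // Old behaviour described above: one full pass over the collected rows per column.
  def perColumnPasses(df: DataFrame): Array[Array[Any]] = {
    val rows = df.collect()
    Array.tabulate(df.schema.length) { j =>
      rows.map(_.get(j)) // re-traverses every row once per column
    }
  }

  // Proposed behaviour: a single pass over the rows, filling all columns as it goes.
  def singlePassTranspose(df: DataFrame): Array[Array[Any]] = {
    val rows: Array[Row] = df.collect()
    val numCols = df.schema.length
    val cols = Array.fill(numCols)(new Array[Any](rows.length))
    var i = 0
    while (i < rows.length) {
      var j = 0
      while (j < numCols) {
        cols(j)(i) = rows(i).get(j)
        j += 1
      }
      i += 1
    }
    cols
  }
}
{code}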



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11258) Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory

2015-10-23 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-11258:
-
Description: 
h4. Problem

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some mapping and then 
{{.toArray}}, which might cause the problem.

h4. Solution

Directly transpose the row-wise representation to the column-wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
method takes 2267 ms on average to complete. My implementation takes only 554 ms 
on average. This effect gets even bigger the more columns you have.

  was:
h4. Introduction

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run time 
complexity (it goes through the complete data set _m_ times, where _m_ is the 
number of columns).

h4. Problem

The {{dfToCols}} method is transposing the row-wise representation of the Spark 
DataFrame (array of rows) into a column-wise representation (array of columns) 
to then be put into a data frame. This is done in a very inefficient way, 
yielding huge performance (and possibly also memory) problems when 
collecting bigger data frames.

h4. Solution

Directly transpose the row-wise representation to the column-wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
method takes 2267 ms on average to complete. My implementation takes only 554 ms 
on average. This effect gets even bigger the more columns you have.


> Converting a Spark DataFrame into an R data.frame is slow / requires a lot of 
> memory
> 
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Problem
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some mapping and 
> then {{.toArray}}, which might cause the problem.
> h4. Solution
> Directly transpose the row-wise representation to the column-wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes 2267 ms on average to complete. My implementation takes only 554 ms 
> on average. This effect gets even bigger the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11258) Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory

2015-10-23 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971155#comment-14971155
 ] 

Frank Rosner commented on SPARK-11258:
--

I adjusted the description to be more general. I will see if I can get some 
memory profiling or something. Maybe I can also provide a reproducible example.

> Converting a Spark DataFrame into an R data.frame is slow / requires a lot of 
> memory
> 
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Problem
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some mapping and 
> then {{.toArray}}, which might cause the problem.
> h4. Solution
> Directly transpose the row-wise representation to the column-wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes 2267 ms on average to complete. My implementation takes only 554 ms 
> on average. This effect gets even bigger the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11285) Infinite TaskCommitDenied loop

2015-10-23 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams updated SPARK-11285:
--
Description: 
I've seen several apps enter this failing state in the last couple of days. 
I've gathered all the documentation I can about two of them:

* 
[application_1444948191538_0051|https://www.dropbox.com/sh/ku9btpsbwrizx9y/AAAXIY0VhMqFabJBCtTVYxtma?dl=0]
* 
[application_1444948191538_0116|https://www.dropbox.com/home/spark/application_1444948191538_0116]

Both were run on Spark 1.5.0 in yarn-client mode with dynamic allocation of 
executors.

In application_1444948191538_0051, partitions 5808 and 6109 in stage-attempt 
1.0 failed 7948 and 7921 times, respectively, before I killed the app. In both 
cases, the first two attempts failed due to {{ExecutorLostFailure}}'s, and the 
remaining ~7900 attempts all failed due to {{TaskCommitDenied}}'s, over ~6hrs 
at a rate of about once per ~4s. See the last several thousand lines of 
[application_1444948191538_0051/driver|https://www.dropbox.com/s/f3zghuzuxobyzem/driver?dl=0].

In application_1444948191538_0116, partition 10593 in stage-attempt 6.0 failed 
its first attempt due to an ExecutorLostFailure, and then a subsequent 219 
attempts in ~22mins due to {{TaskCommitDenied}}'s before I killed the app. 
Again, [the driver logs|https://www.dropbox.com/s/ay1398p017qp712/driver?dl=0] 
enumerate each attempt.

I'm guessing that the OutputCommitCoordinator is getting stuck due to early 
failed attempts?

I'm trying to re-run some of these jobs on a 1.5.1 release and will let you 
know if I repro it there as well.

  was:
I've seen several apps enter this failing state in the last couple of days. 
I've gathered all the documentation I can about two of them, 
[application_1444948191538_0051|https://www.dropbox.com/sh/ku9btpsbwrizx9y/AAAXIY0VhMqFabJBCtTVYxtma?dl=0]
 and 
[application_1444948191538_0116|https://www.dropbox.com/home/spark/application_1444948191538_0116].
 Both were run on Spark 1.5.0 in yarn-client mode with dynamic allocation of 
executors.

In application_1444948191538_0051, partitions 5808 and 6109 in stage-attempt 
1.0 failed 7948 and 7921 times, respectively, before I killed the app. In both 
cases, the first two attempts failed due to {{ExecutorLostFailure}}'s, and the 
remaining ~7900 attempts all failed due to {{TaskCommitDenied}}'s, over ~6hrs 
at a rate of about once per ~4s. See the last several thousand lines of 
[application_1444948191538_0051/driver|https://www.dropbox.com/s/f3zghuzuxobyzem/driver?dl=0].

In application_1444948191538_0116, partition 10593 in stage-attempt 6.0 failed 
its first attempt due to an ExecutorLostFailure, and then a subsequent 219 
attempts in ~22mins due to {{TaskCommitDenied}}'s before I killed the app. 
Again, [the driver logs|https://www.dropbox.com/s/ay1398p017qp712/driver?dl=0] 
enumerate each attempt.

I'm guessing that the OutputCommitCoordinator is getting stuck due to early 
failed attempts?

I'm trying to re-run some of these jobs on a 1.5.1 release and will let you 
know if I repro it there as well.


> Infinite TaskCommitDenied loop
> --
>
> Key: SPARK-11285
> URL: https://issues.apache.org/jira/browse/SPARK-11285
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>
> I've seen several apps enter this failing state in the last couple of days. 
> I've gathered all the documentation I can about two of them:
> * 
> [application_1444948191538_0051|https://www.dropbox.com/sh/ku9btpsbwrizx9y/AAAXIY0VhMqFabJBCtTVYxtma?dl=0]
> * 
> [application_1444948191538_0116|https://www.dropbox.com/home/spark/application_1444948191538_0116]
> Both were run on Spark 1.5.0 in yarn-client mode with dynamic allocation 
> of executors.
> In application_1444948191538_0051, partitions 5808 and 6109 in stage-attempt 
> 1.0 failed 7948 and 7921 times, respectively, before I killed the app. In 
> both cases, the first two attempts failed due to {{ExecutorLostFailure}}'s, 
> and the remaining ~7900 attempts all failed due to {{TaskCommitDenied}}'s, 
> over ~6hrs at a rate of about once per ~4s. See the last several thousand 
> lines of 
> [application_1444948191538_0051/driver|https://www.dropbox.com/s/f3zghuzuxobyzem/driver?dl=0].
> In application_1444948191538_0116, partition 10593 in stage-attempt 6.0 
> failed its first attempt due to an ExecutorLostFailure, and then a subsequent 
> 219 attempts in ~22mins due to {{TaskCommitDenied}}'s before I killed the 
> app. Again, [the driver 
> logs|https://www.dropbox.com/s/ay1398p017qp712/driver?dl=0] enumerate each 
> attempt.
> I'm guessing that the OutputCommitCoordinator is getting stuck due to early 
> failed attempts?
> I'm trying to re-run some of these jobs on a 1.5.1 release and will let you 
> know if I repro it there as well.

[jira] [Commented] (SPARK-11285) Infinite TaskCommitDenied loop

2015-10-23 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971161#comment-14971161
 ] 

Ryan Williams commented on SPARK-11285:
---

A quick note on the structure of the linked directories with each application's 
logs from above:

The directories are prepared by [this 
script|https://github.com/hammerlab/yarn-logs-helpers/blob/master/yarn-container-logs]
 that I use to parse aggregated YARN logs out into individual containers. Each 
application's directory contains:
* {{events.json}}: the event-log file
* {{driver}}: the driver's stdout; symlink to {{drivers/0}}.
* {{app_master}}: the ApplicationMaster container's output; symlink to 
{{app_masters/0}}.
* {{containers}}: directory containing the output of every executor container.
* {{eids}}: directory containing the output of each executor in the form of 
symlinks to the relevant file in the {{containers}} directory.
* {{tids}}: directory symlinking each task ID to the container-output-file for 
that task.
* {{hosts}}: directories for each host with symlinks to the containers that ran 
on them.

> Infinite TaskCommitDenied loop
> --
>
> Key: SPARK-11285
> URL: https://issues.apache.org/jira/browse/SPARK-11285
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>
> I've seen several apps enter this failing state in the last couple of days. 
> I've gathered all the documentation I can about two of them, 
> [application_1444948191538_0051|https://www.dropbox.com/sh/ku9btpsbwrizx9y/AAAXIY0VhMqFabJBCtTVYxtma?dl=0]
>  and 
> [application_1444948191538_0116|https://www.dropbox.com/home/spark/application_1444948191538_0116].
>  Both were run on Spark 1.5.0 in yarn-client mode with dynamic allocation 
> of executors.
> In application_1444948191538_0051, partitions 5808 and 6109 in stage-attempt 
> 1.0 failed 7948 and 7921 times, respectively, before I killed the app. In 
> both cases, the first two attempts failed due to {{ExecutorLostFailure}}'s, 
> and the remaining ~7900 attempts all failed due to {{TaskCommitDenied}}'s, 
> over ~6hrs at a rate of about once per ~4s. See the last several thousand 
> lines of 
> [application_1444948191538_0051/driver|https://www.dropbox.com/s/f3zghuzuxobyzem/driver?dl=0].
> In application_1444948191538_0116, partition 10593 in stage-attempt 6.0 
> failed its first attempt due to an ExecutorLostFailure, and then a subsequent 
> 219 attempts in ~22mins due to {{TaskCommitDenied}}'s before I killed the 
> app. Again, [the driver 
> logs|https://www.dropbox.com/s/ay1398p017qp712/driver?dl=0] enumerate each 
> attempt.
> I'm guessing that the OutputCommitCoordinator is getting stuck due to early 
> failed attempts?
> I'm trying to re-run some of these jobs on a 1.5.1 release and will let you 
> know if I repro it there as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10382) Make example code in user guide testable

2015-10-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10382.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9109
[https://github.com/apache/spark/pull/9109]

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
> Fix For: 1.6.0
>
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to test these examples automatically. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> This is just one way to implement it. It would be nice to hear more ideas.
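
For illustration, one possible shape of such an example file with marker comments; the {{$example on:guide$}} / {{$example off:guide$}} markers and the data path below are assumptions only, since the actual syntax is still to be discussed:

{code}
package org.apache.spark.examples.ml

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
// $example on:guide$
import org.apache.spark.ml.clustering.KMeans
// $example off:guide$

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample"))
    val sqlContext = new SQLContext(sc)
    // Data path assumed for illustration only.
    val dataset = sqlContext.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // $example on:guide$
    // Only the code between the markers would be lifted into the user guide
    // by the proposed include_example Jekyll tag.
    val kmeans = new KMeans().setK(2).setSeed(1L)
    val model = kmeans.fit(dataset)
    println(s"Cluster centers: ${model.clusterCenters.mkString(", ")}")
    // $example off:guide$

    sc.stop()
  }
}
{code}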



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11265:


Assignee: Apache Spark

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>Assignee: Apache Spark
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized Hadoop cluster fails. This appears to be because the 
> constructor of the {{org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.
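
For illustration, a hedged sketch (not the actual patch in the linked PR) of fetching the metastore delegation token through the static factory method via reflection; the {{Hive.get(HiveConf)}} and {{getDelegationToken(owner, renewer)}} calls are assumptions based on the Hive API change described above:

{code}
import org.apache.hadoop.security.token.Token
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier

object HiveTokenSketch {
  // Obtain a Hive metastore delegation token through the static factory method
  // instead of the now-private constructor. Reflection keeps Hive an optional,
  // compile-time-invisible dependency.
  def obtainHiveToken(owner: String, renewer: String): Option[Token[AbstractDelegationTokenIdentifier]] = {
    try {
      val loader = Thread.currentThread().getContextClassLoader
      val hiveConfClass = loader.loadClass("org.apache.hadoop.hive.conf.HiveConf")
      val hiveClass = loader.loadClass("org.apache.hadoop.hive.ql.metadata.Hive")
      val hiveConf = hiveConfClass.getConstructor().newInstance().asInstanceOf[Object]

      // With the constructor private, the factory method Hive.get(HiveConf)
      // has to be invoked instead.
      val hive = hiveClass.getMethod("get", hiveConfClass).invoke(null, hiveConf)

      val tokenStr = hiveClass
        .getMethod("getDelegationToken", classOf[String], classOf[String])
        .invoke(hive, owner, renewer)
        .asInstanceOf[String]

      val token = new Token[AbstractDelegationTokenIdentifier]()
      token.decodeFromUrlString(tokenStr)
      Some(token)
    } catch {
      // Hive classes not on the classpath, or the metastore is not kerberized.
      case _: Exception => None
    }
  }
}
{code}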



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971190#comment-14971190
 ] 

Apache Spark commented on SPARK-11265:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/9232

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized Hadoop cluster fails. This appears to be because the 
> constructor of the {{org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11265:


Assignee: (was: Apache Spark)

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized Hadoop cluster fails. This appears to be because the 
> constructor of the {{org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971188#comment-14971188
 ] 

Steve Loughran commented on SPARK-11265:


Pull request is : https://github.com/apache/spark/pull/9232

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized Hadoop cluster fails. This appears to be because the 
> constructor of the {{org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10277) Add @since annotation to pyspark.mllib.regression

2015-10-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10277.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8684
[https://github.com/apache/spark/pull/8684]

> Add @since annotation to pyspark.mllib.regression
> -
>
> Key: SPARK-10277
> URL: https://issues.apache.org/jira/browse/SPARK-10277
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6723) Model import/export for ChiSqSelector

2015-10-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6723:
-
Assignee: Jayant Shekhar

> Model import/export for ChiSqSelector
> -
>
> Key: SPARK-6723
> URL: https://issues.apache.org/jira/browse/SPARK-6723
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Jayant Shekhar
>Priority: Minor
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6723) Model import/export for ChiSqSelector

2015-10-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6723.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 6785
[https://github.com/apache/spark/pull/6785]

> Model import/export for ChiSqSelector
> -
>
> Key: SPARK-6723
> URL: https://issues.apache.org/jira/browse/SPARK-6723
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10610) Using AppName instead of AppId in the name of all metrics

2015-10-23 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated SPARK-10610:

Summary: Using AppName instead of AppId in the name of all metrics  (was: 
Using AppName instead AppId in the name of all metrics)

> Using AppName instead of AppId in the name of all metrics
> -
>
> Key: SPARK-10610
> URL: https://issues.apache.org/jira/browse/SPARK-10610
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Yi Tian
>Priority: Minor
>
> When we use {{JMX}} to monitor the Spark system, we have to configure the name 
> of the target metrics in the monitoring system. But the current metric name is 
> {{appId}} + {{executorId}} + {{source}}. So whenever the Spark program is 
> restarted, we have to update the metric names in the monitoring system.
> We should add an optional configuration property to control whether to use the 
> appName instead of the appId in the Spark metrics system.
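
For illustration, a minimal sketch of the proposed switch; the configuration key below is hypothetical, not an existing Spark setting:

{code}
import org.apache.spark.SparkConf

object MetricNamingSketch {
  // Build the metric registry name as appId (or, optionally, appName) + executorId + source.
  def registryName(conf: SparkConf, appId: String, executorId: String, source: String): String = {
    // Hypothetical switch: when enabled, use the stable appName so the metric
    // names survive application restarts.
    val useAppName = conf.getBoolean("spark.metrics.useAppNameAsPrefix", false)
    val prefix = if (useAppName) conf.get("spark.app.name", appId) else appId
    Seq(prefix, executorId, source).mkString(".")
  }
}
{code}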



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971217#comment-14971217
 ] 

Apache Spark commented on SPARK-7970:
-

User 'nitin2goyal' has created a pull request for this issue:
https://github.com/apache/spark/pull/9253

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> Closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time increases linearly on the driver side with the number of RDDs 
> unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
> the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> the ClosureCleaner clean method (see PR 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the ClassReader for the last 'n' classes
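
For illustration, a minimal sketch of option 1 above (constructing the {{MapPartitionsRDD}} directly so that {{SparkContext.clean}} is never called on the closure); {{MapPartitionsRDD}} is {{private[spark]}}, so this has to live under an {{org.apache.spark}} package, and the constructor signature is assumed from the 1.3-1.5 code base:

{code}
package org.apache.spark.sql.execution

import scala.reflect.ClassTag

import org.apache.spark.TaskContext
import org.apache.spark.rdd.{MapPartitionsRDD, RDD}

object NoCleanMapPartitions {
  // prev.mapPartitions(f) would first run the ClosureCleaner on f via sc.clean;
  // building the RDD directly skips that step entirely.
  def mapPartitionsNoClean[T: ClassTag, U: ClassTag](
      prev: RDD[T],
      f: Iterator[T] => Iterator[U]): RDD[U] = {
    new MapPartitionsRDD[U, T](prev, (_: TaskContext, _: Int, iter: Iterator[T]) => f(iter))
  }
}
{code}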



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7970:
---

Assignee: (was: Apache Spark)

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> Closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time increases linearly on the driver side with the number of RDDs 
> unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
> the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> the ClosureCleaner clean method (see PR 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the ClassReader for the last 'n' classes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7970:
---

Assignee: Apache Spark

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
>Assignee: Apache Spark
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> Closure cleaner slows down the execution of Spark SQL queries fired on a union 
> of RDDs. The time increases linearly on the driver side with the number of RDDs 
> unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is spent in 
> the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> the ClosureCleaner clean method (see PR 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the ClassReader for the last 'n' classes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971228#comment-14971228
 ] 

Jerry Lam commented on SPARK-4940:
--

I just want to weigh in on the importance of this issue. My observation is that 
using coarse-grained mode, it is possible that if I configure the total core max to 
20, I could end up having ONE executor with 20 cores. This is not ideal even though I 
have 5 slaves with 32 cores each. It would make more sense to have ONE 
executor per slave, with each executor having 4 cores. 

Is there a workaround at this moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently, in coarse-grained mode, the Spark scheduler simply takes all the 
> resources it can on each node, which can cause uneven distribution based on the 
> resources available on each slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971228#comment-14971228
 ] 

Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:01 PM:


I just want to weigh in on the importance of this issue. My observation is that 
using coarse-grained mode, it is possible that if I configure the total core max to 
20, I could end up having ONE executor with 20 cores. This is not ideal when I 
have 5 slaves with 32 cores each. It would make more sense to have ONE 
executor per slave, with each executor having 4 cores. 

Is there a workaround at this moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!


was (Author: superwai):
I just want to weigh in on the importance of this issue. My observation is that 
using coarse-grained mode, it is possible that if I configure the total core max to 
20, I could end up having ONE executor with 20 cores. This is not ideal even though I 
have 5 slaves with 32 cores each. It would make more sense to have ONE 
executor per slave, with each executor having 4 cores. 

Is there a workaround at this moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently, in coarse-grained mode, the Spark scheduler simply takes all the 
> resources it can on each node, which can cause uneven distribution based on the 
> resources available on each slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971228#comment-14971228
 ] 

Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:02 PM:


I just want to weigh in on the importance of this issue. My observation is that 
using coarse-grained mode, it is possible that if I configure the total core max to 
20, I could end up having ONE executor with 20 cores. This is not ideal when I 
have 5 slaves with 32 cores each. It would make more sense to have ONE 
executor per slave, with each executor having 4 cores. 

Is there a workaround at this moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!


was (Author: superwai):
I just want to weigh in on the importance of this issue. My observation is that 
using coarse-grained mode, it is possible that if I configure the total core max to 
20, I could end up having ONE executor with 20 cores. This is not ideal when I 
have 5 slaves with 32 cores each. It would make more sense to have ONE 
executor per slave, with each executor having 4 cores. 

Is there a workaround at this moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently, in coarse-grained mode, the Spark scheduler simply takes all the 
> resources it can on each node, which can cause uneven distribution based on the 
> resources available on each slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10975) Shuffle files left behind on Mesos without dynamic allocation

2015-10-23 Thread Chris Bannister (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971265#comment-14971265
 ] 

Chris Bannister commented on SPARK-10975:
-

Spark will use the MESOS_DIRECTORY sandbox when not using the shuffle service now 
that SPARK-9708 is merged. Is this a duplicate?

> Shuffle files left behind on Mesos without dynamic allocation
> -
>
> Key: SPARK-10975
> URL: https://issues.apache.org/jira/browse/SPARK-10975
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1
>Reporter: Iulian Dragos
>Priority: Blocker
>
> (from mailing list)
> Running on Mesos in coarse-grained mode. No dynamic allocation or shuffle 
> service. 
> I see that there are two types of temporary files under the /tmp folder 
> associated with every executor: /tmp/spark- and /tmp/blockmgr-. 
> When the job is finished, /tmp/spark- is gone, but the blockmgr directory is 
> left with all its gigabytes in it. 
> The reason is that the logic to clean up files is only enabled when the shuffle 
> service is running; see https://github.com/apache/spark/pull/7820
> The shuffle files should be placed in the Mesos sandbox or under `tmp/spark` 
> unless the shuffle service is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971228#comment-14971228
 ] 

Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:15 PM:


I just want to weigh in on the importance of this issue. My observation is that 
using coarse-grained mode, it is possible that if I configure the total core max to 
20, I could end up having ONE executor with 20 cores. This is not ideal when I 
have 5 slaves with 32 cores each. It would make more sense to have ONE 
executor per slave, with each executor having 4 cores. 

It is very difficult to use because an executor configured with 10GB of RAM could 
have 20 tasks or 1 task allocated to it (assuming 1 CPU per task). Say each 
task could use up to 2GB of RAM; it would be an OOM for 20 tasks (40GB required) 
and underutilized for 1 task (2GB required). 

Is there a workaround at this moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!


was (Author: superwai):
I just want to weigh in on the importance of this issue. My observation is that 
using coarse-grained mode, it is possible that if I configure the total core max to 
20, I could end up having ONE executor with 20 cores. This is not ideal when I 
have 5 slaves with 32 cores each. It would make more sense to have ONE 
executor per slave, with each executor having 4 cores. 

Is there a workaround at this moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently, in coarse-grained mode, the Spark scheduler simply takes all the 
> resources it can on each node, which can cause uneven distribution based on the 
> resources available on each slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


