[jira] [Commented] (SPARK-11728) Replace example code in ml-ensembles.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005334#comment-15005334 ] Apache Spark commented on SPARK-11728: -- User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/9716 > Replace example code in ml-ensembles.md using include_example > - > > Key: SPARK-11728 > URL: https://issues.apache.org/jira/browse/SPARK-11728 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11728) Replace example code in ml-ensembles.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11728: Assignee: (was: Apache Spark) > Replace example code in ml-ensembles.md using include_example > - > > Key: SPARK-11728 > URL: https://issues.apache.org/jira/browse/SPARK-11728 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin > Labels: starter
[jira] [Commented] (SPARK-10673) spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
[ https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005249#comment-15005249 ] Xin Wu commented on SPARK-10673: if the default is false, {code} if (!sc.conf.verifyPartitionPath) { partitionToDeserializer } {code} will not get into the code path you mentioned. The problem is that when the property is set to true, it enters the code path that potentially evaluates all partitions of the table that match the pathPatternStr. The pathPatternStr is computed as "/pathToTable/*/*/.." depending on the number of partition columns. Essentially, the desired partition path is validated against all existing partition paths, including nested directories, which may be numerous. To avoid this potential performance issue, I think we may be able to simplify the code in the else block of the function verifyPartitionPath(). I am working on a fix. > spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions > - > > Key: SPARK-10673 > URL: https://issues.apache.org/jira/browse/SPARK-10673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.5.0 >Reporter: Miklos Christine >Priority: Minor > > In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. > In Spark 1.5, it is now set to false by default. > If a table has a lot of partitions in the underlying filesystem, the code > unnecessarily checks for all the underlying directories when executing a > query. > https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162 > Structure: > {code} > /user/hive/warehouse/table1/year=2015/month=01/ > /user/hive/warehouse/table1/year=2015/month=02/ > /user/hive/warehouse/table1/year=2015/month=03/ > ... > /user/hive/warehouse/table1/year=2014/month=01/ > /user/hive/warehouse/table1/year=2014/month=02/ > {code} > If the registered partitions only contain year=2015 when you run "show > partitions table1", this code path checks for all directories under the > table's root directory. This incurs a significant performance penalty if > there are a lot of partition directories.
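The performance difference described above can be sketched in plain Scala. This is a standalone illustration (not Spark's actual TableReader code): the partition names are made up, and the point is only that a "/pathToTable/*/*" glob visits every on-disk directory, while checking registered partitions directly visits only those.

```scala
// Registered partitions, as "show partitions table1" would report them.
val registered = Set("year=2015/month=01", "year=2015/month=02")

// All partition directories that exist on the filesystem under the table root.
val onDisk = Seq(
  "year=2015/month=01", "year=2015/month=02", "year=2015/month=03",
  "year=2014/month=01", "year=2014/month=02")

// Glob approach ("/pathToTable/*/*"): every on-disk directory matches the
// pattern, so all five are examined, including the three never registered.
val visitedByGlob = onDisk

// Direct approach: only check that each registered partition's path exists.
val visitedDirectly = onDisk.filter(registered.contains)
```

With many unregistered or nested directories, the gap between the two approaches grows with the size of the filesystem tree rather than with the number of registered partitions.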
[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005292#comment-15005292 ] Bartlomiej Alberski commented on SPARK-11553: - Thanks - good to know > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Minor > > row.getInt|Float|Double in SPARK RDD return 0 if row[index] is null. (Even > according to the document they should throw nullException error)
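The behavior reported here can be reproduced without Spark at all, because it falls out of how Scala unboxes a null through a primitive-typed cast. A minimal standalone sketch (plain Scala, not Spark's Row API):

```scala
// A row-like array holding a null where an Int is expected.
val values: Array[Any] = Array(null, 30, 19)

// Mirrors a primitive getter like row.getInt(i): a null unboxes to 0
// silently instead of throwing.
def getIntUnsafe(i: Int): Int = values(i).asInstanceOf[Int]

// A null-aware alternative: surface the missing value as None, forcing the
// caller to decide what a null means.
def getIntOption(i: Int): Option[Int] =
  Option(values(i)).map(_.asInstanceOf[Int])
```

Callers who need to distinguish null from a genuine 0 should check for null (in Spark, `row.isNullAt(i)`) before calling the primitive getter.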
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005355#comment-15005355 ] mustafa elbehery commented on SPARK-5226: - Hello, I would like to use DBSCAN on Spark. [~alitouka] I have tried to use your implementation on 500 MB of data. However, I think the *population of partition index* step is too expensive. Is this implementation going to be available soon? Regards. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. The first candidate, I think, is DBSCAN.
[jira] [Commented] (SPARK-11337) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005338#comment-15005338 ] Xusen Yin commented on SPARK-11337: --- [~mengxr] So far, all docs of the ML and MLlib packages have been converted to include_example except for the DataTypes and BasicStatistics pages. As we discussed before, these two files depend on SPARK-11399. After we finish all the doc replacements, I think we need to sweep the example code again, since there are some trivial issues in some of the snippets. > Make example code in user guide testable > > > Key: SPARK-11337 > URL: https://issues.apache.org/jira/browse/SPARK-11337 > Project: Spark > Issue Type: Umbrella > Components: Documentation >Reporter: Xiangrui Meng >Assignee: Xusen Yin >Priority: Critical > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > One option I propose is to move actual example code to spark/examples and > test compilation in Jenkins builds. Then in the markdown, we can reference > part of the code to show in the user guide. This requires adding a Jekyll tag > that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code} > {% include_example scala ml.KMeansExample guide %} > {code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "example" and put them under `{% highlight %}` in > the markdown. We can discuss the syntax for marker comments. > Sub-tasks are created to move example code from user guide to `examples/`.
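The proposal above hinges on marker comments inside the example files. A hypothetical sketch of what such a file could look like — the `$example on$` / `$example off$` marker syntax shown here is an assumption, since the issue says the syntax is still open for discussion:

```scala
// File (from the issue text):
// examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala
def kMeansExampleBody(): String = {
  // $example on$
  // Only the code between the markers is extracted by the include_example
  // Jekyll tag and rendered inside a {% highlight %} block in the user guide.
  "example body shown in the guide"
  // $example off$
}
```

Everything outside the markers (imports, boilerplate, test scaffolding) stays compilable and Jenkins-testable but never appears in the rendered guide.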
[jira] [Resolved] (SPARK-11573) correct 'reflective access of structural type member method should be enabled' Scala warnings
[ https://issues.apache.org/jira/browse/SPARK-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11573. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9550 [https://github.com/apache/spark/pull/9550] > correct 'reflective access of structural type member method should be > enabled' Scala warnings > - > > Key: SPARK-11573 > URL: https://issues.apache.org/jira/browse/SPARK-11573 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Gabor Liptak >Priority: Minor > Fix For: 1.6.0
[jira] [Updated] (SPARK-11573) correct 'reflective access of structural type member method should be enabled' Scala warnings
[ https://issues.apache.org/jira/browse/SPARK-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11573: -- Assignee: Gabor Liptak Priority: Trivial (was: Minor) Description: was: > correct 'reflective access of structural type member method should be > enabled' Scala warnings > - > > Key: SPARK-11573 > URL: https://issues.apache.org/jira/browse/SPARK-11573 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Gabor Liptak >Assignee: Gabor Liptak >Priority: Trivial > Fix For: 1.6.0
[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005270#comment-15005270 ] Bartlomiej Alberski commented on SPARK-11553: - Please assign me to this issue, as I have already prepared a PR > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Minor > > row.getInt|Float|Double in SPARK RDD return 0 if row[index] is null. (Even > according to the document they should throw nullException error)
[jira] [Resolved] (SPARK-11694) Parquet logical types are not being tested properly
[ https://issues.apache.org/jira/browse/SPARK-11694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11694. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9660 [https://github.com/apache/spark/pull/9660] > Parquet logical types are not being tested properly > --- > > Key: SPARK-11694 > URL: https://issues.apache.org/jira/browse/SPARK-11694 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.6.0 > > > All the physical types are properly tested at {{ParquetIOSuite}} but logical > type mapping is not being tested.
[jira] [Updated] (SPARK-11694) Parquet logical types are not being tested properly
[ https://issues.apache.org/jira/browse/SPARK-11694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11694: --- Assignee: Hyukjin Kwon > Parquet logical types are not being tested properly > --- > > Key: SPARK-11694 > URL: https://issues.apache.org/jira/browse/SPARK-11694 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.6.0 > > > All the physical types are properly tested at {{ParquetIOSuite}} but logical > type mapping is not being tested.
[jira] [Comment Edited] (SPARK-10673) spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
[ https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005249#comment-15005249 ] Xin Wu edited comment on SPARK-10673 at 11/14/15 8:19 AM: -- if the default is false, {code} if (!sc.conf.verifyPartitionPath) { partitionToDeserializer } {code} will not get into the code path you mentioned. The problem is that when the property is set to true, it enters the code path that potentially evaluates all partitions of the table that match the pathPatternStr. The pathPatternStr is computed as "/pathToTable/\*/\*/.." depending on the number of partition columns. Essentially, the desired partition path is validated against all existing partition paths, including nested directories, which may be numerous. To avoid this potential performance issue, I think we may be able to simplify the code in the else block of the function verifyPartitionPath(). I am working on a fix. > spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions > - > > Key: SPARK-10673 > URL: https://issues.apache.org/jira/browse/SPARK-10673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.5.0 >Reporter: Miklos Christine >Priority: Minor > > In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. > In Spark 1.5, it is now set to false by default. > If a table has a lot of partitions in the underlying filesystem, the code > unnecessarily checks for all the underlying directories when executing a > query. > https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162 > Structure: > {code} > /user/hive/warehouse/table1/year=2015/month=01/ > /user/hive/warehouse/table1/year=2015/month=02/ > /user/hive/warehouse/table1/year=2015/month=03/ > ... > /user/hive/warehouse/table1/year=2014/month=01/ > /user/hive/warehouse/table1/year=2014/month=02/ > {code} > If the registered partitions only contain year=2015 when you run "show > partitions table1", this code path checks for all directories under the > table's root directory. This incurs a significant performance penalty if > there are a lot of partition directories.
[jira] [Reopened] (SPARK-11721) The programming guide for Spark SQL in Spark 1.3.0 needs additional imports to work
[ https://issues.apache.org/jira/browse/SPARK-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-11721: --- > The programming guide for Spark SQL in Spark 1.3.0 needs additional imports > to work > --- > > Key: SPARK-11721 > URL: https://issues.apache.org/jira/browse/SPARK-11721 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Neelesh Srinivas Salian >Priority: Trivial > Fix For: 1.3.0 > > > The documentation in > http://spark.apache.org/docs/1.3.0/sql-programming-guide.html in the > Programmatically Specifying the Schema section needs to add a couple more > imports to get the example to run: > import statements for Row and sql.types.
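For concreteness, the two import statements the issue refers to would be (Spark 1.3 API, where the types package lives at org.apache.spark.sql.types; shown as a fragment, since compiling it requires the Spark jars on the classpath):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
```

Without these, the "Programmatically Specifying the Schema" example fails to compile because Row, StructType, and StructField are unresolved.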
[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005284#comment-15005284 ] Sean Owen commented on SPARK-11553: --- That's clear already. We normally assign after it's fixed. > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Minor > > row.getInt|Float|Double in SPARK RDD return 0 if row[index] is null. (Even > according to the document they should throw nullException error)
[jira] [Assigned] (SPARK-11728) Replace example code in ml-ensembles.md using include_example
[ https://issues.apache.org/jira/browse/SPARK-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11728: Assignee: Apache Spark > Replace example code in ml-ensembles.md using include_example > - > > Key: SPARK-11728 > URL: https://issues.apache.org/jira/browse/SPARK-11728 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Apache Spark > Labels: starter
[jira] [Commented] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite
[ https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005782#comment-15005782 ] Apache Spark commented on SPARK-11672: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/9719 > Flaky test: ml.JavaDefaultReadWriteSuite > > > Key: SPARK-11672 > URL: https://issues.apache.org/jira/browse/SPARK-11672 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 1.6.0 > > > Saw several failures on Jenkins, e.g., > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/
[jira] [Updated] (SPARK-11669) Python interface to SparkR GLM module
[ https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11669: -- Target Version/s: (was: 1.5.0, 1.5.1) [~shubhanshumis...@gmail.com] it doesn't make sense to target released versions. If someone can explain this to me, feel free to reopen, but it sounds like you're requesting Python APIs to R. > Python interface to SparkR GLM module > - > > Key: SPARK-11669 > URL: https://issues.apache.org/jira/browse/SPARK-11669 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR >Affects Versions: 1.5.0, 1.5.1 > Environment: LINUX > MAC > WINDOWS >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: GLM, pyspark, sparkR, statistics > > There should be a python interface to the sparkR GLM module. Currently the > only Python library that produces R-style GLM results is statsmodels. > Inspiration for the API can be taken from the following page. > http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html
[jira] [Updated] (SPARK-7799) Move "StreamingContext.actorStream" to a separate project and deprecate it in StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7799: - Target Version/s: (was: 1.6.0) > Move "StreamingContext.actorStream" to a separate project and deprecate it in > StreamingContext > -- > > Key: SPARK-7799 > URL: https://issues.apache.org/jira/browse/SPARK-7799 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Shixiong Zhu > > Move {{StreamingContext.actorStream}} to a separate project and deprecate it > in {{StreamingContext}}
[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
[ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7441: - Target Version/s: (was: 1.6.0) > Implement microbatch functionality so that Spark Streaming can process a > large backlog of existing files discovered in batch in smaller batches > --- > > Key: SPARK-7441 > URL: https://issues.apache.org/jira/browse/SPARK-7441 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Emre Sevinç > Labels: performance > > Implement microbatch functionality so that Spark Streaming can process a huge > backlog of existing files discovered in batch in smaller batches. > Spark Streaming can process already existing files in a directory, and > depending on the value of "{{spark.streaming.minRememberDuration}}" (60 > seconds by default, see SPARK-3276 for more details), this might mean that a > Spark Streaming application receives thousands, or hundreds of thousands, of > files within the first batch interval. This, in turn, leads to a kind of > 'flooding' effect for the streaming application, which has to deal with a > huge number of existing files in a single batch interval. > We propose a very simple change to > {{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a > configuration property such as "{{spark.streaming.microbatch.size}}", it will > either keep its default behavior when {{spark.streaming.microbatch.size}} > has the default value of {{0}} (meaning process as many files as have been > discovered as new in the current batch interval), or process new files in > groups of {{spark.streaming.microbatch.size}} (e.g. in groups of 100). > We have tested this patch at one of our customers, and it has been running > successfully for weeks (e.g. there were cases where our Spark Streaming > application was stopped, and in the meantime tens of thousands of files were > created in a directory, and our Spark Streaming application had to process > those existing files after it was started).
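As a configuration sketch, enabling the proposed behavior could look like the fragment below. Note that {{spark.streaming.microbatch.size}} is the property name proposed in this issue, not a released Spark setting:

```
# spark-defaults.conf fragment (hypothetical, per the proposal above):
# 0 = default behavior (process all newly discovered files at once);
# any positive value = process the backlog in groups of that size.
spark.streaming.microbatch.size  100
```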
[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6227: - Target Version/s: (was: 1.6.0) > PCA and SVD for PySpark > --- > > Key: SPARK-6227 > URL: https://issues.apache.org/jira/browse/SPARK-6227 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.2.1 >Reporter: Julien Amelot >Assignee: Manoj Kumar > > The Dimensionality Reduction techniques are not available via Python (Scala + > Java only). > * Principal component analysis (PCA) > * Singular value decomposition (SVD) > Doc: > http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
[jira] [Commented] (SPARK-6280) Remove Akka systemName from Spark
[ https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005396#comment-15005396 ] Sean Owen commented on SPARK-6280: -- Are this and the other Akka-related items targeted for 1.6 actually going in? The parent targets 2+. [~zsxwing] > Remove Akka systemName from Spark > - > > Key: SPARK-6280 > URL: https://issues.apache.org/jira/browse/SPARK-6280 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Shixiong Zhu > > `systemName` is an Akka concept. An RPC implementation does not need to support > it. > We can hard code the system name in Spark and hide it in the internal Akka > RPC implementation.
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005399#comment-15005399 ] Jeff Zhang commented on SPARK-11725: I am on master > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > +----+-------+----+ > | age|   name|age2| > +----+-------+----+ > |null|Michael|   0| > |  30|   Andy|  31| > |  19| Justin|  20| > +----+-------+----+ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has a null value argument > * Let the udf itself handle null > I would prefer the third option
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005433#comment-15005433 ] Herman van Hovell commented on SPARK-11725: --- I can reproduce the {{-1}} default values on master. This is not the expected behavior. [~marmbrus]/[~rxin] Any idea what causes this? > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > +----+-------+----+ > | age|   name|age2| > +----+-------+----+ > |null|Michael|   0| > |  30|   Andy|  31| > |  19| Justin|  20| > +----+-------+----+ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has a null value argument > * Let the udf itself handle null > I would prefer the third option
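The third option discussed above (let the UDF itself handle null) can be sketched in plain Scala by writing the function against the boxed type instead of the primitive. This is a standalone illustration, not Spark's udf API; in Spark the same function body would be the argument to sqlContext.udf.register:

```scala
// Taking java.lang.Integer instead of Int means a null input is visible to
// the function, which can then decide to propagate null rather than have the
// engine substitute a default primitive value.
val f: java.lang.Integer => java.lang.Integer =
  x => if (x == null) null else java.lang.Integer.valueOf(x + 1)
```

With the primitive signature `(x: Int) => x + 1`, the null never reaches the function at all, which is why a default value appears in the output column.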
[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11720: -- Component/s: SQL > Return Double.NaN instead of null for Mean and Average when count = 0 > - > > Key: SPARK-11720 > URL: https://issues.apache.org/jira/browse/SPARK-11720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jihong MA >Priority: Minor > > Change the default behavior of mean in case of count = 0 from null to > Double.NaN, to make it in line with all other univariate stats functions.
[jira] [Resolved] (SPARK-11669) Python interface to SparkR GLM module
[ https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11669. --- Resolution: Not A Problem > Python interface to SparkR GLM module > - > > Key: SPARK-11669 > URL: https://issues.apache.org/jira/browse/SPARK-11669 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR >Affects Versions: 1.5.0, 1.5.1 > Environment: LINUX > MAC > WINDOWS >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: GLM, pyspark, sparkR, statistics > > There should be a python interface to the sparkR GLM module. Currently the > only python library which creates R style GLM module results in statsmodels. > Inspiration for the API can be taken from the following page. > http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html
[jira] [Updated] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version
[ https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11702: -- Component/s: Spark Core Got it, makes more sense now. > Guava ClassLoading Issue When Using Different Hive Metastore Version > > > Key: SPARK-11702 > URL: https://issues.apache.org/jira/browse/SPARK-11702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Joey Paskhay > > A Guava classloading error can occur when using a different version of the > Hive metastore. > Running the latest version of Spark at this time (1.5.1) and patched versions > of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to > "1.0.0" and "spark.sql.hive.metastore.jars" to > "/lib/*:". When trying to > launch the spark-shell, the sqlContext would fail to initialize with: > {code} > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > com/google/common/base/Predicate when creating Hive client using classpath: > > Please make sure that jars for your version of hive and hadoop are included > in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, > defaultValue=builtin, doc=... > {code} > We verified the Guava libraries are in the huge list of the included jars, > but we saw that in the > org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it > seems to assume that *all* "com.google" (excluding "com.google.cloud") > classes should be loaded from the base class loader. The Spark libraries seem > to have *some* "com.google.common.base" classes shaded in but not all. > See > [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E] > and its replies. > The work-around is to add the guava JAR to the "spark.driver.extraClassPath" > property. 
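The work-around described above could be expressed as a spark-defaults.conf fragment like the one below. The jar and library paths are hypothetical placeholders for this sketch (the original report truncated them) and must be adapted to your environment:

```
# Point the Hive client at the external metastore version and jars,
# and put Guava on the driver classpath so IsolatedClientLoader can
# resolve com.google.common.base.Predicate (paths are examples only).
spark.sql.hive.metastore.version  1.0.0
spark.sql.hive.metastore.jars     /opt/hive/lib/*:/opt/hadoop/lib/*
spark.driver.extraClassPath       /opt/jars/guava-14.0.1.jar
```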
[jira] [Updated] (SPARK-10530) Kill other task attempts when one task attempt belonging to the same task has succeeded in speculation
[ https://issues.apache.org/jira/browse/SPARK-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10530: -- Target Version/s: (was: 1.6.0) Priority: Minor (was: Major) > Kill other task attempts when one task attempt belonging to the same task > has succeeded in speculation > - > > Key: SPARK-10530 > URL: https://issues.apache.org/jira/browse/SPARK-10530 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Jeff Zhang >Priority: Minor > > Currently when speculation is enabled, other task attempts are not killed > once one task attempt in the same task has succeeded. This is not resource > efficient; it would be better to kill the remaining task attempts as soon as > one attempt for the same task succeeds.
[jira] [Resolved] (SPARK-10081) Skip re-computing getMissingParentStages in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-10081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10081. --- Resolution: Won't Fix Target Version/s: (was: 1.6.0) > Skip re-computing getMissingParentStages in DAGScheduler > > > Key: SPARK-10081 > URL: https://issues.apache.org/jira/browse/SPARK-10081 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh > > In DAGScheduler, we can skip re-computing getMissingParentStages when calling > submitStage in handleJobSubmitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10526) Display cores/memory on ExecutorsTab
[ https://issues.apache.org/jira/browse/SPARK-10526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10526. --- Resolution: Won't Fix Target Version/s: (was: 1.6.0) > Display cores/memory on ExecutorsTab > > > Key: SPARK-10526 > URL: https://issues.apache.org/jira/browse/SPARK-10526 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Jeff Zhang >Priority: Minor > > It would be nice to display the resource of each executor on web ui. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005391#comment-15005391 ] Herman van Hovell commented on SPARK-11725: --- I'd rather add a warning than prevent this from happening. I cannot reproduce the {{-1}} default values on Spark 1.5.2. For example: {noformat} val id = udf((x: Int) => { x }) val q = sqlContext .range(1 << 10) .select($"id", when(($"id" mod 2) === 1, $"id").as("val1")) .select($"id", $"val1", id($"val1").as("val2")) q.show // Result: id: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(,IntegerType,List(IntegerType)) q: org.apache.spark.sql.DataFrame = [id: bigint, val1: bigint, val2: int] +---+++ | id|val1|val2| +---+++ | 0|null| 0| | 1| 1| 1| | 2|null| 0| | 3| 3| 3| | 4|null| 0| | 5| 5| 5| | 6|null| 0| | 7| 7| 7| | 8|null| 0| | 9| 9| 9| | 10|null| 0| | 11| 11| 11| | 12|null| 0| | 13| 13| 13| | 14|null| 0| | 15| 15| 15| | 16|null| 0| | 17| 17| 17| | 18|null| 0| | 19| 19| 19| +---+++ only showing top 20 rows {noformat} What version of Spark are you using? > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. 
> {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > +----+-------+----+ > | age| name|age2| > +----+-------+----+ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > +----+-------+----+ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if any udf input argument is null > * Let the udf itself handle null > I would prefer the third option
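The 0 for the null row above is the JVM primitive default leaking through. A minimal plain-Scala sketch of the mechanism (no Spark needed), plus one way a UDF could opt into handling null itself; the Option-based signature is illustrative, not a documented Spark contract:

```scala
// Why an (x: Int) => ... UDF can see 0 for a SQL NULL: unboxing null
// into a primitive Int yields the type's default value, not an error.
val boxed: Any = null
val unboxed = boxed.asInstanceOf[Int]
println(unboxed) // 0

// Illustrative "let the udf handle null" style: accept a boxed
// java.lang.Integer and wrap it in Option, so null maps to None
// instead of a silent default value.
val addOne: java.lang.Integer => Option[Int] = x => Option(x).map(_ + 1)
println(addOne(null)) // None
println(addOne(30))   // Some(31)
```

Registering the boxed/Option variant would make the null visible to the UDF body rather than pre-substituted with a default.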
[jira] [Updated] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder
[ https://issues.apache.org/jira/browse/SPARK-11727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11727: -- Assignee: Wenchen Fan > split ExpressionEncoder into FlatEncoder and ProductEncoder > --- > > Key: SPARK-11727 > URL: https://issues.apache.org/jira/browse/SPARK-11727 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11732: -- Labels: (was: newbie) Fix Version/s: (was: 1.6.0) [~thunterdb] don't set Fix version unless it's fixed > MiMa excludes miss private classes > -- > > Key: SPARK-11732 > URL: https://issues.apache.org/jira/browse/SPARK-11732 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Tim Hunter > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The checks in GenerateMIMAIgnore only check for package private classes, not > private classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9516: - Priority: Minor (was: Major) > Improve Thread Dump page > > > Key: SPARK-9516 > URL: https://issues.apache.org/jira/browse/SPARK-9516 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Nan Zhu >Assignee: Nan Zhu >Priority: Minor > > Originally proposed by [~irashid] in > https://github.com/apache/spark/pull/7808#issuecomment-126788335: > we can enhance the current thread dump page with at least the following two > new features: > 1) sort threads by thread status, > 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10250) Scala PairRDDFunctions.groupByKey() should be fault-tolerant of single large groups
[ https://issues.apache.org/jira/browse/SPARK-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10250: -- Target Version/s: (was: 1.6.0) > Scala PairRDDFunctions.groupByKey() should be fault-tolerant of single large > groups > --- > > Key: SPARK-10250 > URL: https://issues.apache.org/jira/browse/SPARK-10250 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Matt Cheah >Priority: Minor > > PairRDDFunctions.groupByKey() is less robust than Python's equivalent, as > PySpark's groupByKey can spill single large groups to disk. We should bring > the Scala implementation up to parity.
[jira] [Updated] (SPARK-9516) Improve Thread Dump page
[ https://issues.apache.org/jira/browse/SPARK-9516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9516: - Target Version/s: (was: 1.6.0) > Improve Thread Dump page > > > Key: SPARK-9516 > URL: https://issues.apache.org/jira/browse/SPARK-9516 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Nan Zhu >Assignee: Nan Zhu > > Originally proposed by [~irashid] in > https://github.com/apache/spark/pull/7808#issuecomment-126788335: > we can enhance the current thread dump page with at least the following two > new features: > 1) sort threads by thread status, > 2) a filter to grep the threads -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10062) Use tut for typechecking and running code in user guides
[ https://issues.apache.org/jira/browse/SPARK-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10062: -- Target Version/s: (was: 1.6.0) > Use tut for typechecking and running code in user guides > > > Key: SPARK-10062 > URL: https://issues.apache.org/jira/browse/SPARK-10062 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Feynman Liang > > The current process for contributing to the user guide requires > authors/reviewers to manually run any added example code. > We can automate this process by integrating > [tut|https://github.com/tpolecat/tut] into user guide documentation > generation. Tut runs code enclosed inside "```tut ... ```" blocks, providing > typechecking, ensuring that the example code we provide runs, and displaying > the output. > An example project using tut is > [cats|http://non.github.io/cats//typeclasses.html].
[jira] [Commented] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005504#comment-15005504 ] Jason Huang commented on SPARK-9844: Got the same error log in workers and my workers keep being disassociated. 15/11/15 01:25:26 INFO worker.Worker: Asked to kill executor app-20151115012248-0081/2 15/11/15 01:25:26 INFO worker.ExecutorRunner: Runner thread for executor app-20151115012248-0081/2 interrupted 15/11/15 01:25:26 INFO worker.ExecutorRunner: Killing process! 15/11/15 01:25:26 ERROR logging.FileAppender: Error writing stream to file /usr/local/spark-1.5.1-bin-hadoop2.6/work/app-20151115012248-0081/2/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) 15/11/15 01:25:26 INFO worker.Worker: Executor app-20151115012248-0081/2 finished with state KILLED exitStatus 143 15/11/15 01:25:26 INFO worker.Worker: Cleaning up local directories for application app-20151115012248-0081 15/11/15 01:25:26 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.1.2.1:46780] has failed, address is now gated for [5000] ms. 
Reason: [Disassociated] 15/11/15 01:25:26 INFO shuffle.ExternalShuffleBlockResolver: Application app-20151115012248-0081 removed, cleanupLocalDirs = true We use python3 to run our Spark jobs #!/usr/bin/python3 import os import sys SPARK_HOME = "/usr/local/spark" os.environ["SPARK_HOME"] = SPARK_HOME os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle" os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3" sys.path.append(os.path.join(SPARK_HOME, 'python')) sys.path.append(os.path.join(SPARK_HOME, 'python/lib/py4j-0.8.2.1-src.zip')) from pyspark import SparkContext, SparkConf conf = (SparkConf().setMaster("spark://10.1.2.1:7077") .setAppName("Generate") .setAll(( ("spark.cores.max", "1"), ("spark.driver.memory", "1g"), ("spark.executor.memory", "1g"), ("spark.python.worker.memory", "1g" > File appender race condition during SparkWorker shutdown > > > Key: SPARK-9844 > URL: https://issues.apache.org/jira/browse/SPARK-9844 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Alex Liu > > We find this issue still exists in 1.3.1 > {code} > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - Error writing stream to file > /var/lib/spark/worker/worker-0/app-20150728224954-0003/0/stderr > ERROR [Thread-6] 2015-07-28 22:49:57,653 SparkWorker-0 ExternalLogger.java:96 > - java.io.IOException: Stream closed > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,654 SparkWorker-0 ExternalLogger.java:96 > - at 
java.io.FilterInputStream.read(FilterInputStream.java:107) > ~[na:1.8.0_40] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) > ~[spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at > org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) > [spark-core_2.10-1.3.1.1.jar:1.3.1.1] > ERROR [Thread-6] 2015-07-28 22:49:57,655 SparkWorker-0 ExternalLogger.java:96 > - at >
[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005522#comment-15005522 ] Nathan Davis commented on SPARK-10759: -- [~lmoos], is this in progress? I can take it > Missing Python code example in ML Programming guide > --- > > Key: SPARK-10759 > URL: https://issues.apache.org/jira/browse/SPARK-10759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Raela Wang >Assignee: Lauren Moos >Priority: Minor > Labels: starter > > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9928) LogicalLocalTable in ExistingRDD.scala is not referenced by any other code
[ https://issues.apache.org/jira/browse/SPARK-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9928: --- Assignee: Apache Spark > LogicalLocalTable in ExistingRDD.scala is not referenced by any other code > - > > Key: SPARK-9928 > URL: https://issues.apache.org/jira/browse/SPARK-9928 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Gen TANG >Assignee: Apache Spark >Priority: Trivial > Labels: sparksql > > The case class > [LogicalLocalTable|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L118] > in > [ExistingRDD.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala] > is not referenced anywhere else in the source code. It might be dead > code
[jira] [Commented] (SPARK-9928) LogicalLocalTable in ExistingRDD.scala is not referenced by any other code
[ https://issues.apache.org/jira/browse/SPARK-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005525#comment-15005525 ] Apache Spark commented on SPARK-9928: - User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/9717 > LogicalLocalTable in ExistingRDD.scala is not referenced by any other code > - > > Key: SPARK-9928 > URL: https://issues.apache.org/jira/browse/SPARK-9928 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Gen TANG >Priority: Trivial > Labels: sparksql > > The case class > [LogicalLocalTable|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L118] > in > [ExistingRDD.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala] > is not referenced anywhere else in the source code. It might be dead > code
[jira] [Assigned] (SPARK-9928) LogicalLocalTable in ExistingRDD.scala is not referenced by any other code
[ https://issues.apache.org/jira/browse/SPARK-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9928: --- Assignee: (was: Apache Spark) > LogicalLocalTable in ExistingRDD.scala is not referenced by any other code > - > > Key: SPARK-9928 > URL: https://issues.apache.org/jira/browse/SPARK-9928 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1 >Reporter: Gen TANG >Priority: Trivial > Labels: sparksql > > The case class > [LogicalLocalTable|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L118] > in > [ExistingRDD.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala] > is not referenced anywhere else in the source code. It might be dead > code
[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005542#comment-15005542 ] Mark Hamstra commented on SPARK-11153: -- Thanks. > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.5.2, 1.6.0 > > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
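On the user side, the same mitigation can be applied per session while still on parquet-mr 1.7.0. This is a config fragment, not a standalone program; it assumes a live SQLContext in a Spark 1.5.x application, and the key below is the 1.5.x configuration name:

```scala
// Session-level stop-gap sketch: disable Parquet filter push-down
// entirely until a parquet-mr release >= 1.8.1 (with the PARQUET-251
// fix) is in use. Assumes an existing Spark 1.5.x sqlContext.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
```

The fix tracked here is narrower: it keeps push-down on but skips it for {{StringType}} and {{BinaryType}} columns, which are the ones mapped to Parquet {{BINARY}}.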
[jira] [Comment Edited] (SPARK-9844) File appender race condition during SparkWorker shutdown
[ https://issues.apache.org/jira/browse/SPARK-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005504#comment-15005504 ] Jason Huang edited comment on SPARK-9844 at 11/14/15 5:38 PM: -- Got the same error log in workers and my workers keep being disassociated. {code:java} 15/11/15 01:25:26 INFO worker.Worker: Asked to kill executor app-20151115012248-0081/2 15/11/15 01:25:26 INFO worker.ExecutorRunner: Runner thread for executor app-20151115012248-0081/2 interrupted 15/11/15 01:25:26 INFO worker.ExecutorRunner: Killing process! 15/11/15 01:25:26 ERROR logging.FileAppender: Error writing stream to file /usr/local/spark-1.5.1-bin-hadoop2.6/work/app-20151115012248-0081/2/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) 15/11/15 01:25:26 INFO worker.Worker: Executor app-20151115012248-0081/2 finished with state KILLED exitStatus 143 15/11/15 01:25:26 INFO worker.Worker: Cleaning up local directories for application app-20151115012248-0081 15/11/15 01:25:26 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.1.2.1:46780] has failed, address is now gated for [5000] ms. 
Reason: [Disassociated] 15/11/15 01:25:26 INFO shuffle.ExternalShuffleBlockResolver: Application app-20151115012248-0081 removed, cleanupLocalDirs = true {code} We use python3 to run our Spark jobs {code:java} #!/usr/bin/python3 import os import sys SPARK_HOME = "/usr/local/spark" os.environ["SPARK_HOME"] = SPARK_HOME os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle" os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3" sys.path.append(os.path.join(SPARK_HOME, 'python')) sys.path.append(os.path.join(SPARK_HOME, 'python/lib/py4j-0.8.2.1-src.zip')) from pyspark import SparkContext, SparkConf conf = (SparkConf().setMaster("spark://10.1.2.1:7077") .setAppName("Generate") .setAll(( ("spark.cores.max", "1"), ("spark.driver.memory", "1g"), ("spark.executor.memory", "1g"), ("spark.python.worker.memory", "1g" {code} was (Author: jasson15): Got the same error log in workers and my workers keep being disassociated. 15/11/15 01:25:26 INFO worker.Worker: Asked to kill executor app-20151115012248-0081/2 15/11/15 01:25:26 INFO worker.ExecutorRunner: Runner thread for executor app-20151115012248-0081/2 interrupted 15/11/15 01:25:26 INFO worker.ExecutorRunner: Killing process! 
15/11/15 01:25:26 ERROR logging.FileAppender: Error writing stream to file /usr/local/spark-1.5.1-bin-hadoop2.6/work/app-20151115012248-0081/2/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at java.io.FilterInputStream.read(FilterInputStream.java:107) at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) 15/11/15 01:25:26 INFO worker.Worker: Executor app-20151115012248-0081/2 finished with state KILLED exitStatus 143 15/11/15 01:25:26 INFO worker.Worker: Cleaning up local directories for application app-20151115012248-0081 15/11/15 01:25:26 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.1.2.1:46780] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 15/11/15 01:25:26 INFO shuffle.ExternalShuffleBlockResolver: Application app-20151115012248-0081
[jira] [Commented] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005509#comment-15005509 ] Reynold Xin commented on SPARK-11725: - This is the problem of default value in codegen I suspect. https://github.com/apache/spark/blob/22e96b87fb0a0eb4f2f1a8fc29a742ceabff952a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L229 > Let UDF to handle null value > > > Key: SPARK-11725 > URL: https://issues.apache.org/jira/browse/SPARK-11725 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeff Zhang > > I notice that currently spark will take the long field as -1 if it is null. > Here's the sample code. > {code} > sqlContext.udf.register("f", (x:Int)=>x+1) > df.withColumn("age2", expr("f(age)")).show() > Output /// > ++---++ > | age| name|age2| > ++---++ > |null|Michael| 0| > | 30| Andy| 31| > | 19| Justin| 20| > ++---++ > {code} > I think for the null value we have 3 options > * Use a special value to represent it (what spark does now) > * Always return null if the udf input has null value argument > * Let udf itself to handle null > I would prefer the third option -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005572#comment-15005572 ] Nicholas Chammas commented on SPARK-11744: -- Not sure who would be the best person to comment on this. Perhaps [~vanzin], since this is part of the launcher? > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Priority: Minor > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a plain Python shell ({{sc}} is not defined). > {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. 
> Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in > _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in > launch_gateway > raise Exception("Java gateway process exited before sending the driver > its port number") > Exception: Java gateway process exited before sending the driver its port > number > >>> > >>> sc > Traceback (most recent call last): > File "", line 1, in > NameError: name 'sc' is not defined > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-11744: - Description: {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} was: {{bin/pyspark --help}} offers a {{--version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. 
Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Priority: Minor > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a plain Python shell ({{sc}} is not defined). 
> {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. > Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway > raise Exception("Java gateway process exited before sending the driver its port number") > Exception: Java gateway process exited before sending the driver its port number > >>> > >>> sc > Traceback (most recent call last): > File "", line 1, in > NameError: name 'sc' is not defined > {code}
[jira] [Created] (SPARK-11744) bin/pyspark --version doesn't return version and exit
Nicholas Chammas created SPARK-11744: Summary: bin/pyspark --version doesn't return version and exit Key: SPARK-11744 URL: https://issues.apache.org/jira/browse/SPARK-11744 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.2 Reporter: Nicholas Chammas Priority: Minor {{bin/pyspark --help}} offers a {{--version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. 
Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-11744: - Description: {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a broken PySpark shell. {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} was: {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: {code} $ ./spark/bin/pyspark --help Usage: ./bin/pyspark [options] Options: ... --version, Print the version of current Spark ... {code} However, trying to get the version in this way doesn't yield the expected results. 
Instead of printing the version and exiting, we get the version, a stack trace, and then get dropped into a plain Python shell ({{sc}} is not defined). {code} $ ./spark/bin/pyspark --version Python 2.7.10 (default, Aug 11 2015, 23:39:10) [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Type --help for more information. Traceback (most recent call last): File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in sc = SparkContext(pyFiles=add_files) File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway raise Exception("Java gateway process exited before sending the driver its port number") Exception: Java gateway process exited before sending the driver its port number >>> >>> sc Traceback (most recent call last): File "", line 1, in NameError: name 'sc' is not defined {code} > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Priority: Minor > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a broken PySpark shell. 
> {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. > Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway > raise Exception("Java gateway process exited before sending the driver its port number") > Exception: Java gateway process exited before sending the driver its port number > >>> > >>> sc > Traceback (most recent call last): > File "", line 1, in > NameError: name 'sc' is not defined > {code}
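The traceback above shows the shell bootstrap running even though the JVM side already printed the version and exited, leaving the Python REPL with no gateway. A hypothetical sketch (not Spark's actual fix; all names here are illustrative) of how a launcher could short-circuit `--version` before any shell startup:

```python
# Illustrative launcher sketch: handle --version up front and exit, instead
# of falling through to REPL startup against an already-exited JVM gateway.
SPARK_VERSION = "1.5.2"  # placeholder value for illustration


def run_shell(argv):
    # Stand-in for the real bootstrap (shell.py launching the Java gateway
    # and constructing a SparkContext); returns an exit code.
    return 0


def launch(argv):
    """Return an exit code; print the version and stop if --version is given."""
    if "--version" in argv:
        print("Spark version %s" % SPARK_VERSION)
        return 0  # exit cleanly; never start the interactive shell
    return run_shell(argv)
```

With this shape, `launch(["--version"])` prints one line and returns without ever touching the gateway, which is the behavior the reporter expected.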
[jira] [Commented] (SPARK-10673) spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
[ https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005577#comment-15005577 ] Xin Wu commented on SPARK-10673: The fix is being tested; a PR will follow shortly. > spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions > - > > Key: SPARK-10673 > URL: https://issues.apache.org/jira/browse/SPARK-10673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.5.0 >Reporter: Miklos Christine >Priority: Minor > > In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. > In Spark 1.5, it is now set to false by default. > If a table has a lot of partitions in the underlying filesystem, the code > unnecessarily checks for all the underlying directories when executing a > query. > https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162 > Structure: > {code} > /user/hive/warehouse/table1/year=2015/month=01/ > /user/hive/warehouse/table1/year=2015/month=02/ > /user/hive/warehouse/table1/year=2015/month=03/ > ... > /user/hive/warehouse/table1/year=2014/month=01/ > /user/hive/warehouse/table1/year=2014/month=02/ > {code} > If the registered partitions only contain year=2015 when you run "show > partitions table1", this code path checks for all directories under the > table's root directory. This incurs a significant performance penalty if > there are a lot of partition directories.
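The discussion above contrasts globbing the whole table root (`/pathToTable/*/*/...`) against checking only registered partitions. A hypothetical plain-Python sketch of that idea (not the actual Spark patch; function names and the injected `path_exists` callback are illustrative): build the exact directory path for each registered partition from its column values, and test only those paths.

```python
# Sketch: verify only registered partitions by constructing their exact
# paths, instead of globbing and stat-ing every directory under the table.
import os


def partition_path(table_root, partition_spec, column_order):
    # partition_spec: e.g. {"year": "2015", "month": "01"}
    # column_order fixes the directory nesting, e.g. ["year", "month"]
    parts = ["%s=%s" % (col, partition_spec[col]) for col in column_order]
    return os.path.join(table_root, *parts)


def existing_registered_partitions(table_root, specs, column_order, path_exists):
    # path_exists is injected (e.g. a HDFS fs.exists) so the sketch stays
    # self-contained; returns only the registered paths that exist on disk.
    return [p
            for p in (partition_path(table_root, s, column_order) for s in specs)
            if path_exists(p)]
```

This touches the filesystem once per registered partition rather than once per directory under the table root, which is the performance concern raised in the issue.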
[jira] [Updated] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11738: - Summary: Make ArrayType orderable (was: Make array orderable) > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Blocker >
[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join
[ https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005682#comment-15005682 ] Zhan Zhang commented on SPARK-11704: [~maropu] You are right. I mean that fetching over the network is a big overhead. Feel free to work on it. > Optimize the Cartesian Join > --- > > Key: SPARK-11704 > URL: https://issues.apache.org/jira/browse/SPARK-11704 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Zhan Zhang > > Currently CartesianProduct relies on RDD.cartesian, in which the computation > is realized as follows > override def compute(split: Partition, context: TaskContext): Iterator[(T, > U)] = { > val currSplit = split.asInstanceOf[CartesianPartition] > for (x <- rdd1.iterator(currSplit.s1, context); > y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) > } > From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, > which is really heavy and may never finish if n is large, especially when > rdd2 is coming from ShuffleRDD. > We should have some optimization on CartesianProduct by caching rightResults. > The problem is that we don't have a cleanup hook to unpersist rightResults > AFAIK. I think we should have some cleanup hook after query execution. > With the hook available, we can easily optimize such Cartesian joins. I > believe such a cleanup hook may also benefit other query optimizations.
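A minimal plain-Python sketch (no Spark; all names are illustrative) of the cost the issue describes: the nested loop in `compute` re-evaluates the right-side iterator once per left element, while materializing it once, as a cache-style optimization would, pays the cost a single time.

```python
# Counts how often the "right side" is computed, standing in for
# rdd2.iterator(...) in RDD.cartesian (e.g. an expensive shuffle read).
compute_count = {"right": 0}


def right_iterator():
    compute_count["right"] += 1
    return iter([10, 20])


def cartesian_uncached(left):
    # Mirrors the quoted compute(): the right side is recomputed per element.
    for x in left:
        for y in right_iterator():
            yield (x, y)


def cartesian_cached(left):
    # Materialize the right side once, as caching rightResults would.
    cached = list(right_iterator())
    for x in left:
        for y in cached:
            yield (x, y)
```

Running both over a 3-element left side produces the same 6 pairs, but the uncached version invokes the right-side computation 3 times versus once, which is exactly the n-fold recomputation the issue calls out.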
[jira] [Assigned] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11738: Assignee: Apache Spark > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker >
[jira] [Assigned] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11738: Assignee: (was: Apache Spark) > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Blocker >
[jira] [Commented] (SPARK-11738) Make ArrayType orderable
[ https://issues.apache.org/jira/browse/SPARK-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005689#comment-15005689 ] Apache Spark commented on SPARK-11738: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/9718 > Make ArrayType orderable > > > Key: SPARK-11738 > URL: https://issues.apache.org/jira/browse/SPARK-11738 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Blocker >