[jira] [Created] (SPARK-11299) SQL Programming Guide's link to DataFrame Function Reference is wrong

2015-10-25 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11299:
--

 Summary: SQL Programming Guide's link to DataFrame Function 
Reference is wrong
 Key: SPARK-11299
 URL: https://issues.apache.org/jira/browse/SPARK-11299
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Trivial


The SQL Programming Guide's link to the DataFrame Functions Reference points to 
the wrong location: it points to the docs for DataFrame itself, not the 
functions package. 






[jira] [Assigned] (SPARK-9162) Implement code generation for ScalaUDF

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9162:
---

Assignee: Apache Spark

> Implement code generation for ScalaUDF
> --
>
> Key: SPARK-9162
> URL: https://issues.apache.org/jira/browse/SPARK-9162
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-9162) Implement code generation for ScalaUDF

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9162:
---

Assignee: (was: Apache Spark)

> Implement code generation for ScalaUDF
> --
>
> Key: SPARK-9162
> URL: https://issues.apache.org/jira/browse/SPARK-9162
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-11298) When driver sends message "GetExecutorLossReason" to AM, the AM stops.

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973081#comment-14973081
 ] 

Apache Spark commented on SPARK-11298:
--

User 'KaiXinXiaoLei' has created a pull request for this issue:
https://github.com/apache/spark/pull/9268

> When driver sends message "GetExecutorLossReason" to AM, the AM stops.
> --
>
> Key: SPARK-11298
> URL: https://issues.apache.org/jira/browse/SPARK-11298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0, 2+
>
> Attachments: driver.log
>
>
> I got the latest code from GitHub and just ran "bin/spark-shell --master yarn 
> --conf spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=1 --conf 
> spark.shuffle.service.enabled=true". There is an error:
> 15/10/25 12:11:02 ERROR TransportChannelHandler: Connection to 
> /9.96.1.113:35066 has been quiet for 12 ms while there are outstanding 
> requests. Assuming connection is dead; please adjust spark.network.timeout if 
> this is wrong.
> 15/10/25 12:11:02 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from vm113/9.96.1.113:35066 is closed
> 15/10/25 12:11:02 WARN NettyRpcEndpointRef: Ignore message 
> Failure(java.io.IOException: Connection from vm113/9.96.1.113:35066 closed)
> 15/10/25 12:11:02 ERROR YarnScheduler: Lost executor 1 on vm111: Slave lost
> From the log, the error appears when the driver sends the message 
> "GetExecutorLossReason" to the AM. From the code, I think the AM gets this 
> message and should reply.






[jira] [Assigned] (SPARK-11298) When driver sends message "GetExecutorLossReason" to AM, the AM stops.

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11298:


Assignee: Apache Spark

> When driver sends message "GetExecutorLossReason" to AM, the AM stops.
> --
>
> Key: SPARK-11298
> URL: https://issues.apache.org/jira/browse/SPARK-11298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: KaiXinXIaoLei
>Assignee: Apache Spark
> Fix For: 1.6.0, 2+
>
> Attachments: driver.log
>
>
> I got the latest code from GitHub and just ran "bin/spark-shell --master yarn 
> --conf spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=1 --conf 
> spark.shuffle.service.enabled=true". There is an error:
> 15/10/25 12:11:02 ERROR TransportChannelHandler: Connection to 
> /9.96.1.113:35066 has been quiet for 12 ms while there are outstanding 
> requests. Assuming connection is dead; please adjust spark.network.timeout if 
> this is wrong.
> 15/10/25 12:11:02 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from vm113/9.96.1.113:35066 is closed
> 15/10/25 12:11:02 WARN NettyRpcEndpointRef: Ignore message 
> Failure(java.io.IOException: Connection from vm113/9.96.1.113:35066 closed)
> 15/10/25 12:11:02 ERROR YarnScheduler: Lost executor 1 on vm111: Slave lost
> From the log, the error appears when the driver sends the message 
> "GetExecutorLossReason" to the AM. From the code, I think the AM gets this 
> message and should reply.






[jira] [Updated] (SPARK-11298) When driver sends message "GetExecutorLossReason" to AM, the AM stops.

2015-10-25 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-11298:
--
Component/s: YARN

> When driver sends message "GetExecutorLossReason" to AM, the AM stops.
> --
>
> Key: SPARK-11298
> URL: https://issues.apache.org/jira/browse/SPARK-11298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0, 2+
>
> Attachments: driver.log
>
>
> I got the latest code from GitHub and just ran "bin/spark-shell --master yarn 
> --conf spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=1 --conf 
> spark.shuffle.service.enabled=true". There is an error:
> 15/10/25 12:11:02 ERROR TransportChannelHandler: Connection to 
> /9.96.1.113:35066 has been quiet for 12 ms while there are outstanding 
> requests. Assuming connection is dead; please adjust spark.network.timeout if 
> this is wrong.
> 15/10/25 12:11:02 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from vm113/9.96.1.113:35066 is closed
> 15/10/25 12:11:02 WARN NettyRpcEndpointRef: Ignore message 
> Failure(java.io.IOException: Connection from vm113/9.96.1.113:35066 closed)
> 15/10/25 12:11:02 ERROR YarnScheduler: Lost executor 1 on vm111: Slave lost
> From the log, the error appears when the driver sends the message 
> "GetExecutorLossReason" to the AM. From the code, I think the AM gets this 
> message and should reply.






[jira] [Assigned] (SPARK-11298) When driver sends message "GetExecutorLossReason" to AM, the AM stops.

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11298:


Assignee: (was: Apache Spark)

> When driver sends message "GetExecutorLossReason" to AM, the AM stops.
> --
>
> Key: SPARK-11298
> URL: https://issues.apache.org/jira/browse/SPARK-11298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0, 2+
>
> Attachments: driver.log
>
>
> I got the latest code from GitHub and just ran "bin/spark-shell --master yarn 
> --conf spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=1 --conf 
> spark.shuffle.service.enabled=true". There is an error:
> 15/10/25 12:11:02 ERROR TransportChannelHandler: Connection to 
> /9.96.1.113:35066 has been quiet for 12 ms while there are outstanding 
> requests. Assuming connection is dead; please adjust spark.network.timeout if 
> this is wrong.
> 15/10/25 12:11:02 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from vm113/9.96.1.113:35066 is closed
> 15/10/25 12:11:02 WARN NettyRpcEndpointRef: Ignore message 
> Failure(java.io.IOException: Connection from vm113/9.96.1.113:35066 closed)
> 15/10/25 12:11:02 ERROR YarnScheduler: Lost executor 1 on vm111: Slave lost
> From the log, the error appears when the driver sends the message 
> "GetExecutorLossReason" to the AM. From the code, I think the AM gets this 
> message and should reply.






[jira] [Commented] (SPARK-11299) SQL Programming Guide's link to DataFrame Function Reference is wrong

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973108#comment-14973108
 ] 

Apache Spark commented on SPARK-11299:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9269

> SQL Programming Guide's link to DataFrame Function Reference is wrong
> -
>
> Key: SPARK-11299
> URL: https://issues.apache.org/jira/browse/SPARK-11299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Trivial
>
> The SQL Programming Guide's link to the DataFrame Functions Reference points 
> to the wrong location: it points to the docs for DataFrame itself, not the 
> functions package. 






[jira] [Assigned] (SPARK-11299) SQL Programming Guide's link to DataFrame Function Reference is wrong

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11299:


Assignee: Apache Spark  (was: Josh Rosen)

> SQL Programming Guide's link to DataFrame Function Reference is wrong
> -
>
> Key: SPARK-11299
> URL: https://issues.apache.org/jira/browse/SPARK-11299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Trivial
>
> The SQL Programming Guide's link to the DataFrame Functions Reference points 
> to the wrong location: it points to the docs for DataFrame itself, not the 
> functions package. 






[jira] [Assigned] (SPARK-11299) SQL Programming Guide's link to DataFrame Function Reference is wrong

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11299:


Assignee: Josh Rosen  (was: Apache Spark)

> SQL Programming Guide's link to DataFrame Function Reference is wrong
> -
>
> Key: SPARK-11299
> URL: https://issues.apache.org/jira/browse/SPARK-11299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Trivial
>
> The SQL Programming Guide's link to the DataFrame Functions Reference points 
> to the wrong location: it points to the docs for DataFrame itself, not the 
> functions package. 






[jira] [Commented] (SPARK-9162) Implement code generation for ScalaUDF

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973130#comment-14973130
 ] 

Apache Spark commented on SPARK-9162:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9270

> Implement code generation for ScalaUDF
> --
>
> Key: SPARK-9162
> URL: https://issues.apache.org/jira/browse/SPARK-9162
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-11239) PMML export for ML linear regression

2015-10-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973214#comment-14973214
 ] 

Kai Sasaki commented on SPARK-11239:


[~holdenk] Hi, are the tickets under SPARK-11171 blocked by SPARK-11241?

> PMML export for ML linear regression
> 
>
> Key: SPARK-11239
> URL: https://issues.apache.org/jira/browse/SPARK-11239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: holdenk
>
> Add PMML export for linear regression models from the ML pipeline.






[jira] [Commented] (SPARK-10386) Model import/export for PrefixSpan

2015-10-25 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973234#comment-14973234
 ] 

Yanbo Liang commented on SPARK-10386:
-

This partly depends on SPARK-6724, for which we need to figure out the best way 
to resolve the *Item* type problem when loading the model.

> Model import/export for PrefixSpan
> --
>
> Key: SPARK-10386
> URL: https://issues.apache.org/jira/browse/SPARK-10386
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> Support save/load for PrefixSpanModel. Should be similar to save/load for 
> FPGrowth.






[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth

2015-10-25 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973222#comment-14973222
 ] 

Yanbo Liang edited comment on SPARK-6724 at 10/25/15 12:51 PM:
---

[~josephkb] Now we can save FPGrowthModel with arbitrary item types by leveraging 
*ScalaReflection.schemaFor*. However, when we load the model we have to get the 
*Item* type first, and there is no existing API to map between DataFrame types 
and Scala types, so I defined this mapping in a function. I think this is only 
a workaround and we should figure out a better way. Looking forward to your comments.


was (Author: yanboliang):
[~josephkb] Now we can save FPGrowthModel with arbitrary item types by leveraging 
{ScalaReflection.schemaFor}. However, when we load the model we have to get the 
{Item} type first, and there is no existing API to map between DataFrame types 
and Scala types, so I defined this mapping in a function. I think this is only 
a workaround and we should figure out a better way. Looking forward to your comments.

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API






[jira] [Commented] (SPARK-6333) saveAsObjectFile support for compression codec

2015-10-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973232#comment-14973232
 ] 

Maciej Bryński commented on SPARK-6333:
---

[~srowen]
I'd very much like to have this functionality.
Does "Resolution: Won't Fix" mean this option will never be in Spark, or are you 
waiting for a PR?



> saveAsObjectFile support for compression codec
> --
>
> Key: SPARK-6333
> URL: https://issues.apache.org/jira/browse/SPARK-6333
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Deenar Toraskar
>Priority: Minor
>
> saveAsObjectFile current does not support a compression codec.  This story is 
> about adding saveAsObjectFile (path, codec) support into spark.
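For illustration only, here is a minimal Scala sketch of a user-side workaround, 
assuming Java-serializable records and an arbitrary choice of GzipCodec: it mirrors 
the layout saveAsObjectFile produces (batches of records stored as 
NullWritable/BytesWritable pairs in a SequenceFile) but passes a codec to 
saveAsSequenceFile. The helper names are made up for the example.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import scala.reflect.ClassTag

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD

object CompressedObjectFile {
  // Plain Java serialization, analogous to what saveAsObjectFile uses internally.
  private def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  // Same on-disk layout as saveAsObjectFile (batches of 10 records per
  // SequenceFile value), but with a compression codec supplied explicitly.
  def save[T: ClassTag](rdd: RDD[T], path: String): Unit = {
    rdd.mapPartitions(_.grouped(10).map(_.toArray))
      .map(batch => (NullWritable.get(), new BytesWritable(serialize(batch))))
      .saveAsSequenceFile(path, Some(classOf[GzipCodec]))
  }
}
{code}

Since SequenceFile decompression is transparent on read, sc.objectFile should still 
be able to read data written this way.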






[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-10-25 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973222#comment-14973222
 ] 

Yanbo Liang commented on SPARK-6724:


[~josephkb] Now we can save FPGrowthModel with arbitrary item types by leveraging 
{ScalaReflection.schemaFor}. However, when we load the model we have to get the 
{Item} type first, and there is no existing API to map between DataFrame types 
and Scala types, so I defined this mapping in a function. I think this is only 
a workaround and we should figure out a better way. Looking forward to your comments.
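For readers following along, a minimal sketch (not the actual PR) of the kind of 
DataType-to-runtime-type map described above, assuming only a handful of primitive 
item types need to be covered:

{code}
import org.apache.spark.sql.types._

// Map the Catalyst DataType stored with the model back to a runtime class
// for the frequent-itemset Item type. Anything not listed here would have
// to be added explicitly; this is only an illustration of the workaround.
def itemClassFor(dataType: DataType): Class[_] = dataType match {
  case StringType  => classOf[String]
  case IntegerType => classOf[java.lang.Integer]
  case LongType    => classOf[java.lang.Long]
  case DoubleType  => classOf[java.lang.Double]
  case BooleanType => classOf[java.lang.Boolean]
  case other       => throw new UnsupportedOperationException(
    s"Unsupported item type: $other")
}
{code}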

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API






[jira] [Updated] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted to query.

2015-10-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10562:
---
Description: 
When using DataFrame.write.partitionBy().saveAsTable() it creates the 
partition-by columns in all lowercase in the metastore. However, it writes the 
data to the filesystem using mixed case.

This causes an error when running a select against the table.
{noformat}
from pyspark.sql import Row

# Create a data frame with mixed case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
   Row(Name="Frank Lampard", Goals=15, Year=2012)])

myDF = sqlContext.createDataFrame(myRDD)

# Write this data out to a parquet file and partition by the Year (which is a 
mixedCase name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")

%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"

# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
{noformat}

{code:sql}
%sql -- Now try to run a query against this table
select * from chelsea_goals
{code}

{noformat}
Error in SQL statement: UncheckedExecutionException: 
java.lang.RuntimeException: Partition column year not found in schema 
StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), 
StructField(Year,LongType,true))
{noformat}

{noformat}
# Now lets try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
 Row(Name="Frank Lampard", Goals=15, year=2012)])

myDF2 = sqlContext.createDataFrame(myRDD2)

myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
{noformat}

{code:sql}
%sql select * from chelsea_goals2;
--Now everything works
{code}

  was:
When using DataFrame.write.partitionBy().saveAsTable() it creates the 
partition-by columns in all lowercase in the metastore. However, it writes the 
data to the filesystem using mixed case.

This causes an error when running a select against the table.
--
from pyspark.sql import Row

# Create a data frame with mixed case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
   Row(Name="Frank Lampard", Goals=15, Year=2012)])

myDF = sqlContext.createDataFrame(myRDD)

# Write this data out to a parquet file and partition by the Year (which is a 
mixedCase name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")

%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"

# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))

%sql
--Now try to run a query against this table
select * from chelsea_goals

Error in SQL statement: UncheckedExecutionException: 
java.lang.RuntimeException: Partition column year not found in schema 
StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), 
StructField(Year,LongType,true))

# Now lets try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
 Row(Name="Frank Lampard", Goals=15, year=2012)])

myDF2 = sqlContext.createDataFrame(myRDD2)

myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")

%sql select * from chelsea_goals2;
--Now everything works





> .partitionBy() creates the metastore partition columns in all lowercase, but 
> persists the data path as MixedCase resulting in an error when the data is 
> later attempted to query.
> -
>
> Key: SPARK-10562
> URL: https://issues.apache.org/jira/browse/SPARK-10562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jason Pohl
>Assignee: Wenchen Fan
> Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable() it creates the 
> partition-by columns in all lowercase in the metastore. However, it writes 
> the data to the filesystem using mixed case.
> This causes an error when running a select against the table.
> {noformat}
> from pyspark.sql import Row
> # Create a data frame with mixed case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year (which is a 
> mixedCase name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> 

[jira] [Created] (SPARK-11300) Support for string length when writing to JDBC

2015-10-25 Thread JIRA
Maciej Bryński created SPARK-11300:
--

 Summary: Support for string length when writing to JDBC
 Key: SPARK-11300
 URL: https://issues.apache.org/jira/browse/SPARK-11300
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.1
Reporter: Maciej Bryński


Right now every StringType field is written to JDBC as TEXT.
I'd like to have an option to write it as VARCHAR(size).
Maybe we could use StringType(size)?
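Until such an option exists, one possible workaround is sketched below using the 
JdbcDialect developer API; the MySQL URL prefix and the 255-character limit are 
arbitrary assumptions for the example.

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Register a dialect that maps StringType to a bounded VARCHAR instead of TEXT.
object VarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", java.sql.Types.VARCHAR))
    case _          => None
  }
}

JdbcDialects.registerDialect(VarcharDialect)
{code}

The limitation, and the reason a per-field size such as StringType(size) would 
still help, is that a dialect can only pick one definition for every string column.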










[jira] [Commented] (SPARK-11234) What's cooking classification

2015-10-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973221#comment-14973221
 ] 

Kai Sasaki commented on SPARK-11234:


[~xusen] Thank you so much for the very insightful experiments!

{quote}
4. The evaluator forces me to select a metric method. But sometimes I want to 
see all the evaluation results, say F1, precision-recall, AUC, etc.
{quote}

Yes, I agree with you. In the initial phase of running a machine learning 
algorithm, we often don't yet know which metrics we should look at.

{quote}
5. ML transformers will get stuck when facing with Int type. It's strange that 
we have to transform all Int values to double values before hand. I think a 
wise auto casting is helpful.
{quote}
Which kind of Transformer got stuck? Do you mean that the first transformer 
cannot handle int input values?
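For reference, a small sketch of the manual casting workaround being discussed, 
assuming the goal is simply to turn every integer column into a double column 
before handing the DataFrame to a transformer:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Cast every IntegerType column to DoubleType, leaving other columns untouched.
def castIntColumnsToDouble(df: DataFrame): DataFrame = {
  df.schema.fields.foldLeft(df) { (cur, field) =>
    if (field.dataType == IntegerType) {
      cur.withColumn(field.name, col(field.name).cast(DoubleType))
    } else {
      cur
    }
  }
}
{code}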


> What's cooking classification
> -
>
> Key: SPARK-11234
> URL: https://issues.apache.org/jira/browse/SPARK-11234
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>
> I added this subtask to post the work on this dataset:  
> https://www.kaggle.com/c/whats-cooking






[jira] [Comment Edited] (SPARK-10386) Model import/export for PrefixSpan

2015-10-25 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973234#comment-14973234
 ] 

Yanbo Liang edited comment on SPARK-10386 at 10/25/15 1:21 PM:
---

This partly depends on SPARK-6724, for which we need to figure out the best way 
to resolve the *Item* type problem when loading the model. Please feel free to 
comment on my PR for SPARK-6724 and join the discussion.


was (Author: yanboliang):
This partly depends on SPARK-6724, for which we need to figure out the best way 
to resolve the *Item* type problem when loading the model.

> Model import/export for PrefixSpan
> --
>
> Key: SPARK-10386
> URL: https://issues.apache.org/jira/browse/SPARK-10386
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> Support save/load for PrefixSpanModel. Should be similar to save/load for 
> FPGrowth.






[jira] [Resolved] (SPARK-10891) Add MessageHandler to KinesisUtils.createStream similar to Direct Kafka

2015-10-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10891.
---
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 1.6.0

> Add MessageHandler to KinesisUtils.createStream similar to Direct Kafka
> ---
>
> Key: SPARK-10891
> URL: https://issues.apache.org/jira/browse/SPARK-10891
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.6.0
>
>
> There is support for a message handler in the Direct Kafka Stream, which allows 
> an arbitrary T to be the output of the stream instead of Array[Byte]. This is a 
> very useful feature and should therefore exist in Kinesis as well.
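For context, a sketch of the Kafka direct API being referred to, where the 
messageHandler lets the stream emit an arbitrary type (here a (topic, value) pair) 
instead of raw bytes; broker address, topic, and offsets are placeholders.

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def topicTaggedStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
  val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)
  // The handler decides what each record becomes in the resulting DStream.
  val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets, handler)
}
{code}

The request here is for KinesisUtils.createStream to accept an equivalent handler 
so the Kinesis stream can produce something other than Array[Byte].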






[jira] [Assigned] (SPARK-11306) Executor JVM loss can lead to a hang in Standalone mode

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11306:


Assignee: Kay Ousterhout  (was: Apache Spark)

> Executor JVM loss can lead to a hang in Standalone mode
> ---
>
> Key: SPARK-11306
> URL: https://issues.apache.org/jira/browse/SPARK-11306
> Project: Spark
>  Issue Type: Bug
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>
> This commit: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
>  introduced a bug where, in Standalone mode, if a task fails and crashes the 
> JVM, the failure is considered a "normal failure" (meaning it's considered 
> unrelated to the task), so the failure isn't counted against the task's 
> maximum number of failures: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
>   As a result, if a task fails in a way that results in it crashing the JVM, 
> it will continuously be re-launched, resulting in a hang.
> Unfortunately this issue is difficult to reproduce because of a race 
> condition where we have multiple code paths that are used to handle executor 
> losses, and in the setup I'm using, Akka's notification that the executor was 
> lost always gets to the TaskSchedulerImpl first, so the task eventually gets 
> killed (see my recent email to the dev list).






[jira] [Assigned] (SPARK-11306) Executor JVM loss can lead to a hang in Standalone mode

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11306:


Assignee: Apache Spark  (was: Kay Ousterhout)

> Executor JVM loss can lead to a hang in Standalone mode
> ---
>
> Key: SPARK-11306
> URL: https://issues.apache.org/jira/browse/SPARK-11306
> Project: Spark
>  Issue Type: Bug
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>
> This commit: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
>  introduced a bug where, in Standalone mode, if a task fails and crashes the 
> JVM, the failure is considered a "normal failure" (meaning it's considered 
> unrelated to the task), so the failure isn't counted against the task's 
> maximum number of failures: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
>   As a result, if a task fails in a way that results in it crashing the JVM, 
> it will continuously be re-launched, resulting in a hang.
> Unfortunately this issue is difficult to reproduce because of a race 
> condition where we have multiple code paths that are used to handle executor 
> losses, and in the setup I'm using, Akka's notification that the executor was 
> lost always gets to the TaskSchedulerImpl first, so the task eventually gets 
> killed (see my recent email to the dev list).






[jira] [Commented] (SPARK-7106) Support model save/load in Python's FPGrowth

2015-10-25 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973525#comment-14973525
 ] 

Kai Jiang commented on SPARK-7106:
--

I would like to do this one after spark-6724 is done.

> Support model save/load in Python's FPGrowth
> 
>
> Key: SPARK-7106
> URL: https://issues.apache.org/jira/browse/SPARK-7106
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>







[jira] [Comment Edited] (SPARK-8890) Reduce memory consumption for dynamic partition insert

2015-10-25 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973529#comment-14973529
 ] 

Jerry Lam edited comment on SPARK-8890 at 10/26/15 1:02 AM:


Hi guys, sorry for adding comments to this closed JIRA. I just want to point 
out that I'm using Spark 1.5.1 and I got an OOM on the driver side after all 
partitions were written out (I have over 1 million partitions). The job was 
marked SUCCESS in the output folder, but the driver used significant CPU and 
memory, and after several hours it died with an OOM. I had already configured 
the driver to use 6 GB. So not only do the executors use a lot more memory, 
but the driver does as well. The use case is meaningful because we want to 
partition events by customer: we have more than 1 million customers and we 
want to do per-customer analysis, so we need a quick way to identify a 
customer's events instead of filtering for them every time. Any help would be 
greatly appreciated.

The jstack of the process is as follows:
{code}
Thread 528: (state = BLOCKED)
 - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled frame)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130 
(Compiled frame)
 - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12, 
line=114 (Compiled frame)
 - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19, line=415 
(Compiled frame)
 - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132 (Compiled 
frame)
 - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
 @bci=4, line=447 (Compiled frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
 @bci=5, line=447 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) 
@bci=9, line=244 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) 
@bci=2, line=244 (Compiled frame)
 - 
scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
 scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1) @bci=2, 
line=108 (Compiled frame)
 - scala.collection.TraversableLike$class.map(scala.collection.TraversableLike, 
scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244 
(Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1, 
scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
 @bci=279, line=447 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh() 
@bci=8, line=453 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
 @bci=26, line=465 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache()
 @bci=12, line=463 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1, line=540 
(Interpreted frame)
 - org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh() 
@bci=1, line=204 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp()
 @bci=392, line=152 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
 @bci=1, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
 @bci=1, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext,
 org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96, 
line=56 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(org.apache.spark.sql.SQLContext)
 @bci=718, line=108 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute() 
@bci=20, line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult() @bci=15, 
line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.doExecute() @bci=12, line=69 
(Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=11, 
line=140 (Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=1, 
line=138 (Interpreted frame)
 - 
org.apache.spark.rdd.RDDOperationScope$.withScope(org.apache.spark.SparkContext,
 java.lang.String, boolean, boolean, 

[jira] [Commented] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable

2015-10-25 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973530#comment-14973530
 ] 

Sun Rui commented on SPARK-10500:
-

Yes, I am working on this.

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with 
> an R application, which fails if Spark has been installed into a directory to 
> which the current user doesn't have write permissions. (e.g., on EMR's 
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only 
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That 
> would enable sparkr.zip to be pre-created by the root user and then reused 
> each time spark-submit is run, which I believe is similar to how pyspark 
> works.
> Another option would be to make the location configurable, as it's currently 
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to 
> something like the user's home directory or a random path in /tmp would get 
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every 
> time a new R application is launched? This seems unnecessary and inefficient, 
> unless you are actively developing the SparkR libraries and expect the 
> contents of sparkr.zip to change.
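A minimal sketch of the skip-if-it-already-exists idea from the description; 
buildArchive stands in for whatever internal helper actually packages the R 
libraries, so its signature here is hypothetical.

{code}
import java.io.File

// Only build sparkr.zip when it is missing, so a pre-created (e.g. root-owned)
// archive can be reused by users without write access to $SPARK_HOME/R/lib.
def ensureSparkRZip(sparkHome: String, buildArchive: File => Unit): Unit = {
  val sparkrZip = new File(new File(sparkHome, "R/lib"), "sparkr.zip")
  if (!sparkrZip.exists()) {
    buildArchive(sparkrZip)
  }
}
{code}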






[jira] [Resolved] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version

2015-10-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-11127.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Upgrade Kinesis Client Library to the latest stable version
> ---
>
> Key: SPARK-11127
> URL: https://issues.apache.org/jira/browse/SPARK-11127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with the 
> Kinesis Producer Library (KPL) and supports auto de-aggregation. It would be 
> great to upgrade KCL to the latest stable version.
> Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with 
> dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See 
> https://github.com/awslabs/amazon-kinesis-client#release-notes.
> [~tdas] [~brkyvz] Please recommend a version for upgrade.






[jira] [Resolved] (SPARK-11304) SparkR in yarn-client mode fails creating sparkr.zip

2015-10-25 Thread Ram Venkatesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Venkatesh resolved SPARK-11304.
---
Resolution: Duplicate

Same as SPARK-10500 

> SparkR in yarn-client mode fails creating sparkr.zip
> 
>
> Key: SPARK-11304
> URL: https://issues.apache.org/jira/browse/SPARK-11304
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Ram Venkatesh
>
> If you run sparkR in yarn-client mode and the spark installation directory is 
> not writable by the current user, it fails with
> Exception in thread "main" java.io.FileNotFoundException:
> /usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
> at java.io.FileOutputStream.open0(Native Method)
> at java.io.FileOutputStream.open(FileOutputStream.java:270)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
> at
> org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
> at
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> The behavior is the same with the pre-built spark-1.5.1-bin-hadoop2.6
> bits also.
> We need to either use an existing sparkr.zip if we find one in the R/lib 
> directory, or create the file in a location accessible to the submitting user.
> Temporary hack workaround - create a world-writable file called sparkr.zip 
> under R/lib. It will still fail if multiple users submit jobs at the same 
> time.






[jira] [Commented] (SPARK-5737) Scanning duplicate columns from parquet table

2015-10-25 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973523#comment-14973523
 ] 

Kevin Jung commented on SPARK-5737:
---

Based on your comment, it should be marked as resolved. Thanks.

> Scanning duplicate columns from parquet table
> -
>
> Key: SPARK-5737
> URL: https://issues.apache.org/jira/browse/SPARK-5737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Kevin Jung
> Fix For: 1.5.1
>
>
> {quote}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {quote}
> The results of above code have null values at the preceding columns of 
> duplicate two.
> For example,
> {quote}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {quote}
> This happens only in ParquetTableScan. PhysicalRDD works fine and the rows 
> have duplicate values like...
> {quote}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {quote}






[jira] [Comment Edited] (SPARK-8890) Reduce memory consumption for dynamic partition insert

2015-10-25 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973529#comment-14973529
 ] 

Jerry Lam edited comment on SPARK-8890 at 10/26/15 12:58 AM:
-

Hi guys, sorry for adding comments to this closed JIRA. I just want to point 
out that I'm using Spark 1.5.1 and I got an OOM on the driver side after all 
partitions were written out (I have over 1 million partitions). The job was 
marked SUCCESS in the output folder, but the driver used significant CPU and 
memory, and after several hours it died with an OOM. I had already configured 
the driver to use 6 GB. The jstack of the process is as follows:
{code}
Thread 528: (state = BLOCKED)
 - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled frame)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130 
(Compiled frame)
 - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12, 
line=114 (Compiled frame)
 - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19, line=415 
(Compiled frame)
 - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132 (Compiled 
frame)
 - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
 @bci=4, line=447 (Compiled frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
 @bci=5, line=447 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) 
@bci=9, line=244 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) 
@bci=2, line=244 (Compiled frame)
 - 
scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
 scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1) @bci=2, 
line=108 (Compiled frame)
 - scala.collection.TraversableLike$class.map(scala.collection.TraversableLike, 
scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244 
(Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1, 
scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
 @bci=279, line=447 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh() 
@bci=8, line=453 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
 @bci=26, line=465 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache()
 @bci=12, line=463 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1, line=540 
(Interpreted frame)
 - org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh() 
@bci=1, line=204 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp()
 @bci=392, line=152 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
 @bci=1, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
 @bci=1, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext,
 org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96, 
line=56 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(org.apache.spark.sql.SQLContext)
 @bci=718, line=108 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute() 
@bci=20, line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult() @bci=15, 
line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.doExecute() @bci=12, line=69 
(Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=11, 
line=140 (Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=1, 
line=138 (Interpreted frame)
 - 
org.apache.spark.rdd.RDDOperationScope$.withScope(org.apache.spark.SparkContext,
 java.lang.String, boolean, boolean, scala.Function0) @bci=131, line=147 
(Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan.execute() @bci=189, line=138 
(Interpreted frame)
 - org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute() @bci=21, 
line=933 (Interpreted frame)
 - org.apache.spark.sql.SQLContext$QueryExecution.toRdd() @bci=13, line=933 
(Interpreted frame)
 - 

[jira] [Assigned] (SPARK-11307) Reduce memory consumption of OutputCommitCoordinator bookkeeping structures

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11307:


Assignee: Apache Spark  (was: Josh Rosen)

> Reduce memory consumption of OutputCommitCoordinator bookkeeping structures
> ---
>
> Key: SPARK-11307
> URL: https://issues.apache.org/jira/browse/SPARK-11307
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> OutputCommitCoordinator uses a map in a place where an array would suffice, 
> increasing its memory consumption for result stages with millions of tasks.
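A generic sketch of the kind of change described, not the actual patch: when the 
keys are partition ids in [0, numPartitions), a pre-sized array can replace a 
per-stage hash map.

{code}
// Before (conceptually): mutable.HashMap[PartitionId, AttemptNumber]
// After: an array indexed by partition id, with -1 meaning "no committer yet".
class StageCommitState(numPartitions: Int) {
  private val authorizedCommitters = Array.fill(numPartitions)(-1)

  def hasCommitter(partition: Int): Boolean = authorizedCommitters(partition) != -1

  // Returns true if this attempt wins the right to commit the partition.
  def authorize(partition: Int, attemptNumber: Int): Boolean = {
    if (hasCommitter(partition)) {
      false
    } else {
      authorizedCommitters(partition) = attemptNumber
      true
    }
  }
}
{code}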






[jira] [Commented] (SPARK-11307) Reduce memory consumption of OutputCommitCoordinator bookkeeping structures

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973564#comment-14973564
 ] 

Apache Spark commented on SPARK-11307:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9274

> Reduce memory consumption of OutputCommitCoordinator bookkeeping structures
> ---
>
> Key: SPARK-11307
> URL: https://issues.apache.org/jira/browse/SPARK-11307
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> OutputCommitCoordinator uses a map in a place where an array would suffice, 
> increasing its memory consumption for result stages with millions of tasks.






[jira] [Assigned] (SPARK-11307) Reduce memory consumption of OutputCommitCoordinator bookkeeping structures

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11307:


Assignee: Josh Rosen  (was: Apache Spark)

> Reduce memory consumption of OutputCommitCoordinator bookkeeping structures
> ---
>
> Key: SPARK-11307
> URL: https://issues.apache.org/jira/browse/SPARK-11307
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> OutputCommitCoordinator uses a map in a place where an array would suffice, 
> increasing its memory consumption for result stages with millions of tasks.






[jira] [Assigned] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10286:


Assignee: (was: Apache Spark)

> Add @since annotation to pyspark.ml.param and pyspark.ml.*
> --
>
> Key: SPARK-10286
> URL: https://issues.apache.org/jira/browse/SPARK-10286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>







[jira] [Commented] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973651#comment-14973651
 ] 

Apache Spark commented on SPARK-10286:
--

User 'lidinghao' has created a pull request for this issue:
https://github.com/apache/spark/pull/9275

> Add @since annotation to pyspark.ml.param and pyspark.ml.*
> --
>
> Key: SPARK-10286
> URL: https://issues.apache.org/jira/browse/SPARK-10286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>







[jira] [Created] (SPARK-11308) Change spark streaming's job scheduler logic to ensure guaranteed order of batch processing

2015-10-25 Thread Renjie Liu (JIRA)
Renjie Liu created SPARK-11308:
--

 Summary: Change spark streaming's job scheduler logic to ensure 
guaranteed order of batch processing
 Key: SPARK-11308
 URL: https://issues.apache.org/jira/browse/SPARK-11308
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.5.1
Reporter: Renjie Liu
Priority: Minor


In the current implementation, Spark Streaming uses a thread pool to run the 
jobs generated in each time interval, and their order is not guaranteed: if the 
jobs generated at time 1 take longer than the batch duration, the jobs for time 
2 will begin to execute, and the finish order is not guaranteed. This behavior 
is not very useful in practice since it can cost much more storage. For example, 
to make a word count in Spark Streaming accurate we need to store the records 
for each batch in the database rather than just the word counts. But if the 
processing order of the batches is guaranteed, we only need to store the last 
update time together with the word counts to be idempotent. Simply setting the 
thread pool size to 1 may make the system inefficient when there is more than 
one output stream. This feature can be implemented by giving each output stream 
its own thread so that the jobs of each output stream are executed in order.
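A generic sketch of the proposed policy, not Spark's actual JobScheduler: one 
single-threaded executor per output stream keeps that stream's jobs in batch 
order, while different output streams still run concurrently.

{code}
import java.util.concurrent.{ExecutorService, Executors}

class PerStreamScheduler {
  private val executors = scala.collection.mutable.Map.empty[Int, ExecutorService]

  // One single-threaded executor per output stream id, created lazily.
  private def executorFor(outputStreamId: Int): ExecutorService = synchronized {
    executors.getOrElseUpdate(outputStreamId, Executors.newSingleThreadExecutor())
  }

  // Jobs submitted for the same output stream run strictly in submission order;
  // jobs for different output streams run in parallel.
  def submit(outputStreamId: Int, job: Runnable): Unit =
    executorFor(outputStreamId).execute(job)

  def shutdown(): Unit = synchronized { executors.values.foreach(_.shutdown()) }
}
{code}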






[jira] [Assigned] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10286:


Assignee: Apache Spark

> Add @since annotation to pyspark.ml.param and pyspark.ml.*
> --
>
> Key: SPARK-10286
> URL: https://issues.apache.org/jira/browse/SPARK-10286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>







[jira] [Commented] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973493#comment-14973493
 ] 

Patrick Wendell commented on SPARK-11305:
-

/cc [~srowen] for his thoughts.

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Priority: Critical
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are three sections:
> 1. Compile time Hadoop version - this information I think can be removed in 
> favor of that on the "building spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-25 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-11305:
---

 Summary: Remove Third-Party Hadoop Distributions Doc Page
 Key: SPARK-11305
 URL: https://issues.apache.org/jira/browse/SPARK-11305
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Priority: Critical


There is a fairly old page in our docs that contains a bunch of assorted 
information regarding running Spark on Hadoop clusters. I think this page 
should be removed and merged into other parts of the docs because the 
information is largely redundant and somewhat outdated.

http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html

There are three sections:

1. Compile time Hadoop version - this information I think can be removed in 
favor of that on the "building spark" page. These days most "advanced users" 
are building without bundling Hadoop, so I'm not sure giving them a bunch of 
different Hadoop versions sends the right message.

2. Linking against Hadoop - this doesn't seem to add much beyond what is in the 
programming guide.

3. Where to run Spark - redundant with the hardware provisioning guide.

4. Inheriting cluster configurations - I think this would be better as a 
section at the end of the configuration page. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11306) Executor JVM loss can lead to a hang in Standalone mode

2015-10-25 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-11306:
--

 Summary: Executor JVM loss can lead to a hang in Standalone mode
 Key: SPARK-11306
 URL: https://issues.apache.org/jira/browse/SPARK-11306
 Project: Spark
  Issue Type: Bug
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout


This commit: 
https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 
introduced a bug where, in Standalone mode, if a task fails and crashes the 
JVM, the failure is considered a "normal failure" (meaning it's considered 
unrelated to the task), so the failure isn't counted against the task's maximum 
number of failures: 
https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
  As a result, if a task fails in a way that results in it crashing the JVM, it 
will continuously be re-launched, resulting in a hang.

Unfortunately this issue is difficult to reproduce because of a race condition 
where we have multiple code paths that are used to handle executor losses, and 
in the setup I'm using, Akka's notification that the executor was lost always 
gets to the TaskSchedulerImpl first, so the task eventually gets killed (see my 
recent email to the dev list).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5737) Scanning duplicate columns from parquet table

2015-10-25 Thread Kevin Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Jung resolved SPARK-5737.
---
   Resolution: Fixed
Fix Version/s: 1.5.1

> Scanning duplicate columns from parquet table
> -
>
> Key: SPARK-5737
> URL: https://issues.apache.org/jira/browse/SPARK-5737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Kevin Jung
> Fix For: 1.5.1
>
>
> {quote}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {quote}
> The results of the above code have null values in the first of each pair of 
> duplicated columns.
> For example,
> {quote}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {quote}
> This happens only in ParquetTableScan. PhysicalRDD works fine and the rows 
> have duplicate values like...
> {quote}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11307) Reduce memory consumption of OutputCommitCoordinator bookkeeping structures

2015-10-25 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11307:
--

 Summary: Reduce memory consumption of OutputCommitCoordinator 
bookkeeping structures
 Key: SPARK-11307
 URL: https://issues.apache.org/jira/browse/SPARK-11307
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Josh Rosen
Assignee: Josh Rosen


OutputCommitCoordinator uses a map in a place where an array would suffice, 
increasing its memory consumption for result stages with millions of tasks.
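
A hedged sketch of the kind of change (simplified; these are not the actual 
OutputCommitCoordinator fields). Partition ids within a stage are dense in 
[0, numPartitions), so a flat array can replace a per-partition hash map:

{code}
import scala.collection.mutable

// Before (sketch): one hash-map entry per partition that attempts a commit.
val authorizedCommittersMap = mutable.HashMap.empty[Int, Long]  // partitionId -> task attempt

// After (sketch): an array indexed by partition id avoids per-entry boxing and
// hash-table overhead for stages with millions of tasks; -1 means "no committer yet".
val numPartitions = 4  // assumption for the example; the real value comes from the stage
val authorizedCommitters = Array.fill[Long](numPartitions)(-1L)
{code}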




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable

2015-10-25 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-11294:
--
Fix Version/s: (was: 1.5.2)
   1.5.3

> Improve R doc for read.df, write.df, saveAsTable
> 
>
> Key: SPARK-11294
> URL: https://issues.apache.org/jira/browse/SPARK-11294
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 1.5.3, 1.6.0
>
>
> API doc lacks example and has several formatting issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11300) Support for string length when writing to JDBC

2015-10-25 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973707#comment-14973707
 ] 

Josh Rosen commented on SPARK-11300:


I think that this duplicates SPARK-10101

> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType field is written to JDBC as TEXT.
> I'd like to have option to write it as VARCHAR(size).
> Maybe we could use StringType(size) ?
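
Until something like {{StringType(size)}} exists, one hedged workaround sketch is a 
custom JdbcDialect that maps StringType to a VARCHAR of a user-chosen length (the 
MySQL URL prefix and the fixed length 255 below are assumptions):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch: write every StringType column as VARCHAR(255) for MySQL URLs.
// Per-column lengths would still need an API such as the proposed StringType(size).
object VarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case _          => None
  }
}

// Register before calling df.write.jdbc(...) (spark-shell style).
JdbcDialects.registerDialect(VarcharDialect)
{code}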



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable

2015-10-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973444#comment-14973444
 ] 

Felix Cheung commented on SPARK-10500:
--

[~sunrui], your suggestion from 08/Sept makes sense. Would you like to work on 
this? Otherwise I could take a shot.

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with 
> an R application, which fails if Spark has been installed into a directory to 
> which the current user doesn't have write permissions. (e.g., on EMR's 
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only 
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That 
> would enable sparkr.zip to be pre-created by the root user and then reused 
> each time spark-submit is run, which I believe is similar to how pyspark 
> works.
> Another option would be to make the location configurable, as it's currently 
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to 
> something like the user's home directory or a random path in /tmp would get 
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every 
> time a new R application is launched? This seems unnecessary and inefficient, 
> unless you are actively developing the SparkR libraries and expect the 
> contents of sparkr.zip to change.
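
A minimal sketch of the skip-if-exists / configurable-location idea (illustrative 
only; the real change would live in the spark-submit code path that calls 
zipRLibraries):

{code}
import java.io.File

// Sketch: reuse an existing sparkr.zip instead of re-creating it on every
// spark-submit, and otherwise fall back to a user-writable location.
def resolveSparkRZip(sparkHome: String): File = {
  val installed = new File(s"$sparkHome/R/lib/sparkr.zip")
  if (installed.exists()) {
    installed  // pre-built (e.g. by root): just reuse it
  } else {
    new File(System.getProperty("java.io.tmpdir"), "sparkr.zip")  // build it somewhere writable
  }
}
{code}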



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11306) Executor JVM loss can lead to a hang in Standalone mode

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973506#comment-14973506
 ] 

Apache Spark commented on SPARK-11306:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/9273

> Executor JVM loss can lead to a hang in Standalone mode
> ---
>
> Key: SPARK-11306
> URL: https://issues.apache.org/jira/browse/SPARK-11306
> Project: Spark
>  Issue Type: Bug
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>
> This commit: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
>  introduced a bug where, in Standalone mode, if a task fails and crashes the 
> JVM, the failure is considered a "normal failure" (meaning it's considered 
> unrelated to the task), so the failure isn't counted against the task's 
> maximum number of failures: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
>   As a result, if a task fails in a way that results in it crashing the JVM, 
> it will continuously be re-launched, resulting in a hang.
> Unfortunately this issue is difficult to reproduce because of a race 
> condition where we have multiple code paths that are used to handle executor 
> losses, and in the setup I'm using, Akka's notification that the executor was 
> lost always gets to the TaskSchedulerImpl first, so the task eventually gets 
> killed (see my recent email to the dev list).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973510#comment-14973510
 ] 

Patrick Wendell commented on SPARK-10971:
-

Reynold has sent out the vote email based on the original fix. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes its in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode
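
A hedged sketch of the lookup RRunner could perform instead of assuming Rscript is 
on the PATH ({{spark.sparkr.r.command}} is the existing client-mode property 
mentioned above; doing the same lookup in cluster mode is the part being proposed):

{code}
import org.apache.spark.SparkConf

// Sketch only: resolve the R binary from configuration, falling back to the PATH.
def resolveRscript(conf: SparkConf): String =
  conf.getOption("spark.sparkr.r.command").getOrElse("Rscript")
{code}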



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973510#comment-14973510
 ] 

Patrick Wendell edited comment on SPARK-10971 at 10/26/15 12:02 AM:


Reynold has sent out the vote email based on the tagged commit. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.


was (Author: pwendell):
Reynold has sent out the vote email based on the original fix. Since that vote 
is likely to pass, this patch will probably be in 1.5.3.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes its in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-10971:

Fix Version/s: (was: 1.5.2)
   1.5.3

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Sun Rui
> Fix For: 1.5.3, 1.6.0
>
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes its in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11308) Change spark streaming's job scheduler logic to ensure guaranteed order of batch processing

2015-10-25 Thread Renjie Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renjie Liu updated SPARK-11308:
---
Description: In current implementation, spark streaming uses a thread pool 
to run jobs generated in each time interval and orders are not guaranteed, 
i.e., if jobs generated in time 1 takes time longer than the batch duration, 
jobs 2 will begin to execute and the finish order is not guaranteed. This 
implementation is not quite useful in practice since it may cost much more 
storage. For example, when we do a word count in spark streaming, to be 
accurate we need to store records for each batch rather than just word count in 
database to be idempotent. But if the processing order of each batch is 
guaranteed, we just need to store the last update time with word count in 
database to be idempotent. Just simply set the thread pool size to 1 may cause 
the system to be inefficient when there are more than one output streams.  This 
feature can be implemented by giving each output stream a thread and jobs of 
each output stream are executed in order.  (was: In current implementation, 
spark streaming uses a thread pool to run jobs generated in each time interval 
and orders are not guaranteed, i.e., if jobs generated in time 1 takes time 
longer than the batch duration, jobs 2 will begin to execute and the finish 
order is not guaranteed. This implementation is not quite useful in practice 
since it may cost much more storage. For example, when we do a word count in 
spark streaming, to be accurate we need to store records for each batch rather 
than just word count in database. But if the processing order of each batch is 
guaranteed, we just need to store the last update time with word count in 
database to be idempotent. Just simply set the thread pool size to 1 may cause 
the system to be inefficient when there are more than one output streams.  This 
feature can be implemented by giving each output stream a thread and jobs of 
each output stream are executed in order.)

> Change spark streaming's job scheduler logic to ensure guaranteed order of 
> batch processing
> ---
>
> Key: SPARK-11308
> URL: https://issues.apache.org/jira/browse/SPARK-11308
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Renjie Liu
>Priority: Minor
>
> In current implementation, spark streaming uses a thread pool to run jobs 
> generated in each time interval and orders are not guaranteed, i.e., if jobs 
> generated in time 1 takes time longer than the batch duration, jobs 2 will 
> begin to execute and the finish order is not guaranteed. This implementation 
> is not quite useful in practice since it may cost much more storage. For 
> example, when we do a word count in spark streaming, to be accurate we need 
> to store records for each batch rather than just word count in database to be 
> idempotent. But if the processing order of each batch is guaranteed, we just 
> need to store the last update time with word count in database to be 
> idempotent. Just simply set the thread pool size to 1 may cause the system to 
> be inefficient when there are more than one output streams.  This feature can 
> be implemented by giving each output stream a thread and jobs of each output 
> stream are executed in order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10984) Simplify *MemoryManager class structure

2015-10-25 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10984.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9127
[https://github.com/apache/spark/pull/9127]

> Simplify *MemoryManager class structure
> ---
>
> Key: SPARK-10984
> URL: https://issues.apache.org/jira/browse/SPARK-10984
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> This is a refactoring task.
> After SPARK-10956 gets merged, we will have the following:
> - MemoryManager
> - StaticMemoryManager
> - ExecutorMemoryManager
> - TaskMemoryManager
> - ShuffleMemoryManager
> This is pretty confusing. The goal is to merge ShuffleMemoryManager and 
> ExecutorMemoryManager and move them into the top-level MemoryManager abstract 
> class. Then TaskMemoryManager should be renamed something else and used by 
> MemoryManager, such that the new hierarchy becomes:
> - MemoryManager
> - StaticMemoryManager



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11309) Clean up hacky use of MemoryManager inside of HashedRelation

2015-10-25 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11309:
--

 Summary: Clean up hacky use of MemoryManager inside of 
HashedRelation
 Key: SPARK-11309
 URL: https://issues.apache.org/jira/browse/SPARK-11309
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Josh Rosen


In HashedRelation, there's a hacky creation of a new MemoryManager in order to 
handle broadcasting of BytesToBytesMap: 
https://github.com/apache/spark/blob/85e654c5ec87e666a8845bfd77185c1ea57b268a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L323

Something similar to this has existed for a while, but the code recently became 
much messier as an indirect consequence of my memory manager consolidation 
patch. We should see about cleaning this up and removing the hack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8890) Reduce memory consumption for dynamic partition insert

2015-10-25 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973529#comment-14973529
 ] 

Jerry Lam commented on SPARK-8890:
--

Hi guys, sorry for injecting comments into a closed JIRA. I just want to point 
out that with Spark 1.5.1 I got an OOM on the driver side after all partitions 
were written out (I have over 1 million partitions). The job was marked SUCCESS 
in the output folder, but the driver kept using significant CPU and memory, and 
after several hours it died with an OOM. I had already configured the driver to 
use 6GB. The jstack of the process is as follows:
{code}
Thread 528: (state = BLOCKED)
 - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled frame)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130 
(Compiled frame)
 - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12, 
line=114 (Compiled frame)
 - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19, line=415 
(Compiled frame)
 - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132 (Compiled 
frame)
 - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
 @bci=4, line=447 (Compiled frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
 @bci=5, line=447 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) 
@bci=9, line=244 (Compiled frame)
 - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object) 
@bci=2, line=244 (Compiled frame)
 - 
scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
 scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1) @bci=2, 
line=108 (Compiled frame)
 - scala.collection.TraversableLike$class.map(scala.collection.TraversableLike, 
scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244 
(Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1, 
scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
 @bci=279, line=447 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh() 
@bci=8, line=453 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
 @bci=26, line=465 (Interpreted frame)
 - 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache()
 @bci=12, line=463 (Interpreted frame)
 - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1, line=540 
(Interpreted frame)
 - org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh() 
@bci=1, line=204 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp()
 @bci=392, line=152 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
 @bci=1, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
 @bci=1, line=108 (Interpreted frame)
 - 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext,
 org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96, 
line=56 (Interpreted frame)
 - 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(org.apache.spark.sql.SQLContext)
 @bci=718, line=108 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute() 
@bci=20, line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult() @bci=15, 
line=57 (Interpreted frame)
 - org.apache.spark.sql.execution.ExecutedCommand.doExecute() @bci=12, line=69 
(Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=11, 
line=140 (Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply() @bci=1, 
line=138 (Interpreted frame)
 - 
org.apache.spark.rdd.RDDOperationScope$.withScope(org.apache.spark.SparkContext,
 java.lang.String, boolean, boolean, scala.Function0) @bci=131, line=147 
(Interpreted frame)
 - org.apache.spark.sql.execution.SparkPlan.execute() @bci=189, line=138 
(Interpreted frame)
 - org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute() @bci=21, 
line=933 (Interpreted frame)
 - org.apache.spark.sql.SQLContext$QueryExecution.toRdd() @bci=13, line=933 
(Interpreted frame)
{code}

[jira] [Commented] (SPARK-8597) DataFrame partitionBy memory pressure scales extremely poorly

2015-10-25 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973531#comment-14973531
 ] 

Jerry Lam commented on SPARK-8597:
--

FYI ... the solution described here solves the memory issue in the executors but 
not at the driver. I encountered an OOM at the driver with only 1 million 
partitions generated.

> DataFrame partitionBy memory pressure scales extremely poorly
> -
>
> Key: SPARK-8597
> URL: https://issues.apache.org/jira/browse/SPARK-8597
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Matt Cheah
>Priority: Blocker
> Attachments: table.csv
>
>
> I'm running into a strange memory scaling issue when using the partitionBy 
> feature of DataFrameWriter. 
> I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 
> different entries, with size on disk of about 20kb. There are 32 distinct 
> values for column A and 32 distinct values for column B and all these are 
> combined together (column C will contain a random number for each row - it 
> doesn't matter) producing a 32*32 elements data set. I've imported this into 
> Spark and I ran a partitionBy("A", "B") in order to test its performance. 
> This should create a nested directory structure with 32 folders, each of them 
> containing another 32 folders. It uses about 10Gb of RAM and it's running 
> slow. If I increase the number of entries in the table from 32*32 to 128*128, 
> I get Java Heap Space Out Of Memory no matter what value I use for Heap Space 
> variable.
> Scala code:
> {code}
> var df = sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").load("table.csv") 
> df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet”)
> {code}
> How I ran the Spark shell:
> {code}
> bin/spark-shell --driver-memory 16g --master local[8] --packages 
> com.databricks:spark-csv_2.10:1.0.3
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11234) What's cooking classification

2015-10-25 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973674#comment-14973674
 ] 

Xusen Yin commented on SPARK-11234:
---

The last comment is based on my trial on the Avito dataset 
(https://issues.apache.org/jira/browse/SPARK-10935). It is only the beginning of 
the trial, because Kristina Plazonic is already working on it. See 
https://github.com/yinxusen/incubator-project/blob/master/avito/src/main/scala/org/apache/spark/examples/main.scala#L48.
 I want to load some of the columns as Int type, because they are Int-typed 
variables in the dataset. But when I use the Assembler to assemble these 
columns, at 
https://github.com/yinxusen/incubator-project/blob/master/avito/src/main/scala/org/apache/spark/examples/main.scala#L62,
 Spark throws an exception saying the Int type should be Double. That's why I 
think we should relax the limitation and auto-cast Int/Float to Double.
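
As a hedged workaround sketch under the current behavior (the column names below 
are hypothetical stand-ins for the integer-typed Avito columns), the Int columns 
can be cast to Double before assembling:

{code}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast the Int columns to Double so VectorAssembler accepts them.
def assembleWithCast(df: DataFrame): DataFrame = {
  val casted = df
    .withColumn("position", col("position").cast(DoubleType))
    .withColumn("histCTR", col("histCTR").cast(DoubleType))
  new VectorAssembler()
    .setInputCols(Array("position", "histCTR"))
    .setOutputCol("features")
    .transform(casted)
}
{code}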

> What's cooking classification
> -
>
> Key: SPARK-11234
> URL: https://issues.apache.org/jira/browse/SPARK-11234
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>
> I add the subtask to post the work on this dataset:  
> https://www.kaggle.com/c/whats-cooking



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11253) reset all accumulators in physical operators before execute an action

2015-10-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11253:
-
Assignee: Wenchen Fan

> reset all accumulators in physical operators before execute an action
> -
>
> Key: SPARK-11253
> URL: https://issues.apache.org/jira/browse/SPARK-11253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11253) reset all accumulators in physical operators before execute an action

2015-10-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11253.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9215
[https://github.com/apache/spark/pull/9215]

> reset all accumulators in physical operators before execute an action
> -
>
> Key: SPARK-11253
> URL: https://issues.apache.org/jira/browse/SPARK-11253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9861) Join: Determine the number of reducers used by a shuffle join operator at runtime

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973784#comment-14973784
 ] 

Apache Spark commented on SPARK-9861:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9276

> Join: Determine the number of reducers used by a shuffle join operator at 
> runtime
> -
>
> Key: SPARK-9861
> URL: https://issues.apache.org/jira/browse/SPARK-9861
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973782#comment-14973782
 ] 

Apache Spark commented on SPARK-9858:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9276

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9859) Aggregation: Determine the number of reducers at runtime

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9859:
---

Assignee: Apache Spark  (was: Yin Huai)

> Aggregation: Determine the number of reducers at runtime
> 
>
> Key: SPARK-9859
> URL: https://issues.apache.org/jira/browse/SPARK-9859
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9861) Join: Determine the number of reducers used by a shuffle join operator at runtime

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9861:
---

Assignee: Yin Huai  (was: Apache Spark)

> Join: Determine the number of reducers used by a shuffle join operator at 
> runtime
> -
>
> Key: SPARK-9861
> URL: https://issues.apache.org/jira/browse/SPARK-9861
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9858:
---

Assignee: Yin Huai  (was: Apache Spark)

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11304) SparkR in yarn-client mode fails creating sparkr.zip

2015-10-25 Thread Ram Venkatesh (JIRA)
Ram Venkatesh created SPARK-11304:
-

 Summary: SparkR in yarn-client mode fails creating sparkr.zip
 Key: SPARK-11304
 URL: https://issues.apache.org/jira/browse/SPARK-11304
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Ram Venkatesh


If you run sparkR in yarn-client mode and the spark installation directory is 
not writable by the current user, it fails with

Exception in thread "main" java.io.FileNotFoundException:
/usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at
org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
at
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The behavior is the same with the pre-built spark-1.5.1-bin-hadoop2.6
bits also.

We need to either use an existing sparkr.zip if we find one in the R/lib 
directory, or create the file in a location accessible to the submitting user.

Temporary hack workaround - create a world-writable file called sparkr.zip 
under R/lib. It will still fail if multiple users submit jobs at the same time.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11308) Change spark streaming's job scheduler logic to ensure guaranteed order of batch processing

2015-10-25 Thread Renjie Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renjie Liu updated SPARK-11308:
---
Priority: Major  (was: Minor)

> Change spark streaming's job scheduler logic to ensure guaranteed order of 
> batch processing
> ---
>
> Key: SPARK-11308
> URL: https://issues.apache.org/jira/browse/SPARK-11308
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Renjie Liu
>
> In current implementation, spark streaming uses a thread pool to run jobs 
> generated in each time interval and orders are not guaranteed, i.e., if jobs 
> generated in time 1 takes time longer than the batch duration, jobs 2 will 
> begin to execute and the finish order is not guaranteed. This implementation 
> is not quite useful in practice since it may cost much more storage. For 
> example, when we do a word count in spark streaming, to be accurate we need 
> to store records for each batch rather than just word count in database to be 
> idempotent. But if the processing order of each batch is guaranteed, we just 
> need to store the last update time with word count in database to be 
> idempotent. Just simply set the thread pool size to 1 may cause the system to 
> be inefficient when there are more than one output streams.  This feature can 
> be implemented by giving each output stream a thread and jobs of each output 
> stream are executed in order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11206) Support SQL UI on the history server

2015-10-25 Thread Carson Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973700#comment-14973700
 ] 

Carson Wang commented on SPARK-11206:
-

For the live SQL UI, the SQLContext is responsible for attaching the SQLTab and 
adding the SQLListener.
The history server, the standalone Master that rebuilds the web UI, and the event 
log listener that writes events to storage are all in the core module. 
Since there is no SQLContext for the history UI, these core components need to 
reference SQL classes such as SQLTab, SQLListener and the SQL events.

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It will be helpful if we support SQL UI on the 
> history server so we can analyze it even after its execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11310) Only build Spark core: modified the Spark pom file to delete graphx

2015-10-25 Thread yindu_asan (JIRA)
yindu_asan created SPARK-11310:
--

 Summary: Only build Spark core: modified the Spark pom file to delete 
graphx
 Key: SPARK-11310
 URL: https://issues.apache.org/jira/browse/SPARK-11310
 Project: Spark
  Issue Type: Question
Reporter: yindu_asan


I only want to build Spark core, so I modified the Spark pom file to delete 
graphx, bagel, ... 
But the resulting jar still contains graphx's Scala files.
Why?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2015-10-25 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973745#comment-14973745
 ] 

Kai Sasaki commented on SPARK-7146:
---

There have been several times when I wanted to use Spark's internal resources 
(e.g. shared params, optimization) from our own library or framework. 
Having to rewrite that code often causes trouble and long development time. In 
addition, as you said, there might be several implementations that have the same 
name but different functionality. 

{quote}
Cons:
Users have to be careful since parameters can have different meanings for 
different algorithms.
{quote}

I think this is also true even when {{sharedParams}} is private, because 
application developers will implement their own params with almost the same 
names as {{sharedParams}}. It becomes confusing.

So basically it might be better to enable developers to use {{sharedParams}} 
inside their own frameworks. That does not mean making it public directly. 
As [~josephkb] proposed in (b), a good way is to open it up for developers 
but with some restrictions.
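
For illustration, a hedged sketch of the kind of trait an external library ends up 
re-writing today while {{sharedParams}} stays private (modeled on, but not 
identical to, the internal traits):

{code}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

// Sketch: a re-implemented "maxIter" shared param, as an external ML library
// would write it today; it duplicates the name and doc of the private trait.
trait HasMaxIterLocal extends Params {
  final val maxIter: IntParam =
    new IntParam(this, "maxIter", "maximum number of iterations (>= 0)", ParamValidators.gtEq(0))
  final def getMaxIter: Int = $(maxIter)
}
{code}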

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Discussion: Should the Param traits in sharedParams.scala be public?
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> Proposal: Either
> (a) make the shared params private to encourage users to write specialized 
> documentation and value checks for parameters, or
> (b) design a better way to encourage overriding documentation and parameter 
> value checks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9858) Introduce a AdaptiveExchange operator and add it in the query planner.

2015-10-25 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973772#comment-14973772
 ] 

Yin Huai commented on SPARK-9858:
-

Instead of having an {{AdaptiveExchange}}, we will have an 
{{ExchangeCoordinator}} and just add it in our existing {{Exchange}} operator.

> Introduce a AdaptiveExchange operator and add it in the query planner.
> --
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2015-10-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9858:

Summary: Introduce an ExchangeCoordinator to estimate the number of 
post-shuffle partitions.  (was: Introduce a AdaptiveExchange operator and add 
it in the query planner.)

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-25 Thread eyal sharon (JIRA)
eyal sharon created SPARK-11302:
---

 Summary:  Multivariate Gaussian Model with Covariance  matrix 
return zero always 
 Key: SPARK-11302
 URL: https://issues.apache.org/jira/browse/SPARK-11302
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: eyal sharon
Priority: Minor



I have been trying to apply an anomaly detection model using Spark MLlib.

As input, I feed the model a mean vector and a covariance matrix, assuming my 
features have covariance between them.

Here is my input for the model; the model returns zero for every data point 
with this input.

MU vector:
1054.8, 1069.8, 1.3, 1040.1

Covariance matrix:
165496.0, 167996.0, 11.0, 163037.0
167996.0, 170631.0, 19.0, 165405.0
11.0, 19.0, 0.0, 2.0
163037.0, 165405.0, 2.0, 160707.0

Conversely, for the non-covariance (diagonal) case, represented by this matrix, 
the model works and returns results as expected:
165496.0, 0.0, 0.0, 0.0
0.0, 170631.0, 0.0, 0.0
0.0, 0.0, 0.8, 0.0
0.0, 0.0, 0.0, 160594.2
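
For reference, a minimal reproduction sketch (assuming the 
{{org.apache.spark.mllib.stat.distribution.MultivariateGaussian}} API and the 
values quoted above):

{code}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

// The covariance matrix is symmetric, so the column-major ordering used by
// Matrices.dense yields the same matrix as written in the report.
val mu = Vectors.dense(1054.8, 1069.8, 1.3, 1040.1)
val sigma = Matrices.dense(4, 4, Array(
  165496.0, 167996.0, 11.0, 163037.0,
  167996.0, 170631.0, 19.0, 165405.0,
  11.0, 19.0, 0.0, 2.0,
  163037.0, 165405.0, 2.0, 160707.0))

val model = new MultivariateGaussian(mu, sigma)
println(model.pdf(mu)) // per the report, the density comes back as 0.0 for every point
{code}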






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-10181) HiveContext is not used with keytab principal but with user principal/unix username

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973423#comment-14973423
 ] 

Apache Spark commented on SPARK-10181:
--

User 'yolandagao' has created a pull request for this issue:
https://github.com/apache/spark/pull/9272

> HiveContext is not used with keytab principal but with user principal/unix 
> username
> ---
>
> Key: SPARK-10181
> URL: https://issues.apache.org/jira/browse/SPARK-10181
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: kerberos
>Reporter: Bolke de Bruin
>  Labels: hive, hivecontext, kerberos
>
> `bin/spark-submit --num-executors 1 --executor-cores 5 --executor-memory 5G  
> --driver-java-options -XX:MaxPermSize=4G --driver-class-path 
> lib/datanucleus-api-jdo-3.2.6.jar:lib/datanucleus-core-3.2.10.jar:lib/datanucleus-rdbms-3.2.9.jar:conf/hive-site.xml
>  --files conf/hive-site.xml --master yarn --principal sparkjob --keytab 
> /etc/security/keytabs/sparkjob.keytab --conf 
> spark.yarn.executor.memoryOverhead=18000 --conf 
> "spark.executor.extraJavaOptions=-XX:MaxPermSize=4G" --conf 
> spark.eventLog.enabled=false ~/test.py`
> With:
> #!/usr/bin/python
> from pyspark import SparkContext
> from pyspark.sql import HiveContext
> sc = SparkContext()
> sqlContext = HiveContext(sc)
> query = """ SELECT * FROM fm.sk_cluster """
> rdd = sqlContext.sql(query)
> rdd.registerTempTable("test")
> sqlContext.sql("CREATE TABLE wcs.test LOCATION '/tmp/test_gl' AS SELECT * 
> FROM test")
> Ends up with:
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  Permission denie
> d: user=ua80tl, access=READ_EXECUTE, 
> inode="/tmp/test_gl/.hive-staging_hive_2015-08-24_10-43-09_157_78057390024057878
> 34-1/-ext-1":sparkjob:hdfs:drwxr-x---
> (Our umask denies read access to other by default)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11154) make specification spark.yarn.executor.memoryOverhead consistent with typical JVM options

2015-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973322#comment-14973322
 ] 

Sean Owen commented on SPARK-11154:
---

I think that if this is done at all, it would have to be with a new property. 
The old one would then be deprecated but continue to function. This would have 
to be done for all such properties.

> make specification spark.yarn.executor.memoryOverhead consistent with 
> typical JVM options
> --
>
> Key: SPARK-11154
> URL: https://issues.apache.org/jira/browse/SPARK-11154
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> spark.yarn.executor.memoryOverhead is currently specified in megabytes by 
> default, but it would be nice to allow users to specify the size as though it 
> were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended 
> to the end to explicitly specify megabytes or gigabytes.  
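
A hedged sketch of the suffix handling being asked for (a standalone helper, not 
Spark's actual parsing code):

{code}
// Accepts "384", "384m" or "2g" and normalizes to megabytes. Bare numbers keep
// their current meaning (megabytes) so existing configurations stay valid.
def memoryOverheadToMb(setting: String): Long = setting.trim.toLowerCase match {
  case s if s.endsWith("g") => s.dropRight(1).trim.toLong * 1024L
  case s if s.endsWith("m") => s.dropRight(1).trim.toLong
  case s                    => s.toLong
}
{code}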



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-25 Thread eyal sharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973325#comment-14973325
 ] 

eyal sharon commented on SPARK-11302:
-

Hi Sean,

Thanks for your reply. I will try to add more info

 - I'm using a multivariate Gaussian for anomaly detection, via the 
MultivariateGaussian class from MLlib.

This class lets you create a Gaussian instance and feed it a new data point 
(a dense vector) to get back the probability density. When I run my code, it 
always returns zero.

- I checked my code against an example implementation of anomaly detection I 
found on GitHub. Note that this example uses a *non covariance* (diagonal) 
matrix; if you run that code with a full covariance matrix, the PDF function 
always returns zero.

To check the covariance case, here is a function which takes a data set (mat) 
with features and a corresponding mean vector (mu):

def createCovSigma(mat: DenseMatrix, mu: Vector): DenseMatrix = {

  val rowsInArray = mat.transpose.toArray.grouped(mat.numCols).toArray
  val sigmaSubMU = rowsInArray.map(row =>
    { (row.toList zip mu.toArray).map(elem => elem._1 - elem._2) }.toArray)

  val checkArray = sigmaSubMU.flatMap(row => row)

  val mat2 = new DenseMatrix(mat.numRows, mat.numCols, checkArray, true)
  val sigmaTmp: DenseMatrix = mat2.transpose.multiply(mat2)
  val sigmaTmpArray = sigmaTmp.toArray
  val sigmaMatrix: DenseMatrix = new DenseMatrix(mat.numCols,
    mat.numCols, sigmaTmpArray.flatMap(x => List(x / mat.numRows)), true)

  sigmaMatrix
}


 If you need me to add more info I will

Thanks!

Eyal







>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an Anomaly Detection model  using Spark MLib. 
> As an input, I feed the model with a mean vector and a Covariance matrix. 
> ,assuming my features contain Co-variance.
> Here are my input for the  model ,and the model returns zero for each data 
> point for this input.
> MU vector - 
> 1054.8, 1069.8, 1.3 ,1040.1
> Cov' matrix - 
> 165496.0 , 167996.0,  11.0 , 163037.0  
> 167996.0,  170631.0,  19.0,  165405.0  
> 11.0,   19.0 , 0.0,   2.0   
> 163037.0,   165405.0 2.0 ,  160707.0 
> Conversely,  for the  non covariance case, represented by  this matrix ,the 
> model is working and returns results as expected 
> 165496.0,  0.0 ,   0.0,   0.0 
> 0.0,   170631.0,   0.0,   0.0 
> 0.0 ,   0.0 ,   0.8,   0.0 
> 0.0 ,   0.0,0.0,  160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6333) saveAsObjectFile support for compression codec

2015-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973252#comment-14973252
 ] 

Sean Owen commented on SPARK-6333:
--

See the pull request. There are some decent reasons that this shouldn't be 
added.

> saveAsObjectFile support for compression codec
> --
>
> Key: SPARK-6333
> URL: https://issues.apache.org/jira/browse/SPARK-6333
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Deenar Toraskar
>Priority: Minor
>
> saveAsObjectFile current does not support a compression codec.  This story is 
> about adding saveAsObjectFile (path, codec) support into spark.
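
A hedged workaround sketch in the meantime: mirror what saveAsObjectFile does 
internally, but pass a codec to saveAsSequenceFile. Gzip is an arbitrary choice 
here, and reading the output back needs a matching sequenceFile-based loader 
rather than sc.objectFile, since the chunks below are serialized as Vectors:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD

// Serialize records in small chunks and write them out as a compressed SequenceFile.
def saveAsCompressedObjectFile[T](rdd: RDD[T], path: String): Unit = {
  rdd.mapPartitions(_.grouped(10).map(_.toVector))
    .map { chunk =>
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(chunk)
      out.close()
      (NullWritable.get(), new BytesWritable(bytes.toByteArray))
    }
    .saveAsSequenceFile(path, Some(classOf[GzipCodec]))
}
{code}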



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11301) filter on partitioned column is case sensitive even the context is case insensitive

2015-10-25 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11301:
---

 Summary: filter on partitioned column is case sensitive even the 
context is case insensitive
 Key: SPARK-11301
 URL: https://issues.apache.org/jira/browse/SPARK-11301
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11301) filter on partitioned column is case sensitive even the context is case insensitive

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11301:


Assignee: (was: Apache Spark)

> filter on partitioned column is case sensitive even the context is case 
> insensitive
> ---
>
> Key: SPARK-11301
> URL: https://issues.apache.org/jira/browse/SPARK-11301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11301) filter on partitioned column is case sensitive even the context is case insensitive

2015-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973297#comment-14973297
 ] 

Apache Spark commented on SPARK-11301:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9271

> filter on partitioned column is case sensitive even the context is case 
> insensitive
> ---
>
> Key: SPARK-11301
> URL: https://issues.apache.org/jira/browse/SPARK-11301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11301) filter on partitioned column is case sensitive even the context is case insensitive

2015-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11301:


Assignee: Apache Spark

> filter on partitioned column is case sensitive even the context is case 
> insensitive
> ---
>
> Key: SPARK-11301
> URL: https://issues.apache.org/jira/browse/SPARK-11301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973321#comment-14973321
 ] 

Sean Owen commented on SPARK-11302:
---

It's not clear what you're trying to report. What code are you executing? What 
model?

>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an anomaly detection model using Spark MLlib. 
> As input, I feed the model a mean vector and a covariance matrix, assuming 
> my features have non-zero covariances.
> Here is my input for the model; the model returns zero for every data point 
> with this input.
> MU vector - 
> 1054.8, 1069.8, 1.3, 1040.1
> Cov' matrix - 
> 165496.0, 167996.0, 11.0, 163037.0
> 167996.0, 170631.0, 19.0, 165405.0
> 11.0, 19.0, 0.0, 2.0
> 163037.0, 165405.0, 2.0, 160707.0
> Conversely, for the non-covariance (diagonal) case, represented by this 
> matrix, the model works and returns results as expected:
> 165496.0, 0.0, 0.0, 0.0
> 0.0, 170631.0, 0.0, 0.0
> 0.0, 0.0, 0.8, 0.0
> 0.0, 0.0, 0.0, 160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10994) Clustering coefficient computation in GraphX

2015-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10994.
---
Resolution: Won't Fix

> Clustering coefficient computation in GraphX
> 
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The Clustering Coefficient (CC) is a fundamental measure in social (and other 
> types of) network analysis, assessing the degree to which nodes tend to cluster 
> together [1][2]. The clustering coefficient, along with density, node degree, 
> path length, diameter, connectedness, and node centrality, is one of the seven 
> most important properties used to characterise a network [3].
> We found that GraphX already implements connectedness, node centrality, and 
> path length, but does not have a component for computing the clustering 
> coefficient. This was our original motivation for implementing an algorithm 
> that computes the clustering coefficient for each vertex of a given graph.
> The clustering coefficient is very useful in many real applications, such as 
> user behaviour prediction and structure prediction (e.g. link prediction). We 
> have used it in several of our own papers (e.g., [4-5]), and many other 
> published papers use this metric in their work [6-8]. We are confident that 
> this feature will benefit GraphX and will attract a large number of users.
> References
> [1] https://en.wikipedia.org/wiki/Clustering_coefficient
> [2] Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of 
> ‘small-world’ networks." nature 393.6684 (1998): 440-442. (with 27266 
> citations).
> [3] https://en.wikipedia.org/wiki/Network_science
> [4] Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of 
> "Following" Links in Microblogging Networks. IEEE Transaction on Knowledge 
> and Data Engineering (TKDE), Volume 27, Issue 8, 2015, Pages 2093-2106.
> [5] Yang Yang, Jie Tang, Jacklyne Keomany, Yanting Zhao, Ying Ding, Juanzi 
> Li, and Liangwei Wang. Mining Competitive Relationships by Learning across 
> Heterogeneous Networks. In Proceedings of the Twenty-First Conference on 
> Information and Knowledge Management (CIKM'12). pp. 1432-1441.
> [6] Clauset, Aaron, Cristopher Moore, and Mark EJ Newman. Hierarchical 
> structure and the prediction of missing links in networks. Nature 453.7191 
> (2008): 98-101. (with 973 citations)
> [7] Adamic, Lada A., and Eytan Adar. Friends and neighbors on the web. Social 
> networks 25.3 (2003): 211-230. (1238 citations)
> [8] Lichtenwalter, Ryan N., Jake T. Lussier, and Nitesh V. Chawla. New 
> perspectives and methods in link prediction. In KDD'10.
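
For illustration, a minimal sketch (not the proposed implementation) of how the local
clustering coefficient described above could be derived from GraphX's existing
triangleCount() and degrees operators, following CC_v = 2*T_v / (d_v * (d_v - 1)) from [1]:

import scala.reflect.ClassTag
import org.apache.spark.graphx.{Graph, VertexRDD}

// Sketch only: local clustering coefficient per vertex. triangleCount() expects
// the graph to be in canonical form (srcId < dstId) and partitioned.
def localClusteringCoefficient[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED]): VertexRDD[Double] = {
  val triangles = graph.triangleCount().vertices        // triangles through each vertex
  graph.degrees.innerJoin(triangles) { (vid, deg, tri) =>
    // closed triplets / possible triplets
    if (deg < 2) 0.0 else 2.0 * tri / (deg * (deg - 1.0))
  }
}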



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11287) Executing deploy.client TestClient fails with bad class name

2015-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11287.
---
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.3

Issue resolved by pull request 9255
[https://github.com/apache/spark/pull/9255]

> Executing deploy.client TestClient fails with bad class name
> 
>
> Key: SPARK-11287
> URL: https://issues.apache.org/jira/browse/SPARK-11287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Priority: Trivial
> Fix For: 1.5.3, 1.6.0
>
>
> Execution of deploy.client.TestClient creates an ApplicationDescription to 
> start a TestExecutor which fails due to a bad class name.  
> Currently it is "spark.deploy.client.TestExecutor" but should be 
> "org.apache.spark.deploy.client.TestExecutor".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11287) Executing deploy.client TestClient fails with bad class name

2015-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11287:
--
Assignee: Bryan Cutler

> Executing deploy.client TestClient fails with bad class name
> 
>
> Key: SPARK-11287
> URL: https://issues.apache.org/jira/browse/SPARK-11287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
> Fix For: 1.5.3, 1.6.0
>
>
> Execution of deploy.client.TestClient creates an ApplicationDescription to 
> start a TestExecutor which fails due to a bad class name.  
> Currently it is "spark.deploy.client.TestExecutor" but should be 
> "org.apache.spark.deploy.client.TestExecutor".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-25 Thread Yuval Tanny (JIRA)
Yuval Tanny created SPARK-11303:
---

 Summary: sample (without replacement) + filter returns wrong 
results in DataFrame
 Key: SPARK-11303
 URL: https://issues.apache.org/jira/browse/SPARK-11303
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
 Environment: pyspark local mode, linux.
Reporter: Yuval Tanny


When sampling and then filtering a DataFrame from Python, we get inconsistent 
results when the sampled DataFrame is not cached. This bug doesn't appear in 
Spark 1.4.1.

d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50), ['t'])
d_sampled = d.sample(False, 0.1, 1)  # sample without replacement, fraction 0.1, seed 1
# Without caching, the filtered counts do not add up to the total (7 + 8 != 14):
print d_sampled.count()
print d_sampled.filter('t = 1').count()
print d_sampled.filter('t != 1').count()
d_sampled.cache()
# After caching, the counts are consistent (7 + 7 = 14):
print d_sampled.count()
print d_sampled.filter('t = 1').count()
print d_sampled.filter('t != 1').count()

output:
14
7
8
14
7
7

Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org