[jira] [Assigned] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13696:


Assignee: Josh Rosen  (was: Apache Spark)

> Remove BlockStore interface to more cleanly reflect different memory and disk 
> store responsibilities
> 
>
> Key: SPARK-13696
> URL: https://issues.apache.org/jira/browse/SPARK-13696
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Today, both the MemoryStore and DiskStore implement a common BlockStore API, 
> but I feel that this API is inappropriate because it abstracts away important 
> distinctions between the behavior of these two stores.
> For instance, the disk store doesn't have a notion of storing deserialized 
> objects, so it's confusing for it to expose object-based APIs like 
> putIterator() and getValues() instead of only exposing binary APIs and 
> pushing the responsibilities of serialization and deserialization to the 
> client.
> As part of a larger BlockManager interface cleanup, I'd like to remove the 
> BlockStore API and refine the MemoryStore and DiskStore interfaces to reflect 
> more narrow sets of responsibilities for those components.






[jira] [Assigned] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13696:


Assignee: Apache Spark  (was: Josh Rosen)

> Remove BlockStore interface to more cleanly reflect different memory and disk 
> store responsibilities
> 
>
> Key: SPARK-13696
> URL: https://issues.apache.org/jira/browse/SPARK-13696
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Today, both the MemoryStore and DiskStore implement a common BlockStore API, 
> but I feel that this API is inappropriate because it abstracts away important 
> distinctions between the behavior of these two stores.
> For instance, the disk store doesn't have a notion of storing deserialized 
> objects, so it's confusing for it to expose object-based APIs like 
> putIterator() and getValues() instead of only exposing binary APIs and 
> pushing the responsibilities of serialization and deserialization to the 
> client.
> As part of a larger BlockManager interface cleanup, I'd like to remove the 
> BlockStore API and refine the MemoryStore and DiskStore interfaces to reflect 
> more narrow sets of responsibilities for those components.






[jira] [Commented] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181527#comment-15181527
 ] 

Apache Spark commented on SPARK-13696:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11534

> Remove BlockStore interface to more cleanly reflect different memory and disk 
> store responsibilities
> 
>
> Key: SPARK-13696
> URL: https://issues.apache.org/jira/browse/SPARK-13696
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Today, both the MemoryStore and DiskStore implement a common BlockStore API, 
> but I feel that this API is inappropriate because it abstracts away important 
> distinctions between the behavior of these two stores.
> For instance, the disk store doesn't have a notion of storing deserialized 
> objects, so it's confusing for it to expose object-based APIs like 
> putIterator() and getValues() instead of only exposing binary APIs and 
> pushing the responsibilities of serialization and deserialization to the 
> client.
> As part of a larger BlockManager interface cleanup, I'd like to remove the 
> BlockStore API and refine the MemoryStore and DiskStore interfaces to reflect 
> more narrow sets of responsibilities for those components.






[jira] [Created] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities

2016-03-04 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13696:
--

 Summary: Remove BlockStore interface to more cleanly reflect 
different memory and disk store responsibilities
 Key: SPARK-13696
 URL: https://issues.apache.org/jira/browse/SPARK-13696
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen


Today, both the MemoryStore and DiskStore implement a common BlockStore API, 
but I feel that this API is inappropriate because it abstracts away important 
distinctions between the behavior of these two stores.

For instance, the disk store doesn't have a notion of storing deserialized 
objects, so it's confusing for it to expose object-based APIs like 
putIterator() and getValues() instead of only exposing binary APIs and pushing 
the responsibilities of serialization and deserialization to the client.

As part of a larger BlockManager interface cleanup, I'd like to remove the 
BlockStore API and refine the MemoryStore and DiskStore interfaces to reflect 
more narrow sets of responsibilities for those components.
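
As a rough illustration of the split being proposed here (a hypothetical sketch only, not the actual patch; all trait and method names below are invented), the memory store would keep both object-based and byte-based APIs, while the disk store would expose only byte-based ones and leave serialization to its callers:

{code}
// Hypothetical Scala sketch of the narrowed responsibilities described above.
import java.nio.ByteBuffer

trait MemoryStoreSketch {
  // The memory store can hold deserialized objects or serialized bytes.
  def putIterator[T](blockId: String, values: Iterator[T]): Boolean
  def putBytes(blockId: String, bytes: ByteBuffer): Boolean
  def getValues(blockId: String): Option[Iterator[Any]]
  def getBytes(blockId: String): Option[ByteBuffer]
}

trait DiskStoreSketch {
  // The disk store only ever sees bytes; callers serialize and deserialize.
  def putBytes(blockId: String, bytes: ByteBuffer): Unit
  def getBytes(blockId: String): ByteBuffer
}
{code}

With a shape like this, an object-based call on the disk store cannot exist by construction, which is the distinction the shared BlockStore API currently hides.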






[jira] [Assigned] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13695:


Assignee: Apache Spark  (was: Josh Rosen)

> Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading 
> spills
> ---
>
> Key: SPARK-13695
> URL: https://issues.apache.org/jira/browse/SPARK-13695
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> When a cached block is spilled to disk and read back in serialized form (i.e. 
> as bytes), the current BlockManager implementation will attempt to re-insert 
> the serialized block into the MemoryStore even if the block's storage level 
> requests deserialized caching.
> This behavior adds some complexity to the MemoryStore but I don't think it 
> offers many performance benefits and I'd like to remove it in order to 
> simplify a larger refactoring patch. Therefore, I propose to change the 
> behavior such that disk store reads will only cache bytes in the memory store 
> for blocks with serialized storage levels.
> There are two places where we request serialized bytes from the BlockStore:
> 1. getLocalBytes(), which is only called when reading local copies of 
> TorrentBroadcast pieces. Broadcast pieces are always cached using a 
> serialized storage level, so this won't lead to a mismatch in serialization 
> forms if spilled bytes read from disk are cached as bytes in the memory store.
> 2. the non-shuffle-block branch in getBlockData(), which is only called by 
> the NettyBlockRpcServer when responding to requests to read remote blocks. 
> Caching the serialized bytes in memory will only benefit us if those cached 
> bytes are read before they're evicted and the likelihood of that happening 
> seems low since the frequency of remote reads of non-broadcast cached blocks 
> seems very low. Caching these bytes when they have a low probability of being 
> read is bad if it risks the eviction of blocks which are cached in their 
> expected serialized/deserialized forms, since those blocks seem more likely 
> to be read in local computation.
> Therefore, I think this is a safe change.






[jira] [Commented] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181518#comment-15181518
 ] 

Apache Spark commented on SPARK-13695:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11533

> Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading 
> spills
> ---
>
> Key: SPARK-13695
> URL: https://issues.apache.org/jira/browse/SPARK-13695
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> When a cached block is spilled to disk and read back in serialized form (i.e. 
> as bytes), the current BlockManager implementation will attempt to re-insert 
> the serialized block into the MemoryStore even if the block's storage level 
> requests deserialized caching.
> This behavior adds some complexity to the MemoryStore but I don't think it 
> offers many performance benefits and I'd like to remove it in order to 
> simplify a larger refactoring patch. Therefore, I propose to change the 
> behavior such that disk store reads will only cache bytes in the memory store 
> for blocks with serialized storage levels.
> There are two places where we request serialized bytes from the BlockStore:
> 1. getLocalBytes(), which is only called when reading local copies of 
> TorrentBroadcast pieces. Broadcast pieces are always cached using a 
> serialized storage level, so this won't lead to a mismatch in serialization 
> forms if spilled bytes read from disk are cached as bytes in the memory store.
> 2. the non-shuffle-block branch in getBlockData(), which is only called by 
> the NettyBlockRpcServer when responding to requests to read remote blocks. 
> Caching the serialized bytes in memory will only benefit us if those cached 
> bytes are read before they're evicted and the likelihood of that happening 
> seems low since the frequency of remote reads of non-broadcast cached blocks 
> seems very low. Caching these bytes when they have a low probability of being 
> read is bad if it risks the eviction of blocks which are cached in their 
> expected serialized/deserialized forms, since those blocks seem more likely 
> to be read in local computation.
> Therefore, I think this is a safe change.






[jira] [Assigned] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13695:


Assignee: Josh Rosen  (was: Apache Spark)

> Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading 
> spills
> ---
>
> Key: SPARK-13695
> URL: https://issues.apache.org/jira/browse/SPARK-13695
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> When a cached block is spilled to disk and read back in serialized form (i.e. 
> as bytes), the current BlockManager implementation will attempt to re-insert 
> the serialized block into the MemoryStore even if the block's storage level 
> requests deserialized caching.
> This behavior adds some complexity to the MemoryStore but I don't think it 
> offers many performance benefits and I'd like to remove it in order to 
> simplify a larger refactoring patch. Therefore, I propose to change the 
> behavior such that disk store reads will only cache bytes in the memory store 
> for blocks with serialized storage levels.
> There are two places where we request serialized bytes from the BlockStore:
> 1. getLocalBytes(), which is only called when reading local copies of 
> TorrentBroadcast pieces. Broadcast pieces are always cached using a 
> serialized storage level, so this won't lead to a mismatch in serialization 
> forms if spilled bytes read from disk are cached as bytes in the memory store.
> 2. the non-shuffle-block branch in getBlockData(), which is only called by 
> the NettyBlockRpcServer when responding to requests to read remote blocks. 
> Caching the serialized bytes in memory will only benefit us if those cached 
> bytes are read before they're evicted and the likelihood of that happening 
> seems low since the frequency of remote reads of non-broadcast cached blocks 
> seems very low. Caching these bytes when they have a low probability of being 
> read is bad if it risks the eviction of blocks which are cached in their 
> expected serialized/deserialized forms, since those blocks seem more likely 
> to be read in local computation.
> Therefore, I think this is a safe change.






[jira] [Updated] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills

2016-03-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13695:
---
Summary: Don't cache MEMORY_AND_DISK blocks as bytes in memory store when 
reading spills  (was: Don't cache blocks at different serialization level when 
reading back from disk)

> Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading 
> spills
> ---
>
> Key: SPARK-13695
> URL: https://issues.apache.org/jira/browse/SPARK-13695
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> When a cached block is spilled to disk and read back in serialized form (i.e. 
> as bytes), the current BlockManager implementation will attempt to re-insert 
> the serialized block into the MemoryStore even if the block's storage level 
> requests deserialized caching.
> This behavior adds some complexity to the MemoryStore but I don't think it 
> offers many performance benefits and I'd like to remove it in order to 
> simplify a larger refactoring patch. Therefore, I propose to change the 
> behavior such that disk store reads will only cache bytes in the memory store 
> for blocks with serialized storage levels.
> There are two places where we request serialized bytes from the BlockStore:
> 1. getLocalBytes(), which is only called when reading local copies of 
> TorrentBroadcast pieces. Broadcast pieces are always cached using a 
> serialized storage level, so this won't lead to a mismatch in serialization 
> forms if spilled bytes read from disk are cached as bytes in the memory store.
> 2. the non-shuffle-block branch in getBlockData(), which is only called by 
> the NettyBlockRpcServer when responding to requests to read remote blocks. 
> Caching the serialized bytes in memory will only benefit us if those cached 
> bytes are read before they're evicted and the likelihood of that happening 
> seems low since the frequency of remote reads of non-broadcast cached blocks 
> seems very low. Caching these bytes when they have a low probability of being 
> read is bad if it risks the eviction of blocks which are cached in their 
> expected serialized/deserialized forms, since those blocks seem more likely 
> to be read in local computation.
> Therefore, I think this is a safe change.






[jira] [Created] (SPARK-13695) Don't cache blocks at different serialization level when reading back from disk

2016-03-04 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13695:
--

 Summary: Don't cache blocks at different serialization level when 
reading back from disk
 Key: SPARK-13695
 URL: https://issues.apache.org/jira/browse/SPARK-13695
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen


When a cached block is spilled to disk and read back in serialized form (i.e. 
as bytes), the current BlockManager implementation will attempt to re-insert 
the serialized block into the MemoryStore even if the block's storage level 
requests deserialized caching.

This behavior adds some complexity to the MemoryStore but I don't think it 
offers many performance benefits and I'd like to remove it in order to simplify 
a larger refactoring patch. Therefore, I propose to change the behavior such 
that disk store reads will only cache bytes in the memory store for blocks with 
serialized storage levels.

There are two places where we request serialized bytes from the BlockStore:

1. getLocalBytes(), which is only called when reading local copies of 
TorrentBroadcast pieces. Broadcast pieces are always cached using a serialized 
storage level, so this won't lead to a mismatch in serialization forms if 
spilled bytes read from disk are cached as bytes in the memory store.

2. the non-shuffle-block branch in getBlockData(), which is only called by the 
NettyBlockRpcServer when responding to requests to read remote blocks. Caching 
the serialized bytes in memory will only benefit us if those cached bytes are 
read before they're evicted and the likelihood of that happening seems low 
since the frequency of remote reads of non-broadcast cached blocks seems very 
low. Caching these bytes when they have a low probability of being read is bad 
if it risks the eviction of blocks which are cached in their expected 
serialized/deserialized forms, since those blocks seem more likely to be read 
in local computation.

Therefore, I think this is a safe change.
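
A minimal sketch of the proposed rule (hypothetical types and names, not the actual BlockManager code): after reading spilled bytes back from disk, re-insert them into the memory store only when the block's storage level asks for serialized in-memory caching:

{code}
// Hypothetical Scala sketch of the proposed behavior change.
import java.nio.ByteBuffer

case class StorageLevelSketch(useMemory: Boolean, deserialized: Boolean)

def maybeCacheSpilledBytes(
    level: StorageLevelSketch,
    diskBytes: ByteBuffer,
    putBytesInMemoryStore: ByteBuffer => Boolean): Unit = {
  // Old behavior: re-insert the bytes even when the level asks for deserialized caching.
  // Proposed behavior: cache the bytes only for serialized storage levels, so a
  // deserialized MEMORY_AND_DISK block is not re-cached as bytes in the memory store.
  if (level.useMemory && !level.deserialized) {
    putBytesInMemoryStore(diskBytes)
  }
}
{code}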






[jira] [Issue Comment Deleted] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-04 Thread Gayathri Murali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gayathri Murali updated SPARK-13073:

Comment: was deleted

(was: I can work on this, can you please assign it to me?)

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently, Spark ML provides only the coefficients for logistic regression. To 
> evaluate the trained model, tests such as the Wald test and the chi-square test 
> should be run, and their results summarized and displayed like R's GLM summary.
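
For context, a purely illustrative sketch (hypothetical names, not a proposed API): an R-style GLM summary reports, for each coefficient, the estimate, its standard error, the Wald z statistic (estimate divided by standard error) and a p-value; the Wald chi-square statistic is the square of z:

{code}
// Illustrative Scala sketch of per-coefficient Wald statistics (hypothetical names).
case class CoefSummary(name: String, estimate: Double, stdError: Double) {
  def zValue: Double = estimate / stdError   // Wald z statistic
  def waldChiSq: Double = zValue * zValue    // Wald chi-square with 1 degree of freedom
}

val coefs = Seq(
  CoefSummary("(Intercept)", -1.23, 0.45),
  CoefSummary("x1", 0.87, 0.21)
)
coefs.foreach { c =>
  println(f"${c.name}%-12s ${c.estimate}%9.4f ${c.stdError}%9.4f ${c.zValue}%8.3f")
}
{code}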






[jira] [Commented] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2016-03-04 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181418#comment-15181418
 ] 

Neelesh Srinivas Salian commented on SPARK-7505:


Does this still apply? [~nchammas]

> Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, 
> etc.
> 
>
> Key: SPARK-7505
> URL: https://issues.apache.org/jira/browse/SPARK-7505
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The PySpark docs for DataFrame need the following fixes and improvements:
> # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over 
> {{\_\_getattr\_\_}} and change all our examples accordingly.
> # *We should say clearly that the API is experimental.* (That is currently 
> not the case for the PySpark docs.)
> # We should provide an example of how to join and select from 2 DataFrames 
> that have identically named columns, because it is not obvious:
>   {code}
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12.select(df1['a'], df2['other']).show()
> a other   
> 
> 4 I dunno  {code}
> # 
> [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy]
>  and 
> [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort]
>  should be marked as aliases if that's what they are.






[jira] [Commented] (SPARK-6043) Error when trying to rename table with alter table after using INSERT OVERWRITE to populate the table

2016-03-04 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181416#comment-15181416
 ] 

Neelesh Srinivas Salian commented on SPARK-6043:


[~tleftwich] is this still an issue?


> Error when trying to rename table with alter table after using INSERT 
> OVERWRITE to populate the table
> 
>
> Key: SPARK-6043
> URL: https://issues.apache.org/jira/browse/SPARK-6043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Trystan Leftwich
>Priority: Minor
>
> If you populate a table using INSERT OVERWRITE and then try to rename the 
> table using alter table it fails with:
> {noformat}
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> Unable to alter table. (state=,code=0)
> {noformat}
> Using the following SQL statement creates the error:
> {code:sql}
> CREATE TABLE `tmp_table` (salesamount_c1 DOUBLE);
> INSERT OVERWRITE table tmp_table SELECT
>MIN(sales_customer.salesamount) salesamount_c1
> FROM
> (
>   SELECT
>  SUM(sales.salesamount) salesamount
>   FROM
>  internalsales sales
> ) sales_customer;
> ALTER TABLE tmp_table RENAME to not_tmp;
> {code}
> But if you change the 'OVERWRITE' to be 'INTO' the SQL statement works.
> This is happening on our CDH5.3 cluster with multiple workers, If we use the 
> CDH5.3 Quickstart VM the SQL does not produce an error. Both cases were spark 
> 1.2.1 built for hadoop2.4+






[jira] [Commented] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL

2016-03-04 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181398#comment-15181398
 ] 

Neelesh Srinivas Salian commented on SPARK-4549:


[~huangjs] could you please add more details here?


> Support BigInt -> Decimal in convertToCatalyst in SparkSQL
> --
>
> Key: SPARK-4549
> URL: https://issues.apache.org/jira/browse/SPARK-4549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>Priority: Minor
>
> Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt 
> to Decimal.
> Jianshi
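
A minimal sketch of the requested conversion (the helper name is hypothetical), assuming Spark SQL's org.apache.spark.sql.types.Decimal can be constructed from a scala.math.BigDecimal:

{code}
// Hypothetical sketch: convert a BigInt to Spark SQL's Decimal by going through BigDecimal.
import org.apache.spark.sql.types.Decimal

def bigIntToDecimal(value: BigInt): Decimal =
  Decimal(scala.math.BigDecimal(value))

// Example: bigIntToDecimal(BigInt("12345678901234567890"))
{code}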






[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-04 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181369#comment-15181369
 ] 

Gayathri Murali commented on SPARK-13641:
-

Yes, it looks intentional, in order to carry metadata. There are multiple ways in 
which it can be stripped out. I will let [~mengxr] answer whether it's OK to strip 
out the "c_" from the feature names in the SparkRWrapper.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which makes the output attributes differ from the original ones.
> E.g., we want to recover the HouseVotes84 feature names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts to the column names.
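
Purely as an illustration (hypothetical code, not part of SparkRWrapper or any proposed fix), the generated suffixes such as "_n" or "_y" mentioned above could be cut at the last underscore:

{code}
// Hypothetical sketch: recover "V1", "V2", ... from assembled names like "V1_n", "V2_y".
val assembledNames = Seq("V1_n", "V2_y", "V16_y")
val originalNames = assembledNames.map { name =>
  val i = name.lastIndexOf('_')
  if (i > 0) name.substring(0, i) else name
}
// originalNames == Seq("V1", "V2", "V16")
{code}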






[jira] [Assigned] (SPARK-13694) QueryPlan.expressions should always include all expressions

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13694:


Assignee: (was: Apache Spark)

> QueryPlan.expressions should always include all expressions
> ---
>
> Key: SPARK-13694
> URL: https://issues.apache.org/jira/browse/SPARK-13694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-13694) QueryPlan.expressions should always include all expressions

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181367#comment-15181367
 ] 

Apache Spark commented on SPARK-13694:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11532

> QueryPlan.expressions should always include all expressions
> ---
>
> Key: SPARK-13694
> URL: https://issues.apache.org/jira/browse/SPARK-13694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-13694) QueryPlan.expressions should always include all expressions

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13694:


Assignee: Apache Spark

> QueryPlan.expressions should always include all expressions
> ---
>
> Key: SPARK-13694
> URL: https://issues.apache.org/jira/browse/SPARK-13694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-13690) UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found)

2016-03-04 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181362#comment-15181362
 ] 

Santiago M. Mola commented on SPARK-13690:
--

snappy-java does not have any fallback, but snappy itself seems to work correctly 
on arm64. I submitted a PR for snappy-java, so a future version should have 
support. This issue will have to wait until such a version is out.

I don't expect active support for arm64, but given the latest developments on 
arm64 servers, I'm interested in experimenting with it. It seems I'm not the 
first one to think about it: http://www.sparkonarm.com/ ;-)

> UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is 
> found)
> -
>
> Key: SPARK-13690
> URL: https://issues.apache.org/jira/browse/SPARK-13690
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: $ java -version
> java version "1.8.0_73"
> Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)
> $ uname -a
> Linux spark-on-arm 4.2.0-55598-g45f70e3 #5 SMP Tue Feb 2 10:14:08 CET 2016 
> aarch64 aarch64 aarch64 GNU/Linux
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: arm64, porting
>
> UnsafeShuffleWriterSuite fails because of missing Snappy native library on 
> arm64.
> {code}
> Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 6.437 sec 
> <<< FAILURE! - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
> mergeSpillsWithFileStreamAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite)
>   Time elapsed: 0.072 sec  <<< ERROR!
> java.lang.reflect.InvocationTargetException
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
> Caused by: java.lang.IllegalArgumentException: org.xerial.snappy.SnappyError: 
> [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux 
> and os.arch=aarch64
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
> Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no 
> native library is found for os.name=Linux and os.arch=aarch64
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
> mergeSpillsWithTransferToAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite)
>   Time elapsed: 0.041 sec  <<< ERROR!
> java.lang.reflect.InvocationTargetException
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
> Caused by: java.lang.IllegalArgumentException: 
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.xerial.snappy.Snappy
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
> org.xerial.snappy.Snappy
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
> Running org.apache.spark.JavaAPISuite
> Tests run: 90, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 52.526 sec - 
> in org.apache.spark.JavaAPISuite
> Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
> Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.761 sec - 
> in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
> Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
> Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.967 sec - 
> in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
> Running org.apache.spark.api.java.OptionalSuite
> Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec - 
> in org.apache.spark.api.java.OptionalSuite
> Results :

[jira] [Created] (SPARK-13694) QueryPlan.expressions should always include all expressions

2016-03-04 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13694:
---

 Summary: QueryPlan.expressions should always include all 
expressions
 Key: SPARK-13694
 URL: https://issues.apache.org/jira/browse/SPARK-13694
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-12073) Backpressure causes individual Kafka partitions to lag

2016-03-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12073.
--
   Resolution: Fixed
 Assignee: Jason White
Fix Version/s: 2.0.0

> Backpressure causes individual Kafka partitions to lag
> --
>
> Key: SPARK-12073
> URL: https://issues.apache.org/jira/browse/SPARK-12073
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 2.0.0
>
>
> We're seeing a growing lag on (2) individual Kafka partitions, on a topic 
> with 32 partitions. Our individual batch sessions are completing in 5-7s, 
> with a batch window of 30s, so there's plenty of room for Streaming to catch 
> up, but it looks to be intentionally limiting itself. These partitions are 
> experiencing unbalanced load (higher than most of the others).
> What I believe is happening is that maxMessagesPerPartition calculates an 
> appropriate limit for the message rate from all partitions, and then divides 
> by the number of partitions to determine how many messages to retrieve per 
> partition. The problem with this approach is that when one partition is 
> behind by millions of records (due to random Kafka issues) or is experiencing 
> heavy load, the number of messages to be retrieved shouldn't be evenly split 
> among the partitions. In this scenario, if the rate estimator calculates only 
> 100k total messages can be retrieved, each partition (out of say 32) only 
> retrieves max 100k/32=3125 messages.
> Under some conditions, this results in the backpressure keeping the lagging 
> partition from recovering. The PIDRateEstimator doesn't increase the number 
> of messages to retrieve enough to recover, and we stabilize at a point where 
> these individual partitions slowly grow.
> I have a PR in progress on our fork to allocate maxMessagesPerPartition by 
> weighting the number of messages to be retrieved by the current lag each 
> partition is experiencing.
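
A hypothetical sketch of that weighting idea (invented names, not the actual PR): split the total per-batch message budget across partitions in proportion to each partition's current lag instead of evenly:

{code}
// Hypothetical Scala sketch: lag-weighted allocation of a per-batch message budget.
def allocateByLag(totalBudget: Long, lagByPartition: Map[Int, Long]): Map[Int, Long] = {
  val totalLag = math.max(lagByPartition.values.sum, 1L)
  lagByPartition.map { case (partition, lag) =>
    partition -> totalBudget * lag / totalLag
  }
}

// An even split of 100,000 messages over 32 partitions gives ~3,125 each; with lag
// weighting, a partition holding most of the lag receives most of the budget.
{code}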






[jira] [Updated] (SPARK-12073) Backpressure causes individual Kafka partitions to lag

2016-03-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12073:
-
Affects Version/s: 1.6.1
   1.6.0

> Backpressure causes individual Kafka partitions to lag
> --
>
> Key: SPARK-12073
> URL: https://issues.apache.org/jira/browse/SPARK-12073
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Jason White
>Assignee: Jason White
> Fix For: 2.0.0
>
>
> We're seeing a growing lag on (2) individual Kafka partitions, on a topic 
> with 32 partitions. Our individual batch sessions are completing in 5-7s, 
> with a batch window of 30s, so there's plenty of room for Streaming to catch 
> up, but it looks to be intentionally limiting itself. These partitions are 
> experiencing unbalanced load (higher than most of the others).
> What I believe is happening is that maxMessagesPerPartition calculates an 
> appropriate limit for the message rate from all partitions, and then divides 
> by the number of partitions to determine how many messages to retrieve per 
> partition. The problem with this approach is that when one partition is 
> behind by millions of records (due to random Kafka issues) or is experiencing 
> heavy load, the number of messages to be retrieved shouldn't be evenly split 
> among the partitions. In this scenario, if the rate estimator calculates only 
> 100k total messages can be retrieved, each partition (out of say 32) only 
> retrieves max 100k/32=3125 messages.
> Under some conditions, this results in the backpressure keeping the lagging 
> partition from recovering. The PIDRateEstimator doesn't increase the number 
> of messages to retrieve enough to recover, and we stabilize at a point where 
> these individual partitions slowly grow.
> I have a PR in progress on our fork to allocate maxMessagesPerPartition by 
> weighting the number of messages to be retrieved by the current lag each 
> partition is experiencing.






[jira] [Assigned] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13693:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}






[jira] [Assigned] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13693:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}






[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181303#comment-15181303
 ] 

Apache Spark commented on SPARK-13693:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11531

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}






[jira] [Assigned] (SPARK-13692) Fix trivial Coverity/Checkstyle defects

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13692:


Assignee: (was: Apache Spark)

> Fix trivial Coverity/Checkstyle defects
> ---
>
> Key: SPARK-13692
> URL: https://issues.apache.org/jira/browse/SPARK-13692
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes the following potential bugs and Java coding style detected 
> by Coverity and Checkstyle.
>   * Implement both null and type checking in equals functions.
>   * Fix wrong type casting logic in SimpleJavaBean2.equals.
>   * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
>   * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
>   * Fix coding style: Add '{}' to single `for` statement in mllib examples.
>   * Remove unused imports in `ColumnarBatch`.
> Please note that the last two checkstyle errors exist on newly added commits 
> after [SPARK-13583].






[jira] [Assigned] (SPARK-13692) Fix trivial Coverity/Checkstyle defects

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13692:


Assignee: Apache Spark

> Fix trivial Coverity/Checkstyle defects
> ---
>
> Key: SPARK-13692
> URL: https://issues.apache.org/jira/browse/SPARK-13692
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> This issue fixes the following potential bugs and Java coding style detected 
> by Coverity and Checkstyle.
>   * Implement both null and type checking in equals functions.
>   * Fix wrong type casting logic in SimpleJavaBean2.equals.
>   * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
>   * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
>   * Fix coding style: Add '{}' to single `for` statement in mllib examples.
>   * Remove unused imports in `ColumnarBatch`.
> Please note that the last two checkstyle errors exist on newly added commits 
> after [SPARK-13583].






[jira] [Commented] (SPARK-13692) Fix trivial Coverity/Checkstyle defects

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181298#comment-15181298
 ] 

Apache Spark commented on SPARK-13692:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11530

> Fix trivial Coverity/Checkstyle defects
> ---
>
> Key: SPARK-13692
> URL: https://issues.apache.org/jira/browse/SPARK-13692
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes the following potential bugs and Java coding style detected 
> by Coverity and Checkstyle.
>   * Implement both null and type checking in equals functions.
>   * Fix wrong type casting logic in SimpleJavaBean2.equals.
>   * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
>   * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
>   * Fix coding style: Add '{}' to single `for` statement in mllib examples.
>   * Remove unused imports in `ColumnarBatch`.
> Please note that the last two checkstyle errors exist on newly added commits 
> after [SPARK-13583].






[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-04 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181294#comment-15181294
 ] 

Xusen Yin commented on SPARK-13641:
---

Or maybe we could try to find another way to extract the feature names.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which makes the output attributes differ from the original ones.
> E.g., we want to recover the HouseVotes84 feature names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts to the column names.






[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-04 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181130#comment-15181130
 ] 

Xusen Yin commented on SPARK-13641:
---

I am not quite familiar with it, but I think it's intentional. Refer to the JIRA 
[SPARK-7198|https://issues.apache.org/jira/browse/SPARK-7198], and let's bring 
[~mengxr] in here.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which makes the output attributes differ from the original ones.
> E.g., we want to recover the HouseVotes84 feature names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts to the column names.






[jira] [Updated] (SPARK-13692) Fix trivial Coverity/Checkstyle defects

2016-03-04 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13692:
--
Description: 
This issue fixes the following potential bugs and Java coding style detected by 
Coverity and Checkstyle.

  * Implement both null and type checking in equals functions.
  * Fix wrong type casting logic in SimpleJavaBean2.equals.
  * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
  * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
  * Fix coding style: Add '{}' to single `for` statement in mllib examples.
  * Remove unused imports in `ColumnarBatch`.

Please note that the last two checkstyle errors exist on newly added commits 
after [SPARK-13583].

  was:
This issue fixes the following potential bugs and Java coding style detected by 
Coverity and Checkstyle.

  * Implement both null and type checking in equals functions.
  * Fix wrong type casting logic in SimpleJavaBean2.equals.
  * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
  * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
  * Fix coding style: Add '{}' to single `for` statement in mllib examples.

Please note that the checkstyle errors exist on newly added commits after 
[SPARK-13583].


> Fix trivial Coverity/Checkstyle defects
> ---
>
> Key: SPARK-13692
> URL: https://issues.apache.org/jira/browse/SPARK-13692
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes the following potential bugs and Java coding style detected 
> by Coverity and Checkstyle.
>   * Implement both null and type checking in equals functions.
>   * Fix wrong type casting logic in SimpleJavaBean2.equals.
>   * Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
>   * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
>   * Fix coding style: Add '{}' to single `for` statement in mllib examples.
>   * Remove unused imports in `ColumnarBatch`.
> Please note that the last two checkstyle errors exist on newly added commits 
> after [SPARK-13583].






[jira] [Created] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-03-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13693:


 Summary: Flaky test: o.a.s.streaming.MapWithStateSuite
 Key: SPARK-13693
 URL: https://issues.apache.org/jira/browse/SPARK-13693
 Project: Spark
  Issue Type: Test
  Components: Tests
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor


Fixed the following flaky test:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/

{code}
sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
at 
org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
at 
org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
at 
org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
{code}






[jira] [Created] (SPARK-13692) Fix trivial Coverity/Checkstyle defects

2016-03-04 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-13692:
-

 Summary: Fix trivial Coverity/Checkstyle defects
 Key: SPARK-13692
 URL: https://issues.apache.org/jira/browse/SPARK-13692
 Project: Spark
  Issue Type: Bug
  Components: Examples, Spark Core, SQL
Reporter: Dongjoon Hyun
Priority: Trivial


This issue fixes the following potential bugs and Java coding style issues detected 
by Coverity and Checkstyle.

  * Implement both null and type checking in equals functions (see the sketch below).
  * Fix wrong type casting logic in SimpleJavaBean2.equals.
  * Add `implements Cloneable` to `UTF8String` and `SortedIterator`.
  * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
  * Fix coding style: Add '{}' to single-statement `for` loops in mllib examples.

Please note that the checkstyle errors exist in commits newly added after 
[SPARK-13583].
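
As an illustration of the first item, here is a minimal sketch of an equals method 
that checks both for null and for the runtime type before comparing fields. It is 
written in Scala with a hypothetical class name purely for illustration; the actual 
fixes touch Java classes such as SimpleJavaBean2.

{code}
// Illustrative only: a hypothetical value class whose equals() guards against
// null and against a different runtime type before comparing fields.
class Point(val x: Int, val y: Int) {
  override def equals(other: Any): Boolean = other match {
    case null        => false                            // explicit null check
    case that: Point => x == that.x && y == that.y       // type check via pattern match
    case _           => false
  }
  override def hashCode(): Int = 31 * x + y
}
{code}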



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13605:
-
Fix Version/s: (was: 1.6.0)

> Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java 
> objects with columns
> --
>
> Key: SPARK-13605
> URL: https://issues.apache.org/jira/browse/SPARK-13605
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API, SQL
>Affects Versions: 1.6.0
> Environment: Any
>Reporter: Steven Lewis
>  Labels: easytest, features
>
> In the current environment, the only way to turn a List or JavaRDD into a 
> Dataset with columns is to use Encoders.bean(MyBean.class); the current 
> implementation fails if a bean property is not a basic type or a bean.
> I would like to see one of the following:
> 1) Default to Java serialization for any Java object implementing Serializable 
> when using the bean encoder
> 2) Allow an encoder backed by a Map from class to Encoder, and look up entries 
> when encoding classes - an ideal implementation would look for the class, then 
> any interfaces, and then search base classes
> The following code illustrates the issue
> {code}
> /**
>  * This class is a good Java bean but one field holds an object
>  * which is not a bean
>  */
> public class MyBean  implements Serializable {
> private int m_count;
> private String m_Name;
> private MyUnBean m_UnBean;
> public MyBean(int count, String name, MyUnBean unBean) {
> m_count = count;
> m_Name = name;
> m_UnBean = unBean;
> }
> public int getCount() {return m_count; }
> public void setCount(int count) {m_count = count;}
> public String getName() {return m_Name;}
> public void setName(String name) {m_Name = name;}
> public MyUnBean getUnBean() {return m_UnBean;}
> public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;}
> }
> /**
>  * This is a Java object which is not a bean
>  * no getters or setters but is serializable
>  */
> public class MyUnBean implements Serializable {
> public final int count;
> public final String name;
> public MyUnBean(int count, String name) {
> this.count = count;
> this.name = name;
> }
> }
> /**
>  * This code creates a list of objects containing MyBean -
>  * a Java Bean containing one field which is not bean 
>  * It then attempts and fails to use a bean encoder 
>  * to make a DataSet
>  */
> public class DatasetTest {
> public static final Random RND = new Random();
> public static final int LIST_SIZE = 100;
> public static String makeName() {
> return Integer.toString(RND.nextInt());
> }
> public static MyUnBean makeUnBean() {
> return new MyUnBean(RND.nextInt(), makeName());
> }
> public static MyBean makeBean() {
> return new MyBean(RND.nextInt(), makeName(), makeUnBean());
> }
> /**
>  * Make a list of MyBeans
>  * @return
>  */
> public static List makeBeanList() {
> List holder = new ArrayList();
> for (int i = 0; i < LIST_SIZE; i++) {
> holder.add(makeBean());
> }
> return holder;
> }
> public static SQLContext getSqlContext() {
> SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("BeanTest") ;
> Option option = sparkConf.getOption("spark.master");
> if (!option.isDefined())// use local over nothing
> sparkConf.setMaster("local[*]");
> JavaSparkContext ctx = new JavaSparkContext(sparkConf) ;
> return new SQLContext(ctx);
> }
> public static void main(String[] args) {
> SQLContext sqlContext = getSqlContext();
> Encoder evidence = Encoders.bean(MyBean.class);
> Encoder evidence2 = 
> Encoders.javaSerialization(MyUnBean.class);
> List holder = makeBeanList();
>  // fails at this line with
> // Exception in thread "main" java.lang.UnsupportedOperationException: no 
> encoder found for com.lordjoe.testing.MyUnBean
> Dataset beanSet  = sqlContext.createDataset( holder, 
> evidence);
> long count = beanSet.count();
> if(count != LIST_SIZE)
> throw new IllegalStateException("bad count");
> }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13605:
-
Target Version/s: 2.0.0  (was: 1.6.0)

> Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java 
> objects with columns
> --
>
> Key: SPARK-13605
> URL: https://issues.apache.org/jira/browse/SPARK-13605
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API, SQL
>Affects Versions: 1.6.0
> Environment: Any
>Reporter: Steven Lewis
>  Labels: easytest, features
> Fix For: 1.6.0
>
>
> In the current environment, the only way to turn a List or JavaRDD into a 
> Dataset with columns is to use Encoders.bean(MyBean.class); the current 
> implementation fails if a bean property is not a basic type or a bean.
> I would like to see one of the following:
> 1) Default to Java serialization for any Java object implementing Serializable 
> when using the bean encoder
> 2) Allow an encoder backed by a Map from class to Encoder, and look up entries 
> when encoding classes - an ideal implementation would look for the class, then 
> any interfaces, and then search base classes
> The following code illustrates the issue
> /**
>  * This class is a good Java bean but one field holds an object
>  * which is not a bean
>  */
> public class MyBean  implements Serializable {
> private int m_count;
> private String m_Name;
> private MyUnBean m_UnBean;
> public MyBean(int count, String name, MyUnBean unBean) {
> m_count = count;
> m_Name = name;
> m_UnBean = unBean;
> }
> public int getCount() {return m_count; }
> public void setCount(int count) {m_count = count;}
> public String getName() {return m_Name;}
> public void setName(String name) {m_Name = name;}
> public MyUnBean getUnBean() {return m_UnBean;}
> public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;}
> }
> /**
>  * This is a Java object which is not a bean
>  * no getters or setters but is serializable
>  */
> public class MyUnBean implements Serializable {
> public final int count;
> public final String name;
> public MyUnBean(int count, String name) {
> this.count = count;
> this.name = name;
> }
> }
> /**
>  * This code creates a list of objects containing MyBean -
>  * a Java Bean containing one field which is not bean 
>  * It then attempts and fails to use a bean encoder 
>  * to make a DataSet
>  */
> public class DatasetTest {
> public static final Random RND = new Random();
> public static final int LIST_SIZE = 100;
> public static String makeName() {
> return Integer.toString(RND.nextInt());
> }
> public static MyUnBean makeUnBean() {
> return new MyUnBean(RND.nextInt(), makeName());
> }
> public static MyBean makeBean() {
> return new MyBean(RND.nextInt(), makeName(), makeUnBean());
> }
> /**
>  * Make a list of MyBeans
>  * @return
>  */
> public static List makeBeanList() {
> List holder = new ArrayList();
> for (int i = 0; i < LIST_SIZE; i++) {
> holder.add(makeBean());
> }
> return holder;
> }
> public static SQLContext getSqlContext() {
> SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("BeanTest") ;
> Option option = sparkConf.getOption("spark.master");
> if (!option.isDefined())// use local over nothing
> sparkConf.setMaster("local[*]");
> JavaSparkContext ctx = new JavaSparkContext(sparkConf) ;
> return new SQLContext(ctx);
> }
> public static void main(String[] args) {
> SQLContext sqlContext = getSqlContext();
> Encoder evidence = Encoders.bean(MyBean.class);
> Encoder evidence2 = 
> Encoders.javaSerialization(MyUnBean.class);
> List holder = makeBeanList();
>  // fails at this line with
> // Exception in thread "main" java.lang.UnsupportedOperationException: no 
> encoder found for com.lordjoe.testing.MyUnBean
> Dataset beanSet  = sqlContext.createDataset( holder, 
> evidence);
> long count = beanSet.count();
> if(count != LIST_SIZE)
> throw new IllegalStateException("bad count");
> }
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13605:
-
Component/s: SQL

> Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java 
> objects with columns
> --
>
> Key: SPARK-13605
> URL: https://issues.apache.org/jira/browse/SPARK-13605
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API, SQL
>Affects Versions: 1.6.0
> Environment: Any
>Reporter: Steven Lewis
>  Labels: easytest, features
> Fix For: 1.6.0
>
>
> In the current environment, the only way to turn a List or JavaRDD into a 
> Dataset with columns is to use Encoders.bean(MyBean.class); the current 
> implementation fails if a bean property is not a basic type or a bean.
> I would like to see one of the following:
> 1) Default to Java serialization for any Java object implementing Serializable 
> when using the bean encoder
> 2) Allow an encoder backed by a Map from class to Encoder, and look up entries 
> when encoding classes - an ideal implementation would look for the class, then 
> any interfaces, and then search base classes
> The following code illustrates the issue
> /**
>  * This class is a good Java bean but one field holds an object
>  * which is not a bean
>  */
> public class MyBean  implements Serializable {
> private int m_count;
> private String m_Name;
> private MyUnBean m_UnBean;
> public MyBean(int count, String name, MyUnBean unBean) {
> m_count = count;
> m_Name = name;
> m_UnBean = unBean;
> }
> public int getCount() {return m_count; }
> public void setCount(int count) {m_count = count;}
> public String getName() {return m_Name;}
> public void setName(String name) {m_Name = name;}
> public MyUnBean getUnBean() {return m_UnBean;}
> public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;}
> }
> /**
>  * This is a Java object which is not a bean
>  * no getters or setters but is serializable
>  */
> public class MyUnBean implements Serializable {
> public final int count;
> public final String name;
> public MyUnBean(int count, String name) {
> this.count = count;
> this.name = name;
> }
> }
> /**
>  * This code creates a list of objects containing MyBean -
>  * a Java Bean containing one field which is not bean 
>  * It then attempts and fails to use a bean encoder 
>  * to make a DataSet
>  */
> public class DatasetTest {
> public static final Random RND = new Random();
> public static final int LIST_SIZE = 100;
> public static String makeName() {
> return Integer.toString(RND.nextInt());
> }
> public static MyUnBean makeUnBean() {
> return new MyUnBean(RND.nextInt(), makeName());
> }
> public static MyBean makeBean() {
> return new MyBean(RND.nextInt(), makeName(), makeUnBean());
> }
> /**
>  * Make a list of MyBeans
>  * @return
>  */
> public static List makeBeanList() {
> List holder = new ArrayList();
> for (int i = 0; i < LIST_SIZE; i++) {
> holder.add(makeBean());
> }
> return holder;
> }
> public static SQLContext getSqlContext() {
> SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("BeanTest") ;
> Option option = sparkConf.getOption("spark.master");
> if (!option.isDefined())// use local over nothing
> sparkConf.setMaster("local[*]");
> JavaSparkContext ctx = new JavaSparkContext(sparkConf) ;
> return new SQLContext(ctx);
> }
> public static void main(String[] args) {
> SQLContext sqlContext = getSqlContext();
> Encoder evidence = Encoders.bean(MyBean.class);
> Encoder evidence2 = 
> Encoders.javaSerialization(MyUnBean.class);
> List holder = makeBeanList();
>  // fails at this line with
> // Exception in thread "main" java.lang.UnsupportedOperationException: no 
> encoder found for com.lordjoe.testing.MyUnBean
> Dataset beanSet  = sqlContext.createDataset( holder, 
> evidence);
> long count = beanSet.count();
> if(count != LIST_SIZE)
> throw new IllegalStateException("bad count");
> }
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns

2016-03-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13605:
-
Description: 
In the current environment, the only way to turn a List or JavaRDD into a 
Dataset with columns is to use Encoders.bean(MyBean.class); the current 
implementation fails if a bean property is not a basic type or a bean.
I would like to see one of the following:
1) Default to Java serialization for any Java object implementing Serializable 
when using the bean encoder
2) Allow an encoder backed by a Map from class to Encoder, and look up entries 
when encoding classes - an ideal implementation would look for the class, then 
any interfaces, and then search base classes

The following code illustrates the issue:
{code}
// Imports added for completeness; the generic type parameters below were lost in
// the original formatting and have been restored from context.
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

import scala.Option;

/**
 * This class is a good Java bean, but one field holds an object
 * which is not a bean.
 */
public class MyBean implements Serializable {
    private int m_count;
    private String m_Name;
    private MyUnBean m_UnBean;

    public MyBean(int count, String name, MyUnBean unBean) {
        m_count = count;
        m_Name = name;
        m_UnBean = unBean;
    }

    public int getCount() { return m_count; }
    public void setCount(int count) { m_count = count; }
    public String getName() { return m_Name; }
    public void setName(String name) { m_Name = name; }
    public MyUnBean getUnBean() { return m_UnBean; }
    public void setUnBean(MyUnBean unBean) { m_UnBean = unBean; }
}

/**
 * This is a Java object which is not a bean:
 * no getters or setters, but it is serializable.
 */
public class MyUnBean implements Serializable {
    public final int count;
    public final String name;

    public MyUnBean(int count, String name) {
        this.count = count;
        this.name = name;
    }
}

/**
 * This code creates a list of MyBean objects - a Java bean containing one
 * field which is not a bean. It then attempts, and fails, to use a bean
 * encoder to make a Dataset.
 */
public class DatasetTest {
    public static final Random RND = new Random();
    public static final int LIST_SIZE = 100;

    public static String makeName() {
        return Integer.toString(RND.nextInt());
    }

    public static MyUnBean makeUnBean() {
        return new MyUnBean(RND.nextInt(), makeName());
    }

    public static MyBean makeBean() {
        return new MyBean(RND.nextInt(), makeName(), makeUnBean());
    }

    /** Make a list of MyBeans. */
    public static List<MyBean> makeBeanList() {
        List<MyBean> holder = new ArrayList<>();
        for (int i = 0; i < LIST_SIZE; i++) {
            holder.add(makeBean());
        }
        return holder;
    }

    public static SQLContext getSqlContext() {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("BeanTest");
        Option<String> option = sparkConf.getOption("spark.master");
        if (!option.isDefined()) {  // use local over nothing
            sparkConf.setMaster("local[*]");
        }
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        return new SQLContext(ctx);
    }

    public static void main(String[] args) {
        SQLContext sqlContext = getSqlContext();

        Encoder<MyBean> evidence = Encoders.bean(MyBean.class);
        Encoder<MyUnBean> evidence2 = Encoders.javaSerialization(MyUnBean.class);

        List<MyBean> holder = makeBeanList();

        // Fails at the next line with
        // Exception in thread "main" java.lang.UnsupportedOperationException:
        // no encoder found for com.lordjoe.testing.MyUnBean
        Dataset<MyBean> beanSet = sqlContext.createDataset(holder, evidence);

        long count = beanSet.count();
        if (count != LIST_SIZE) {
            throw new IllegalStateException("bad count");
        }
    }
}
{code}

  was:
In the current environment, the only way to turn a List or JavaRDD into a 
Dataset with columns is to use Encoders.bean(MyBean.class); the current 
implementation fails if a bean property is not a basic type or a bean.
I would like to see one of the following:
1) Default to Java serialization for any Java object implementing Serializable 
when using the bean encoder
2) Allow an encoder backed by a Map from class to Encoder, and look up entries 
when encoding classes - an ideal implementation would look for the class, then 
any interfaces, and then search base classes

The following code illustrates the issue:
/**
 * This class is a good Java bean but one field holds an object
 * which is not a bean
 */
public class MyBean  implements Serializable {
private int m_count;
private String m_Name;
private MyUnBean m_UnBean;

public MyBean(int count, String name, MyUnBean unBean) {
m_count = count;
m_Name = name;
m_UnBean = unBean;
}

public int getCount() {return m_count; }
public void setCount(int count) {m_count = count;}
public String getName() {return m_Name;}
public void setName(String name) {m_Name = name;}
public MyUnBean getUnBean() {return m_UnBean;}

[jira] [Updated] (SPARK-13691) Scala and Python generate inconsistent results

2016-03-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13691:
-
Affects Version/s: 1.4.1
   1.5.2
   1.6.0

> Scala and Python generate inconsistent results
> --
>
> Key: SPARK-13691
> URL: https://issues.apache.org/jira/browse/SPARK-13691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>
> Here is an example where Scala and Python generate different results:
> {code}
> Scala:
> scala> var i = 0
> i: Int = 0
> scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
> scala> rdd.collect()
> res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> scala> i += 1
> scala> rdd.collect()
> res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
> Python:
> >>> i = 0
> >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> >>> i += 1
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> {code}
> The difference is that Scala captures the variables' current values every time 
> a job runs, but Python captures the values only once and reuses them for all 
> jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13255) Integrate vectorized parquet scan with whole stage codegen.

2016-03-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13255.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11435
[https://github.com/apache/spark/pull/11435]

> Integrate vectorized parquet scan with whole stage codegen.
> ---
>
> Key: SPARK-13255
> URL: https://issues.apache.org/jira/browse/SPARK-13255
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Nong Li
> Fix For: 2.0.0
>
>
> The generated whole stage codegen is intended to be run over batches of rows. 
> This task is to integrate ColumnarBatches with whole stage codegen.
> The resulting generated code should look something like:
> {code}
> Iterator<ColumnarBatch> input;
> void process() {
>   while (input.hasNext()) {
>     ColumnarBatch batch = input.next();
>     for (Row row : batch) {
>       // Current function
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13691) Scala and Python generate inconsistent results

2016-03-04 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180724#comment-15180724
 ] 

Shixiong Zhu commented on SPARK-13691:
--

Ideally, PySpark should capture the variables' current values every time a job runs, just as Scala does.

> Scala and Python generate inconsistent results
> --
>
> Key: SPARK-13691
> URL: https://issues.apache.org/jira/browse/SPARK-13691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shixiong Zhu
>
> Here is an example where Scala and Python generate different results:
> {code}
> Scala:
> scala> var i = 0
> i: Int = 0
> scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
> scala> rdd.collect()
> res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> scala> i += 1
> scala> rdd.collect()
> res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
> Python:
> >>> i = 0
> >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> >>> i += 1
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> {code}
> The difference is that Scala captures the variables' current values every time 
> a job runs, but Python captures the values only once and reuses them for all 
> jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10086) Flaky StreamingKMeans test in PySpark

2016-03-04 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180723#comment-15180723
 ] 

Shixiong Zhu commented on SPARK-10086:
--

I opened SPARK-13691 to describe the root issue in PySpark.

> Flaky StreamingKMeans test in PySpark
> -
>
> Key: SPARK-10086
> URL: https://issues.apache.org/jira/browse/SPARK-10086
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark, Streaming, Tests
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Critical
> Attachments: flakyRepro.py
>
>
> Here's a report on investigating test failures in StreamingKMeans in PySpark. 
> (See Jenkins links below.)
> It is a StreamingKMeans test which trains on a DStream with 2 batches and 
> then tests on those same 2 batches.  It fails here: 
> [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]
> I recreated the same test, with variants training on: (1) the original 2 
> batches, (2) just the first batch, (3) just the second batch, and (4) neither 
> batch.  Here is code which avoids Streaming altogether to identify what 
> batches were processed.
> {code}
> from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel
> batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
> batches = [sc.parallelize(batch) for batch in batches]
> stkm = StreamingKMeans(decayFactor=0.0, k=2)
> stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
> # Train
> def update(rdd):
> stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)
> # Remove one or both of these lines to test skipping batches.
> update(batches[0])
> update(batches[1])
> # Test
> def predict(rdd):
> return stkm._model.predict(rdd)
> predict(batches[0]).collect()
> predict(batches[1]).collect()
> {code}
> *Results*:
> {code}
> ### EXPECTED
> [0, 1, 1] 
>   
> [1, 0, 1]
> ### Skip batch 0
> [1, 0, 0]
> [0, 1, 0]
> ### Skip batch 1
> [0, 1, 1]
> [1, 0, 1]
> ### Skip both batches  (This is what we see in the test 
> failures.)
> [0, 1, 1]
> [0, 0, 0]
> {code}
> Skipping both batches reproduces the failure.  There is no randomness in the 
> StreamingKMeans algorithm (since initial centers are fixed, not randomized).
> CC: [~tdas] [~freeman-lab] [~mengxr]
> Failure message:
> {code}
> ==
> FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
> Test that prediction happens on the updated model.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1147, in test_trainOn_predictOn
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 123, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 114, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py",
>  line 1144, in condition
> self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
> AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]
> First differing element 1:
> [0, 0, 0]
> [1, 0, 1]
> - [[0, 1, 1], [0, 0, 0]]
> ? 
> + [[0, 1, 1], [1, 0, 1]]
> ?  +++   ^
> --
> Ran 62 tests in 164.188s
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13691) Scala and Python generate inconsistent results

2016-03-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13691:


 Summary: Scala and Python generate inconsistent results
 Key: SPARK-13691
 URL: https://issues.apache.org/jira/browse/SPARK-13691
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Shixiong Zhu


Here is an example where Scala and Python generate different results:

{code}
Scala:
scala> var i = 0
i: Int = 0
scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> i += 1
scala> rdd.collect()
res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

Python:
>>> i = 0
>>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
>>> rdd.collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> i += 1
>>> rdd.collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
{code}

The difference is that Scala captures the variables' current values every time 
a job runs, but Python captures the values only once and reuses them for all 
jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13459) Separate Alive and Dead Executors in Executor Totals Table

2016-03-04 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-13459.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Separate Alive and Dead Executors in Executor Totals Table
> --
>
> Key: SPARK-13459
> URL: https://issues.apache.org/jira/browse/SPARK-13459
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
>Priority: Minor
> Fix For: 2.0.0
>
>
> Now that dead executors are shown in the executors table (SPARK-7729) the 
> totals table added in SPARK-12716 should be updated to include the separate 
> totals for alive and dead executors as well as the current total.
> (This improvement was originally discussed in the PR for SPARK-12716 while 
> SPARK-7729 was still in progress.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13690) UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found)

2016-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180713#comment-15180713
 ] 

Sean Owen commented on SPARK-13690:
---

Hm, I doubt Spark would be considered supported on a platform like this, for 
reasons like this one. But I thought snappy-java would fall back to non-native code?
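
For anyone who needs to run Spark on such a platform in the meantime, one possible 
workaround (it is not a fix for this suite, which exercises Snappy explicitly) is 
to avoid the Snappy native library by selecting a different codec through the 
standard spark.io.compression.codec setting. A minimal sketch:

{code}
import org.apache.spark.SparkConf

// Workaround sketch: steer clear of the Snappy native library by using a
// different block/shuffle compression codec (e.g. lzf, which is pure Java).
val conf = new SparkConf()
  .setAppName("arm64-smoke-test")
  .set("spark.io.compression.codec", "lzf")
{code}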

> UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is 
> found)
> -
>
> Key: SPARK-13690
> URL: https://issues.apache.org/jira/browse/SPARK-13690
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: $ java -version
> java version "1.8.0_73"
> Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)
> $ uname -a
> Linux spark-on-arm 4.2.0-55598-g45f70e3 #5 SMP Tue Feb 2 10:14:08 CET 2016 
> aarch64 aarch64 aarch64 GNU/Linux
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: arm64, porting
>
> UnsafeShuffleWriterSuite fails because of missing Snappy native library on 
> arm64.
> {code}
> Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 6.437 sec 
> <<< FAILURE! - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
> mergeSpillsWithFileStreamAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite)
>   Time elapsed: 0.072 sec  <<< ERROR!
> java.lang.reflect.InvocationTargetException
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
> Caused by: java.lang.IllegalArgumentException: org.xerial.snappy.SnappyError: 
> [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux 
> and os.arch=aarch64
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
> Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no 
> native library is found for os.name=Linux and os.arch=aarch64
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
> mergeSpillsWithTransferToAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite)
>   Time elapsed: 0.041 sec  <<< ERROR!
> java.lang.reflect.InvocationTargetException
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
> Caused by: java.lang.IllegalArgumentException: 
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.xerial.snappy.Snappy
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
> org.xerial.snappy.Snappy
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
> Running org.apache.spark.JavaAPISuite
> Tests run: 90, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 52.526 sec - 
> in org.apache.spark.JavaAPISuite
> Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
> Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.761 sec - 
> in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
> Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
> Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.967 sec - 
> in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
> Running org.apache.spark.api.java.OptionalSuite
> Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec - 
> in org.apache.spark.api.java.OptionalSuite
> Results :
> Tests in error: 
>   
> UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy:389->testMergingSpills:337
>  » InvocationTarget
>   
> UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy:384->testMergingSpills:337
>  » InvocationTarget
> {code}



--
This message was sent by 

[jira] [Updated] (SPARK-13459) Separate Alive and Dead Executors in Executor Totals Table

2016-03-04 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-13459:
--
Assignee: Alex Bozarth

> Separate Alive and Dead Executors in Executor Totals Table
> --
>
> Key: SPARK-13459
> URL: https://issues.apache.org/jira/browse/SPARK-13459
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
>Priority: Minor
>
> Now that dead executors are shown in the executors table (SPARK-7729) the 
> totals table added in SPARK-12716 should be updated to include the separate 
> totals for alive and dead executors as well as the current total.
> (This improvement was originally discussed in the PR for SPARK-12716 while 
> SPARK-7729 was still in progress.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13523) Reuse the exchanges in a query

2016-03-04 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra resolved SPARK-13523.
--
Resolution: Duplicate

> Reuse the exchanges in a query
> --
>
> Key: SPARK-13523
> URL: https://issues.apache.org/jira/browse/SPARK-13523
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> In an exchange, the RDD is materialized (shuffled or collected), so it is a 
> good point at which to eliminate common parts of a query.
> In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange 
> or BroadcastExchange) could be used multiple times; we should re-use them to 
> avoid the duplicated work and reduce the memory needed for broadcasts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13690) UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found)

2016-03-04 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-13690:


 Summary: UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no 
native library is found)
 Key: SPARK-13690
 URL: https://issues.apache.org/jira/browse/SPARK-13690
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
 Environment: $ java -version
java version "1.8.0_73"
Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)

$ uname -a
Linux spark-on-arm 4.2.0-55598-g45f70e3 #5 SMP Tue Feb 2 10:14:08 CET 2016 
aarch64 aarch64 aarch64 GNU/Linux
Reporter: Santiago M. Mola
Priority: Minor


UnsafeShuffleWriterSuite fails because of missing Snappy native library on 
arm64.


{code}
Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 6.437 sec <<< 
FAILURE! - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
mergeSpillsWithFileStreamAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite)
  Time elapsed: 0.072 sec  <<< ERROR!
java.lang.reflect.InvocationTargetException
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
Caused by: java.lang.IllegalArgumentException: org.xerial.snappy.SnappyError: 
[FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux 
and os.arch=aarch64
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)
Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no 
native library is found for os.name=Linux and os.arch=aarch64
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389)

mergeSpillsWithTransferToAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite)
  Time elapsed: 0.041 sec  <<< ERROR!
java.lang.reflect.InvocationTargetException
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
Caused by: java.lang.IllegalArgumentException: java.lang.NoClassDefFoundError: 
Could not initialize class org.xerial.snappy.Snappy
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
org.xerial.snappy.Snappy
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384)

Running org.apache.spark.JavaAPISuite
Tests run: 90, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 52.526 sec - 
in org.apache.spark.JavaAPISuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.761 sec - in 
org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.967 sec - in 
org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Running org.apache.spark.api.java.OptionalSuite
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec - in 
org.apache.spark.api.java.OptionalSuite

Results :

Tests in error: 
  
UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy:389->testMergingSpills:337
 » InvocationTarget
  
UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy:384->testMergingSpills:337
 » InvocationTarget
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13689) Move some methods in CatalystQl to a util object

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13689:


Assignee: Andrew Or  (was: Apache Spark)

> Move some methods in CatalystQl to a util object
> 
>
> Key: SPARK-13689
> URL: https://issues.apache.org/jira/browse/SPARK-13689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> When we add more DDL parsing logic in the future, SparkQl will become very 
> big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to 
> parse alter table commands. However, these parser objects will need to access 
> some helper methods that exist in CatalystQl. The proposal is to move those 
> methods to an isolated ParserUtils object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13689) Move some methods in CatalystQl to a util object

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180673#comment-15180673
 ] 

Apache Spark commented on SPARK-13689:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11529

> Move some methods in CatalystQl to a util object
> 
>
> Key: SPARK-13689
> URL: https://issues.apache.org/jira/browse/SPARK-13689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> When we add more DDL parsing logic in the future, SparkQl will become very 
> big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to 
> parse alter table commands. However, these parser objects will need to access 
> some helper methods that exist in CatalystQl. The proposal is to move those 
> methods to an isolated ParserUtils object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13689) Move some methods in CatalystQl to a util object

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13689:


Assignee: Apache Spark  (was: Andrew Or)

> Move some methods in CatalystQl to a util object
> 
>
> Key: SPARK-13689
> URL: https://issues.apache.org/jira/browse/SPARK-13689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> When we add more DDL parsing logic in the future, SparkQl will become very 
> big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to 
> parse alter table commands. However, these parser objects will need to access 
> some helper methods that exist in CatalystQl. The proposal is to move those 
> methods to an isolated ParserUtils object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13689) Move some methods in CatalystQl to a util object

2016-03-04 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13689:
-

 Summary: Move some methods in CatalystQl to a util object
 Key: SPARK-13689
 URL: https://issues.apache.org/jira/browse/SPARK-13689
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Andrew Or
Assignee: Andrew Or


When we add more DDL parsing logic in the future, SparkQl will become very big. 
To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse 
alter table commands. However, these parser objects will need to access some 
helper methods that exist in CatalystQl. The proposal is to move those methods 
to an isolated ParserUtils object.
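
For illustration, the shape of such a shared helper object might look like the 
sketch below; the method name and its contents are hypothetical, not the actual 
Spark code.

{code}
// Sketch only: helpers pulled out of the parser class into a standalone object
// so that several small "parser objects" can reuse them.
object ParserUtils {
  /** Strip surrounding backticks from a parsed identifier, if present. */
  def cleanIdentifier(ident: String): String =
    if (ident.length >= 2 && ident.startsWith("`") && ident.endsWith("`")) {
      ident.substring(1, ident.length - 1)
    } else {
      ident
    }
}
{code}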



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180663#comment-15180663
 ] 

Mark Grover commented on SPARK-13670:
-

I have a Mac, so I can do some more testing.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180643#comment-15180643
 ] 

Marcelo Vanzin commented on SPARK-13670:


Here's one that works regardless of the output of Main. It would be nice to 
have someone try this on MacOS.

{code}
# The launcher library will print arguments separated by a NULL character, to 
allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a 
while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell 
removes it from the
# command array and checks the value to see if the launcher succeeded.
build_command() {
  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
EC=${CMD[$LAST]}
if [ $EC != 0 ]; then
  exit $EC
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
{code}

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180636#comment-15180636
 ] 

Apache Spark commented on SPARK-13688:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/11528

> Add option to use dynamic allocation even if spark.executor.instances is set.
> -
>
> Key: SPARK-13688
> URL: https://issues.apache.org/jira/browse/SPARK-13688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>
> When both spark.dynamicAllocation.enabled and spark.executor.instances are 
> set, dynamic resource allocation is disabled (see SPARK-9092). This is a 
> reasonable default, but I think there should be a configuration property to 
> override it because it isn't obvious to users that dynamic allocation and 
> number of executors are mutually exclusive. We see users setting 
> --num-executors because that looks like what they want: a way to get more 
> executors.
> I propose adding a new boolean property, 
> spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation 
> the default when both are set and uses --num-executors as the minimum number 
> of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13688:


Assignee: (was: Apache Spark)

> Add option to use dynamic allocation even if spark.executor.instances is set.
> -
>
> Key: SPARK-13688
> URL: https://issues.apache.org/jira/browse/SPARK-13688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>
> When both spark.dynamicAllocation.enabled and spark.executor.instances are 
> set, dynamic resource allocation is disabled (see SPARK-9092). This is a 
> reasonable default, but I think there should be a configuration property to 
> override it because it isn't obvious to users that dynamic allocation and 
> number of executors are mutually exclusive. We see users setting 
> --num-executors because that looks like what they want: a way to get more 
> executors.
> I propose adding a new boolean property, 
> spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation 
> the default when both are set and uses --num-executors as the minimum number 
> of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13688:


Assignee: Apache Spark

> Add option to use dynamic allocation even if spark.executor.instances is set.
> -
>
> Key: SPARK-13688
> URL: https://issues.apache.org/jira/browse/SPARK-13688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> When both spark.dynamicAllocation.enabled and spark.executor.instances are 
> set, dynamic resource allocation is disabled (see SPARK-9092). This is a 
> reasonable default, but I think there should be a configuration property to 
> override it because it isn't obvious to users that dynamic allocation and 
> number of executors are mutually exclusive. We see users setting 
> --num-executors because that looks like what they want: a way to get more 
> executors.
> I propose adding a new boolean property, 
> spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation 
> the default when both are set and uses --num-executors as the minimum number 
> of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-04 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-13688:
--
Affects Version/s: 1.6.0

> Add option to use dynamic allocation even if spark.executor.instances is set.
> -
>
> Key: SPARK-13688
> URL: https://issues.apache.org/jira/browse/SPARK-13688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>
> When both spark.dynamicAllocation.enabled and spark.executor.instances are 
> set, dynamic resource allocation is disabled (see SPARK-9092). This is a 
> reasonable default, but I think there should be a configuration property to 
> override it because it isn't obvious to users that dynamic allocation and 
> number of executors are mutually exclusive. We see users setting 
> --num-executors because that looks like what they want: a way to get more 
> executors.
> I propose adding a new boolean property, 
> spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation 
> the default when both are set and uses --num-executors as the minimum number 
> of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13687) Cleanup pyspark temporary files

2016-03-04 Thread Damir (JIRA)
Damir created SPARK-13687:
-

 Summary: Cleanup pyspark temporary files
 Key: SPARK-13687
 URL: https://issues.apache.org/jira/browse/SPARK-13687
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.6.0, 1.5.2
Reporter: Damir


Every time parallelize is called, it creates a temporary file for the RDD in the 
spark.local.dir/spark-uuid/pyspark-uuid/ directory. This directory is deleted when 
the context is closed, but for long-running applications with a permanently open 
context it grows without bound.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-04 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-13688:
--
Component/s: YARN

> Add option to use dynamic allocation even if spark.executor.instances is set.
> -
>
> Key: SPARK-13688
> URL: https://issues.apache.org/jira/browse/SPARK-13688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Ryan Blue
>
> When both spark.dynamicAllocation.enabled and spark.executor.instances are 
> set, dynamic resource allocation is disabled (see SPARK-9092). This is a 
> reasonable default, but I think there should be a configuration property to 
> override it because it isn't obvious to users that dynamic allocation and 
> number of executors are mutually exclusive. We see users setting 
> --num-executors because that looks like what they want: a way to get more 
> executors.
> I propose adding a new boolean property, 
> spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation 
> the default when both are set and uses --num-executors as the minimum number 
> of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.

2016-03-04 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-13688:
-

 Summary: Add option to use dynamic allocation even if 
spark.executor.instances is set.
 Key: SPARK-13688
 URL: https://issues.apache.org/jira/browse/SPARK-13688
 Project: Spark
  Issue Type: Bug
Reporter: Ryan Blue


When both spark.dynamicAllocation.enabled and spark.executor.instances are set, 
dynamic resource allocation is disabled (see SPARK-9092). This is a reasonable 
default, but I think there should be a configuration property to override it 
because it isn't obvious to users that dynamic allocation and number of 
executors are mutually exclusive. We see users setting --num-executors because 
that looks like what they want: a way to get more executors.

I propose adding a new boolean property, 
spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation the 
default when both are set and uses --num-executors as the minimum number of 
executors.
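
As a rough illustration, here is a configuration sketch of how the proposed property 
might be used. Note that spark.dynamicAllocation.overrideNumExecutors is only the 
proposal above and does not exist in Spark today, so this is a sketch rather than 
working configuration:

{code}
import org.apache.spark.SparkConf

// Sketch only: overrideNumExecutors is the proposed property, not an existing setting.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.executor.instances", "10")                        // today this disables dynamic allocation
  .set("spark.dynamicAllocation.overrideNumExecutors", "true")  // proposed: keep dynamic allocation on
// Under the proposal, the 10 requested executors would act as the floor,
// roughly the same effect as setting spark.dynamicAllocation.minExecutors to 10.
{code}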



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180575#comment-15180575
 ] 

holdenk commented on SPARK-13684:
-

It should (coverity doesn't trigger on any of the places where we use 
AtomicLong), although getting rid of the volatile keyword should also remove 
the warning (looking at the netty docs 
http://netty.io/wiki/new-and-noteworthy-in-4.0.html#wiki-h2-34 "A user does not 
need to define a volatile field to keep the state of a handler.")

> Possible unsafe bytesRead increment in StreamInterceptor
> 
>
> Key: SPARK-13684
> URL: https://issues.apache.org/jira/browse/SPARK-13684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
> are triggered we may undercount bytesRead. This issue was found using 
> Coverity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13686) Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13686:


Assignee: Apache Spark

> Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD
> 
>
> Key: SPARK-13686
> URL: https://issues.apache.org/jira/browse/SPARK-13686
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not 
> have `regParam` as a constructor argument. They just depend on 
> GradientDescent's default regParam value. 
> To be consistent with other algorithms, we had better add them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13686) Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13686:


Assignee: (was: Apache Spark)

> Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD
> 
>
> Key: SPARK-13686
> URL: https://issues.apache.org/jira/browse/SPARK-13686
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not 
> have `regParam` as a constructor argument. They just depend on 
> GradientDescent's default regParam value. 
> To be consistent with other algorithms, we had better add them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13686) Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180567#comment-15180567
 ] 

Apache Spark commented on SPARK-13686:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11527

> Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD
> 
>
> Key: SPARK-13686
> URL: https://issues.apache.org/jira/browse/SPARK-13686
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not 
> have `regParam` as a constructor argument. They just depend on 
> GradientDescent's default regParam value. 
> To be consistent with other algorithms, we had better add them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7179) Add pattern after "show tables" to filter desire tablename

2016-03-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7179:

Target Version/s: 2.0.0

> Add pattern after "show tables" to filter desire tablename
> --
>
> Key: SPARK-7179
> URL: https://issues.apache.org/jira/browse/SPARK-7179
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: baishuo
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13686) Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD

2016-03-04 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13686:
--
Component/s: MLlib

> Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD
> 
>
> Key: SPARK-13686
> URL: https://issues.apache.org/jira/browse/SPARK-13686
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not 
> have `regParam` as a constructor argument. They just depend on 
> GradientDescent's default regParam value. 
> To be consistent with other algorithms, we had better add them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13686) Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD

2016-03-04 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13686:
--
Component/s: Streaming

> Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD
> 
>
> Key: SPARK-13686
> URL: https://issues.apache.org/jira/browse/SPARK-13686
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not 
> have `regParam` as a constructor argument. They just depend on 
> GradientDescent's default regParam value. 
> To be consistent with other algorithms, we had better add them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13686) Add a constructor parameter `reqParam` to (Streaming)LinearRegressionWithSGD

2016-03-04 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-13686:
-

 Summary: Add a constructor parameter `reqParam` to 
(Streaming)LinearRegressionWithSGD
 Key: SPARK-13686
 URL: https://issues.apache.org/jira/browse/SPARK-13686
 Project: Spark
  Issue Type: Bug
Reporter: Dongjoon Hyun
Priority: Minor


`LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have 
`regParam` as a constructor argument. They just depend on 
GradientDescent's default regParam value. 

To be consistent with other algorithms, we had better add them.
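
For reference, here is a sketch of the current workaround versus what the proposed 
constructor argument could look like. The commented-out signature is assumed for 
illustration only; the existing workaround goes through the public GradientDescent 
optimizer:

{code}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Today (1.6): regParam can only be reached through the underlying optimizer.
val lr = new LinearRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setStepSize(1.0)
  .setRegParam(0.01)

// Proposed (sketch only, exact signature to be decided): accept regParam directly,
// as the other *WithSGD algorithms do.
// val lr2 = new LinearRegressionWithSGD(stepSize = 1.0, numIterations = 100,
//   regParam = 0.01, miniBatchFraction = 1.0)
{code}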



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180554#comment-15180554
 ] 

Marcelo Vanzin commented on SPARK-13684:


If switching to AtomicLong gets rid of the warning, that's fine.

> Possible unsafe bytesRead increment in StreamInterceptor
> 
>
> Key: SPARK-13684
> URL: https://issues.apache.org/jira/browse/SPARK-13684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
> are triggered we may undercount bytesRead. This issue was found using 
> Coverity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13685:


Assignee: Apache Spark  (was: Andrew Or)

> Rename catalog.Catalog to ExternalCatalog
> -
>
> Key: SPARK-13685
> URL: https://issues.apache.org/jira/browse/SPARK-13685
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180551#comment-15180551
 ] 

Apache Spark commented on SPARK-13685:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11526

> Rename catalog.Catalog to ExternalCatalog
> -
>
> Key: SPARK-13685
> URL: https://issues.apache.org/jira/browse/SPARK-13685
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180550#comment-15180550
 ] 

holdenk commented on SPARK-13684:
-

We can use AtomicLong (so no need to use Doubles).
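
To make the difference concrete, here is a minimal sketch of the two variants being 
discussed. The class names are made up for illustration; this is not the actual 
StreamInterceptor code:

{code}
import java.util.concurrent.atomic.AtomicLong

// Pattern flagged by Coverity: "+=" on a volatile is a read-modify-write, not one atomic step.
class VolatileCounter {
  @volatile private var bytesRead: Long = 0L
  def onData(byteCount: Int): Unit = bytesRead += byteCount   // two interleaved callbacks could drop an update
  def total: Long = bytesRead
}

// AtomicLong variant: the increment itself is atomic, so there is nothing left to flag.
class AtomicCounter {
  private val bytesRead = new AtomicLong(0L)
  def onData(byteCount: Int): Unit = bytesRead.addAndGet(byteCount)
  def total: Long = bytesRead.get()
}
{code}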

> Possible unsafe bytesRead increment in StreamInterceptor
> 
>
> Key: SPARK-13684
> URL: https://issues.apache.org/jira/browse/SPARK-13684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
> are triggered we may undercount bytesRead. This issue was found using 
> Coverity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13685:


Assignee: Andrew Or  (was: Apache Spark)

> Rename catalog.Catalog to ExternalCatalog
> -
>
> Key: SPARK-13685
> URL: https://issues.apache.org/jira/browse/SPARK-13685
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180549#comment-15180549
 ] 

Marcelo Vanzin commented on SPARK-13684:


I don't think it could break unless netty changes the contract significantly. I 
find that highly unlikely to happen, since there's not much sense in processing 
different buffers read from the same socket concurrently.

> Possible unsafe bytesRead increment in StreamInterceptor
> 
>
> Key: SPARK-13684
> URL: https://issues.apache.org/jira/browse/SPARK-13684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
> are triggered we may undercount bytesRead. This issue was found using 
> Coverity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180544#comment-15180544
 ] 

Sean Owen commented on SPARK-13684:
---

The alternative is AtomicDouble, I suppose. If that's true, then it isn't necessary. 
Is it worth it for future-proofing, or is this just not something that 
could break? If the latter, it can be dismissed as a false positive.

> Possible unsafe bytesRead increment in StreamInterceptor
> 
>
> Key: SPARK-13684
> URL: https://issues.apache.org/jira/browse/SPARK-13684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
> are triggered we may undercount bytesRead. This issue was found using 
> Coverity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13572) HiveContext reads avro Hive tables incorrectly

2016-03-04 Thread Zoltan Fedor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Fedor updated SPARK-13572:
-
Affects Version/s: 1.6.0

> HiveContext reads avro Hive tables incorrectly 
> ---
>
> Key: SPARK-13572
> URL: https://issues.apache.org/jira/browse/SPARK-13572
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.0
> Environment: Hive 0.13.1, Spark 1.5.2
>Reporter: Zoltan Fedor
> Attachments: logs, table_definition
>
>
> I am using PySpark to read avro-based tables from Hive and while the avro 
> tables can be read, some of the columns are incorrectly read - showing value 
> "None" instead of the actual value.
> >>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest 
> >>> where year=2016 and month=2 and day=29 limit 3""")
> >>> results_df.take(3)
> [Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
> uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
> statvalue=None, displayname=None, category=None, 
> source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
>  Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
> uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
> statvalue=None, displayname=None, category=None, 
> source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
>  Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
> uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
> statvalue=None, displayname=None, category=None, 
> source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)]
> Observe the "None" values in most of the fields. Surprisingly, not all fields 
> but only some of them show "None" instead of the real values. The table 
> definition does not show anything specific about these columns.
> Running the same query in Hive:
> c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where 
> year=2016 and month=2 and day=29 limit 3;
> Columns: kafkaoffsetgeneration, kafkapartition, kafkaoffset, uuid, mid, iid, product, utctime, statcode, statvalue, displayname, category, source_filename, year, month, day
> Row 1: 11.0 | 0.0 | 3.83399394E8 | EF0D03C409681B98646F316CA1088973 | 174f53fb-ca9b-d3f9-64e1-7631bf906817 | ---- | est | 2016-01-13T06:58:19 | 8 | 3.0 SP11 (8.110.7601.18923) | MSXML 3.0 Version | PC Information | ops-20160228_23_35_01.gz | 2016 | 2 | 29
> Row 2: 11.0 | 0.0 | 3.83399395E8 | EF0D03C409681B98646F316CA1088973 | 174f53fb-ca9b-d3f9-64e1-7631bf906817 | ---- | est | 2016-01-13T06:58:19 | 2 | GenuineIntel | CPU Vendor | PC Information | ops-20160228_23_35_01.gz | 2016 | 2 | 29
> Row 3: 11.0 | ... (truncated)

[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180535#comment-15180535
 ] 

Marcelo Vanzin commented on SPARK-13670:


bq. And, it's guaranteed that Main would never return that output?

I'm reasonably sure it won't. I just tried spark-shell, pyspark and sparkR and 
they all work (since they all generate multiple command line arguments).

But if worried about that, you could always append the exit code to the output 
of {{build_command}} by always executing the printf. Then just remove the exit 
code in the parent shell, something like:

{code}
COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
EC=${CMD[$LAST]}            # the exit code appended by build_command is the last element
CMD=("${CMD[@]:0:$LAST}")   # drop it, keeping CMD an array
{code}

Not 100% sure about the syntax, but seems to work in a quick test.


> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog

2016-03-04 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13685:
-

 Summary: Rename catalog.Catalog to ExternalCatalog
 Key: SPARK-13685
 URL: https://issues.apache.org/jira/browse/SPARK-13685
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Andrew Or
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180516#comment-15180516
 ] 

Mark Grover commented on SPARK-13670:
-

And, it's guaranteed that Main would never return that output?

I feel like we may be overloading too much here but again, I don't have any 
non-ugly ideas either. We may just have to choose our poison.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180515#comment-15180515
 ] 

Marcelo Vanzin commented on SPARK-13684:


It's not a real bug; netty guarantees that, on the read pipeline, a single 
thread is running the handlers. The volatile is there for paranoia, in case two 
back-to-back handler invocations happen on different threads.

> Possible unsafe bytesRead increment in StreamInterceptor
> 
>
> Key: SPARK-13684
> URL: https://issues.apache.org/jira/browse/SPARK-13684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: holdenk
>Priority: Trivial
>
> We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
> are triggered we may undercount bytesRead. This issue was found using 
> Coverity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor

2016-03-04 Thread holdenk (JIRA)
holdenk created SPARK-13684:
---

 Summary: Possible unsafe bytesRead increment in StreamInterceptor
 Key: SPARK-13684
 URL: https://issues.apache.org/jira/browse/SPARK-13684
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: holdenk
Priority: Trivial


We unsafely increment a volatile (bytesRead) in a callback; if two callbacks 
are triggered we may undercount bytesRead. This issue was found using Coverity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180489#comment-15180489
 ] 

Marcelo Vanzin commented on SPARK-13670:


Here's something a little ugly that works though:

{code}
build_command() {
  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  EC=$?
  if [ $EC -ne 0 ]; then
printf "%d\0" $EC
  fi
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

if [ ${#CMD[@]} -eq 1 ]; then
  exit ${CMD[0]}
fi

exec "${CMD[@]}"
{code}

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180479#comment-15180479
 ] 

Marcelo Vanzin commented on SPARK-13670:


Not a fan... you have to write the file somewhere, clean it up, worry about who 
can read it, all sorts of stuff. Better to avoid it.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180463#comment-15180463
 ] 

Mark Grover commented on SPARK-13670:
-

Thanks for looking at this. What do you think of #2 - writing the output to a 
file?

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13683) Finalize the public interface for OutputWriter[Factory]

2016-03-04 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13683:


 Summary: Finalize the public interface for OutputWriter[Factory]
 Key: SPARK-13683
 URL: https://issues.apache.org/jira/browse/SPARK-13683
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust


We need to at least remove bucketing. I would also like to remove {{Job}} and 
the configuration stuff if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-03-04 Thread Anthony Brew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180456#comment-15180456
 ] 

Anthony Brew commented on SPARK-12675:
--

I'm hitting a bug similar to this also (same cast exceptions)...

java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
scala.collection.immutable.HashMap$SerializationProxy to field 
org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
scala.collection.immutable.Map in instance of 
org.apache.spark.executor.TaskMetrics
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
at 
org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)


> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>Reporter: Alexandru Rosianu
>Priority: Minor
>
> I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script 
> which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = 
> sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new 
> FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", 
> "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then 
> it starts fitting, the executor dies, the driver doesn't receive heartbeats 
> from the executor and it times out, then the script exits. It doesn't get 
> past that line.
> This is the exception that kills the executor:
> {code}
> java.io.IOException: java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.HashMap$SerializationProxy to field 
> org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
> scala.collection.immutable.Map in instance of 
> org.apache.spark.executor.TaskMetrics
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
>   at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:92)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 

[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180455#comment-15180455
 ] 

Marcelo Vanzin commented on SPARK-13670:


Actually scrap that, it breaks things when the spark-shell actually runs... 
back to the drawing board.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180444#comment-15180444
 ] 

Marcelo Vanzin commented on SPARK-13670:


Note that this will probably leave a bash process running somewhere alongside the 
Spark JVM, so it would probably need tweaks to avoid that... bash is fun.


> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180427#comment-15180427
 ] 

Marcelo Vanzin commented on SPARK-13670:


After some fun playing with arcane bash syntax, here's something that worked 
for me:

{code}
run_command() {
  CMD=()
  while IFS='' read -d '' -r ARG; do
echo "line: $ARG"
CMD+=("$ARG")
  done
  if [ ${#CMD[@]} -gt 0 ]; then
exec "${CMD[@]}"
  fi
}

set -o pipefail
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" | 
run_command
{code}

Example:

{noformat}
$ ./bin/spark-shell 
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
ahead of assembly.
Exception in thread "main" java.lang.IllegalArgumentException: Testing, 
testing, testing...
at org.apache.spark.launcher.Main.main(Main.java:93)

$ echo $?
1
{noformat}



> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13682) Finalize the public API for FileFormat

2016-03-04 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13682:


 Summary: Finalize the public API for FileFormat
 Key: SPARK-13682
 URL: https://issues.apache.org/jira/browse/SPARK-13682
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust


The current file format interface needs to be cleaned up before it's acceptable 
for public consumption:
 - Have a version that takes Row and does a conversion, hide the internal API.
 - Remove bucketing
 - Remove RDD and the broadcastedConf
 - Remove SQLContext (maybe include SparkSession?)
 - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13681) Reimplement CommitFailureTestRelationSuite

2016-03-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13681:
-
Description: This test case got broken by 
[#11509|https://github.com/apache/spark/pull/11509].  We should reimplement it 
as a format.

> Reimplement CommitFailureTestRelationSuite
> --
>
> Key: SPARK-13681
> URL: https://issues.apache.org/jira/browse/SPARK-13681
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Blocker
>
> This test case got broken by 
> [#11509|https://github.com/apache/spark/pull/11509].  We should reimplement 
> it as a format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13681) Reimplement CommitFailureTestRelationSuite

2016-03-04 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13681:


 Summary: Reimplement CommitFailureTestRelationSuite
 Key: SPARK-13681
 URL: https://issues.apache.org/jira/browse/SPARK-13681
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package

2016-03-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13633.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Move parser classes to o.a.s.sql.catalyst.parser package
> 
>
> Key: SPARK-13633
> URL: https://issues.apache.org/jira/browse/SPARK-13633
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13494) Cannot sort on a column which is of type "array"

2016-03-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180289#comment-15180289
 ] 

Xiao Li commented on SPARK-13494:
-

Can you try one of the newer versions? A lot of issues have been fixed 
in each release.

> Cannot sort on a column which is of type "array"
> 
>
> Key: SPARK-13494
> URL: https://issues.apache.org/jira/browse/SPARK-13494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yael Aharon
>
> Executing the following SQL results in an error if columnName refers to a 
> column of type array
> SELECT * FROM myTable ORDER BY columnName ASC LIMIT 50
> The error is 
> org.apache.spark.sql.AnalysisException: cannot resolve 'columnName ASC' due 
> to data type mismatch: cannot sort data type array



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yael Aharon updated SPARK-13680:

Attachment: setup.hql

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv, setup.hql
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to work around the issue by modifying the UDAF's bufferSchema to 
> use a single array-typed field, with the corresponding update and merge methods. 
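
For illustration, here is a sketch of that kind of array-buffer workaround applied 
to a plain average. This is not the reporter's actual code and the class name is 
made up; it only shows the shape of a UDAF whose buffer is a single array column:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Keep all intermediate state in one array-typed buffer column ([sum, count])
// instead of several scalar buffer columns.
class ArrayBufferedAvg extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("buf", ArrayType(DoubleType)) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Array(0.0, 0.0)

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val buf = buffer.getSeq[Double](0)
      buffer(0) = Array(buf(0) + input.getDouble(0), buf(1) + 1.0)
    }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getSeq[Double](0)
    val b = buffer2.getSeq[Double](0)
    buffer1(0) = Array(a(0) + b(0), a(1) + b(1))
  }

  def evaluate(buffer: Row): Any = {
    val buf = buffer.getSeq[Double](0)
    if (buf(1) == 0.0) null else buf(0) / buf(1)
  }
}

// Registration is the same as for any other UDAF:
// sqlContext.udf.register("myavg_array", new ArrayBufferedAvg)
{code}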



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint

2016-03-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180243#comment-15180243
 ] 

Joseph K. Bradley commented on SPARK-13434:
---

There are a few options here:
* Temp fix: Reduce the number of executors, as you suggested.
* Long-term for this RF implementation: Implement local training for deep 
trees.  Spilling the current tree to disk would help, but I'd guess that local 
training would have a bigger impact.
* Long-term fix via a separate RF implementation: I've been working for a long 
time on a column-partitioned implementation which will be better for tasks like 
yours with many features & deep trees.  It's making progress but not yet ready 
to merge into Spark.

> Reduce Spark RandomForest memory footprint
> --
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Ewan Higgs
>  Labels: decisiontree, mllib, randomforest
> Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in a user's benchmarking game on GitHub 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the 
> RandomForest training on largish datasets on machines with 64G memory and the 
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and have a heap profile from a 
> single machine by running {{jmap -histo:live <pid>}}. I took a sample 
> every 5 seconds and at the peak it looks like this:
> {code}
>  num #instances #bytes  class name
> --
>1:   5428073 8458773496  [D
>2:  12293653 4124641992  [I
>3:  32508964 1820501984  org.apache.spark.mllib.tree.model.Node
>4:  53068426 1698189632  org.apache.spark.mllib.tree.model.Predict
>5:  72853787 1165660592  scala.Some
>6:  16263408  910750848  
> org.apache.spark.mllib.tree.model.InformationGainStats
>7: 72969  390492744  [B
>8:   3327008  133080320  
> org.apache.spark.mllib.tree.impl.DTStatsAggregator
>9:   3754500  120144000  
> scala.collection.immutable.HashMap$HashMap1
>   10:   3318349  106187168  org.apache.spark.mllib.tree.model.Split
>   11:   3534946   84838704  
> org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:   3764745   60235920  java.lang.Integer
>   13:   3327008   53232128  
> org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:380804   45361144  [C
>   15:268887   34877128  
>   16:268887   34431568  
>   17:908377   34042760  [Lscala.collection.immutable.HashMap;
>   18:   110   2640  
> org.apache.spark.mllib.regression.LabeledPoint
>   19:   110   2640  org.apache.spark.mllib.linalg.SparseVector
>   20: 20206   25979864  
>   21:   100   2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:   100   2400  
> org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:908332   21799968  
> scala.collection.immutable.HashMap$HashTrieMap
>   24: 20206   20158864  
>   25: 17023   14380352  
>   26:16   13308288  
> [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:445797   10699128  scala.Tuple2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-03-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180241#comment-15180241
 ] 

Joseph K. Bradley commented on SPARK-13048:
---

I'd say the best fix would be to add an option to LDA to not delete the last 
checkpoint.  I'd prefer to expose this as a Param in the spark.ml API, but it 
could be added to the spark.mllib API as well if necessary.

[~holdenk]  I agree we need to figure out how to handle/control caching and 
checkpointing within Pipelines, but that will have to wait for after 2.0.

[~jvstein]  We try to minimize the public API.  Although I agree with you about 
opening up APIs in principle, it has proven dangerous in practice.  Even when 
we mark things DeveloperApi, many users still use those APIs, making it 
difficult to change them in the future.
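
As a sketch of what such an option could look like in spark.ml (the Param name 
below is hypothetical; no such setter exists today):

{code}
import org.apache.spark.ml.clustering.LDA

val lda = new LDA()
  .setK(50)
  .setOptimizer("em")
  .setCheckpointInterval(10)
// Hypothetical Param, name assumed for illustration:
//   .setKeepLastCheckpoint(true)  // leave the final checkpoint on disk so the
//                                 // DistributedLDAModel can still read its dependent RDDs
{code}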

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>    * Call this at the end to delete any remaining checkpoint files.
>    */
>   def deleteAllCheckpoints(): Unit = {
>     while (checkpointQueue.nonEmpty) {
>       removeCheckpointFile()
>     }
>   }
>
>   /**
>    * Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>    * This prints a warning but does not fail if the files cannot be removed.
>    */
>   private def removeCheckpointFile(): Unit = {
>     val old = checkpointQueue.dequeue()
>     // Since the old checkpoint is not deleted by Spark, we manually delete it.
>     val fs = FileSystem.get(sc.hadoopConfiguration)
>     getCheckpointFiles(old).foreach { checkpointFile =>
>       try {
>         fs.delete(new Path(checkpointFile), true)
>       } catch {
>         case e: Exception =>
>           logWarning("PeriodicCheckpointer could not remove old checkpoint file: " +
>             checkpointFile)
>       }
>     }
>   }
> {noformat}
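
For reference, a minimal sketch of how PeriodicCheckpointer could support the 
"leave the last checkpoint around" alternative described above (the method name 
deleteAllCheckpointsButLast is hypothetical, not existing API):

{noformat}
  /**
   * Delete all checkpoint files except the most recent one, so the returned
   * DistributedLDAModel can still recompute its lineage from it. The caller is
   * then responsible for cleaning up the checkpoint directory.
   */
  def deleteAllCheckpointsButLast(): Unit = {
    while (checkpointQueue.size > 1) {
      removeCheckpointFile()
    }
  }
{noformat}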



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180234#comment-15180234
 ] 

Sean Owen commented on SPARK-13230:
---

Thanks, that's a great analysis. It sounds like we might need to close this as a 
Scala problem and offer a workaround. For example, it's straightforward to write a 
little function that accomplishes the same thing, one which hopefully doesn't 
depend on serializing the same internal representation.
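
A minimal sketch of such a workaround (the helper name is illustrative, and it 
assumes summing values on key collisions, as in the reported code):

{noformat}
import scala.collection.immutable.HashMap

object MergeWorkaround {
  // Produces the same result as HashMap.merged for this use case, but is built on
  // foldLeft/updated/getOrElse, so it never touches the internal kv field which,
  // per the analysis, is lost during deserialization.
  def mergeFn: (HashMap[String, Long], HashMap[String, Long]) => HashMap[String, Long] =
    (m1, m2) => m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
{noformat}

Substituting this into the reported program, sc.parallelize(input).reduce(MergeWorkaround.mergeFn) 
should produce the same totals without calling merged.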

(PS: JIRA does not use markdown. Use pairs of curly braces to {{format as code}}.)

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn: (HashMap[String, Long], HashMap[String, Long]) => HashMap[String, Long] = {
>     case (m1, m2) => m1.merged(m2) { case (x, y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
>     val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 3L), HashMap("A" -> 2L, "C" -> 4L))
>     val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
>     val sc = new SparkContext(conf)
>     val result = sc.parallelize(input).reduce(mergeFn)
>     println(s"Result=$result")
>     sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017)
> at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at 

[jira] [Comment Edited] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180216#comment-15180216
 ] 

Łukasz Gieroń edited comment on SPARK-13230 at 3/4/16 5:36 PM:
---

The issue here is a bug in the Scala library, in the deserialization of `HashMap1` 
objects: when they are deserialized, the internal `kv` field is not restored (it is 
left `null`), which causes a `NullPointerException` in `merged`. I've fixed this in 
the Scala library, and it resolves the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between JIRAs).

PS. Not sure why Jira doesn't recognize my backticks markdown.
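
If that is right, a Spark-free round trip through Java serialization should be enough 
to trigger it on affected Scala versions (2.11.7 here). A minimal sketch, assuming the 
plain round trip loses `kv` the same way Spark's task-result serialization does:

{noformat}
import java.io._
import scala.collection.immutable.HashMap

object MergedNpeRepro {
  // Round-trips a value through Java serialization, roughly what happens to task results.
  def roundTrip[T](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray)).readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val m1 = roundTrip(HashMap("A" -> 1L, "B" -> 3L))
    val m2 = HashMap("A" -> 2L, "C" -> 4L)
    // If the deserialized HashMap1 nodes have kv == null, the merge function below
    // receives a null tuple for the colliding key "A" and throws a NullPointerException.
    val merged = m1.merged(m2) { case (x, y) => (x._1, x._2 + y._2) }
    println(merged)
  }
}
{noformat}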


was (Author: lgieron):
The issue here is a bug in the Scala library, in the deserialization of `HashMap1` 
objects: when they are deserialized, the internal `kv` field is not restored (it is 
left `null`), which causes a `NullPointerException` in `merged`. I've fixed this in 
the Scala library, and it resolves the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between JIRAs).


