[jira] [Assigned] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities
[ https://issues.apache.org/jira/browse/SPARK-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13696:

    Assignee: Josh Rosen (was: Apache Spark)

> Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities
>
> Key: SPARK-13696
> URL: https://issues.apache.org/jira/browse/SPARK-13696
> Project: Spark
> Issue Type: Improvement
> Components: Block Manager
> Reporter: Josh Rosen
> Assignee: Josh Rosen
>
> Today, both the MemoryStore and DiskStore implement a common BlockStore API, but I feel that this API is inappropriate because it abstracts away important distinctions between the behavior of these two stores.
> For instance, the disk store doesn't have a notion of storing deserialized objects, so it's confusing for it to expose object-based APIs like putIterator() and getValues() instead of only exposing binary APIs and pushing the responsibilities of serialization and deserialization to the client.
> As part of a larger BlockManager interface cleanup, I'd like to remove the BlockStore API and refine the MemoryStore and DiskStore interfaces to reflect more narrow sets of responsibilities for those components.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
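The split described above can be sketched in a toy model. This is a hedged Python illustration, not Spark's actual Scala interfaces; the class and method names (DiskStoreSketch, put_bytes, etc.) are hypothetical, though putIterator()/getValues() echo the APIs the issue names. The disk store exposes only binary APIs, pushing serialization to the caller, while the memory store can hold either form:

```python
# Illustrative sketch (not Spark's real code): binary-only disk store vs. a
# memory store that supports both serialized and deserialized entries.
import pickle


class DiskStoreSketch:
    """Binary-only store: callers serialize before writing."""

    def __init__(self):
        self._files = {}

    def put_bytes(self, block_id, data: bytes):
        self._files[block_id] = data

    def get_bytes(self, block_id) -> bytes:
        return self._files[block_id]


class MemoryStoreSketch:
    """Can hold either serialized bytes or deserialized objects."""

    def __init__(self):
        self._entries = {}

    def put_bytes(self, block_id, data: bytes):
        self._entries[block_id] = ("serialized", data)

    def put_iterator(self, block_id, values):
        self._entries[block_id] = ("deserialized", list(values))

    def get_values(self, block_id):
        kind, payload = self._entries[block_id]
        return iter(pickle.loads(payload)) if kind == "serialized" else iter(payload)


# A client caching to disk now serializes explicitly:
disk = DiskStoreSketch()
disk.put_bytes("rdd_0_0", pickle.dumps([1, 2, 3]))
assert pickle.loads(disk.get_bytes("rdd_0_0")) == [1, 2, 3]
```

The point of the sketch is that no object-based method exists on the disk store at all, so a mismatch like "getValues() on a store that only holds bytes" cannot arise.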
[jira] [Assigned] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities
[ https://issues.apache.org/jira/browse/SPARK-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13696:

    Assignee: Apache Spark (was: Josh Rosen)
[jira] [Commented] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities
[ https://issues.apache.org/jira/browse/SPARK-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181527#comment-15181527 ]

Apache Spark commented on SPARK-13696:

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11534
[jira] [Created] (SPARK-13696) Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities
Josh Rosen created SPARK-13696:

Summary: Remove BlockStore interface to more cleanly reflect different memory and disk store responsibilities
Key: SPARK-13696
URL: https://issues.apache.org/jira/browse/SPARK-13696
Project: Spark
Issue Type: Improvement
Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen
[jira] [Assigned] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
[ https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13695:

    Assignee: Apache Spark (was: Josh Rosen)

> Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
>
> Key: SPARK-13695
> URL: https://issues.apache.org/jira/browse/SPARK-13695
> Project: Spark
> Issue Type: Improvement
> Components: Block Manager
> Reporter: Josh Rosen
> Assignee: Apache Spark
>
> When a cached block is spilled to disk and read back in serialized form (i.e. as bytes), the current BlockManager implementation will attempt to re-insert the serialized block into the MemoryStore even if the block's storage level requests deserialized caching.
> This behavior adds some complexity to the MemoryStore, but I don't think it offers many performance benefits, and I'd like to remove it in order to simplify a larger refactoring patch. Therefore, I propose to change the behavior such that disk store reads will only cache bytes in the memory store for blocks with serialized storage levels.
> There are two places where we request serialized bytes from the BlockStore:
> 1. getLocalBytes(), which is only called when reading local copies of TorrentBroadcast pieces. Broadcast pieces are always cached using a serialized storage level, so this won't lead to a mismatch in serialization forms if spilled bytes read from disk are cached as bytes in the memory store.
> 2. the non-shuffle-block branch in getBlockData(), which is only called by the NettyBlockRpcServer when responding to requests to read remote blocks. Caching the serialized bytes in memory will only benefit us if those cached bytes are read before they're evicted, and the likelihood of that happening seems low since the frequency of remote reads of non-broadcast cached blocks seems very low. Caching these bytes when they have a low probability of being read is bad if it risks the eviction of blocks which are cached in their expected serialized/deserialized forms, since those blocks seem more likely to be read in local computation.
> Therefore, I think this is a safe change.
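The proposed rule in the description above is small enough to state as a predicate. A hedged Python sketch (the function and parameter names are illustrative, not Spark's actual API) contrasting the old and proposed behavior when deciding whether bytes read back from disk should be re-inserted into the memory store:

```python
# Hedged sketch: after reading a spilled block back from disk, only re-insert
# the raw bytes into the memory store when the block's storage level asked for
# serialized in-memory caching. Names here are illustrative, not Spark's API.
def should_cache_bytes_in_memory(storage_level_deserialized: bool,
                                 storage_level_use_memory: bool) -> bool:
    # Old behavior: re-insert whenever the level uses memory, even when the
    # level requests deserialized caching (a serialization-form mismatch).
    # Proposed behavior: only for serialized in-memory levels.
    return storage_level_use_memory and not storage_level_deserialized


# MEMORY_AND_DISK is a deserialized level: don't cache the raw bytes.
assert not should_cache_bytes_in_memory(storage_level_deserialized=True,
                                        storage_level_use_memory=True)
# MEMORY_AND_DISK_SER is a serialized level: caching bytes keeps the
# expected form, so re-insertion is allowed.
assert should_cache_bytes_in_memory(storage_level_deserialized=False,
                                    storage_level_use_memory=True)
```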
[jira] [Commented] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
[ https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181518#comment-15181518 ]

Apache Spark commented on SPARK-13695:

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11533
[jira] [Assigned] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
[ https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13695:

    Assignee: Josh Rosen (was: Apache Spark)
[jira] [Updated] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
[ https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-13695:

    Summary: Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills (was: Don't cache blocks at different serialization level when reading back from disk)
[jira] [Created] (SPARK-13695) Don't cache blocks at different serialization level when reading back from disk
Josh Rosen created SPARK-13695:

Summary: Don't cache blocks at different serialization level when reading back from disk
Key: SPARK-13695
URL: https://issues.apache.org/jira/browse/SPARK-13695
Project: Spark
Issue Type: Improvement
Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen
[jira] [Issue Comment Deleted] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala
[ https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gayathri Murali updated SPARK-13073:

    Comment: was deleted (was: I can work on this, can you please assign it to me?)

> creating R like summary for logistic Regression in Spark - Scala
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Reporter: Samsudhin
> Priority: Minor
>
> Currently Spark ML provides coefficients for logistic regression. To evaluate the trained model, tests like the Wald test and chi-square tests should be run, and their results summarized and displayed like R's GLM summary.
[jira] [Commented] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.
[ https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181418#comment-15181418 ]

Neelesh Srinivas Salian commented on SPARK-7505:

Does this still apply? [~nchammas]

> Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.
>
> Key: SPARK-7505
> URL: https://issues.apache.org/jira/browse/SPARK-7505
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, PySpark, SQL
> Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Reporter: Nicholas Chammas
> Priority: Minor
>
> The PySpark docs for DataFrame need the following fixes and improvements:
> # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over {{\_\_getattr\_\_}} and change all our examples accordingly.
> # *We should say clearly that the API is experimental.* (That is currently not the case for the PySpark docs.)
> # We should provide an example of how to join and select from 2 DataFrames that have identically named columns, because it is not obvious:
> {code}
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12.select(df1['a'], df2['other']).show()
> a other
> 4 I dunno
> {code}
> # [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy] and [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort] should be marked as aliases if that's what they are.
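The preference for {{\_\_getitem\_\_}} over {{\_\_getattr\_\_}} mentioned above can be shown without PySpark at all. This is a toy class, not PySpark's DataFrame: attribute access silently loses to any real method of the same name, while item access always reaches the column:

```python
# Toy illustration (not PySpark itself) of why item access is safer than
# attribute access: a column named like an existing DataFrame method is
# unreachable via attribute lookup, but always reachable via __getitem__.
class ToyFrame:
    def __init__(self, columns):
        self._columns = columns

    def count(self):              # a real method, like DataFrame.count()
        return len(self._columns)

    def __getitem__(self, name):  # always resolves to the column
        return self._columns[name]

    def __getattr__(self, name):  # fallback only when no real attribute exists
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)


df = ToyFrame({"a": [1, 2], "count": [3, 4]})
assert df.a == [1, 2]             # attribute access happens to work here
assert callable(df.count)         # but 'count' hits the method, not the column
assert df["count"] == [3, 4]      # item access is unambiguous
```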
[jira] [Commented] (SPARK-6043) Error when trying to rename table with alter table after using INSERT OVERWRITE to populate the table
[ https://issues.apache.org/jira/browse/SPARK-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181416#comment-15181416 ]

Neelesh Srinivas Salian commented on SPARK-6043:

[~tleftwich] is this still an issue?

> Error when trying to rename table with alter table after using INSERT OVERWRITE to populate the table
>
> Key: SPARK-6043
> URL: https://issues.apache.org/jira/browse/SPARK-6043
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.1
> Reporter: Trystan Leftwich
> Priority: Minor
>
> If you populate a table using INSERT OVERWRITE and then try to rename the table using ALTER TABLE, it fails with:
> {noformat}
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. (state=,code=0)
> {noformat}
> The following SQL statements reproduce the error:
> {code:sql}
> CREATE TABLE `tmp_table` (salesamount_c1 DOUBLE);
> INSERT OVERWRITE TABLE tmp_table SELECT
>   MIN(sales_customer.salesamount) salesamount_c1
> FROM
> (
>   SELECT
>     SUM(sales.salesamount) salesamount
>   FROM
>     internalsales sales
> ) sales_customer;
> ALTER TABLE tmp_table RENAME TO not_tmp;
> {code}
> But if you change 'OVERWRITE' to 'INTO', the SQL statement works.
> This is happening on our CDH5.3 cluster with multiple workers. If we use the CDH5.3 Quickstart VM, the SQL does not produce an error. Both cases were Spark 1.2.1 built for hadoop2.4+.
[jira] [Commented] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181398#comment-15181398 ]

Neelesh Srinivas Salian commented on SPARK-4549:

[~huangjs] could you please add more details here?

> Support BigInt -> Decimal in convertToCatalyst in SparkSQL
>
> Key: SPARK-4549
> URL: https://issues.apache.org/jira/browse/SPARK-4549
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Jianshi Huang
> Priority: Minor
>
> Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to Decimal.
> Jianshi
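The conversion requested above is lossless because a decimal type can represent any unbounded integer exactly. A hedged Python sketch of the idea (Python's arbitrary-precision int stands in for Scala's BigInt, decimal.Decimal for Catalyst's Decimal; the function name is illustrative, not Spark's actual convertToCatalyst code):

```python
# Hedged sketch of mapping a BigInt-like value to a Decimal-like value, as the
# issue asks convertToCatalyst to do. Illustrative only, not Spark code.
from decimal import Decimal


def convert_to_catalyst_sketch(value):
    if isinstance(value, int):    # BigInt-like: exact, unbounded integer
        return Decimal(value)     # Decimal(int) is exact, no precision loss
    return value                  # other types pass through unchanged


big = 12345678901234567890123456789
assert convert_to_catalyst_sketch(big) == Decimal("12345678901234567890123456789")
```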
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181369#comment-15181369 ]

Gayathri Murali commented on SPARK-13641:

Yes, it looks intentional to carry metadata. There are multiple ways in which it can be stripped out. I will let [~mengxr] answer whether it's OK to strip the "c_" from feature names in the SparkRWrapper.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
> Issue Type: Bug
> Components: ML, SparkR
> Reporter: Xusen Yin
> Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
>     m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the features column is generated by RFormula. The latter has a VectorAssembler in it, which makes the output attributes differ from the original ones.
> E.g., we want to learn the HouseVotes84 features' names "V1, V2, ..., V16". But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the transform function of VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] adds salts to the column names.
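One of the "multiple ways" the comment alludes to could be a purely textual recovery of the original names. This is a hedged sketch, not SparkRWrapper code, and it carries a real caveat: stripping the value suffix after the last underscore ("V1_n" -> "V1") would misfire on original column names that themselves contain underscores, which is presumably why the commenter defers the decision:

```python
# Hedged sketch: recover original column names by dropping the value suffix
# that the assembled attribute names carry ("V1_n" -> "V1"). Illustrative
# only; ambiguous for original names containing underscores.
def strip_level_suffix(attr_name: str) -> str:
    base, sep, _suffix = attr_name.rpartition("_")
    return base if sep else attr_name  # no underscore: name is already clean


assert [strip_level_suffix(n) for n in ["V1_n", "V2_y", "V16_y"]] == ["V1", "V2", "V16"]
assert strip_level_suffix("age") == "age"   # numeric columns keep their name
```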
[jira] [Assigned] (SPARK-13694) QueryPlan.expressions should always include all expressions
[ https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13694:

    Assignee: (was: Apache Spark)

> QueryPlan.expressions should always include all expressions
>
> Key: SPARK-13694
> URL: https://issues.apache.org/jira/browse/SPARK-13694
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Wenchen Fan
[jira] [Commented] (SPARK-13694) QueryPlan.expressions should always include all expressions
[ https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181367#comment-15181367 ]

Apache Spark commented on SPARK-13694:

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11532
[jira] [Assigned] (SPARK-13694) QueryPlan.expressions should always include all expressions
[ https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13694:

    Assignee: Apache Spark
[jira] [Commented] (SPARK-13690) UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found)
[ https://issues.apache.org/jira/browse/SPARK-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181362#comment-15181362 ]

Santiago M. Mola commented on SPARK-13690:

snappy-java does not have any fallback, but snappy seems to work on arm64 correctly. I submitted a PR for snappy-java, so a future version should have support. This issue will have to wait until such a version is out. I don't expect active support for arm64, but given the latest developments on arm64 servers, I'm interested in experimenting with it. It seems I'm not the first one to think about it: http://www.sparkonarm.com/ ;-)

> UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found)
>
> Key: SPARK-13690
> URL: https://issues.apache.org/jira/browse/SPARK-13690
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Environment:
> $ java -version
> java version "1.8.0_73"
> Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)
> $ uname -a
> Linux spark-on-arm 4.2.0-55598-g45f70e3 #5 SMP Tue Feb 2 10:14:08 CET 2016 aarch64 aarch64 aarch64 GNU/Linux
> Reporter: Santiago M. Mola
> Priority: Minor
> Labels: arm64, porting
>
> UnsafeShuffleWriterSuite fails because of missing Snappy native library on arm64.
> {code}
> Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 6.437 sec <<< FAILURE! - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
> mergeSpillsWithFileStreamAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite) Time elapsed: 0.072 sec <<< ERROR!
> java.lang.reflect.InvocationTargetException > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) > Caused by: java.lang.IllegalArgumentException: org.xerial.snappy.SnappyError: > [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux > and os.arch=aarch64 > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) > Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no > native library is found for os.name=Linux and os.arch=aarch64 > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) > mergeSpillsWithTransferToAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite) > Time elapsed: 0.041 sec <<< ERROR! 
> java.lang.reflect.InvocationTargetException > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) > Caused by: java.lang.IllegalArgumentException: > java.lang.NoClassDefFoundError: Could not initialize class > org.xerial.snappy.Snappy > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) > Caused by: java.lang.NoClassDefFoundError: Could not initialize class > org.xerial.snappy.Snappy > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) > Running org.apache.spark.JavaAPISuite > Tests run: 90, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 52.526 sec - > in org.apache.spark.JavaAPISuite > Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite > Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.761 sec - > in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite > Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite > Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.967 sec - > in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite > Running org.apache.spark.api.java.OptionalSuite > Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec - > in org.apache.spark.api.java.OptionalSu
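The failures above come down to codec availability at runtime: the Snappy binding has no pure fallback when its native library is missing for the host architecture. A hedged Python sketch of the detect-and-fall-back idea (illustrative only; Spark itself selects its codec via the spark.io.compression.codec setting rather than anything like this):

```python
# Hedged sketch: probe for a native-backed codec at runtime and fall back to
# a pure implementation that works on any architecture. Not Spark's mechanism.
import zlib


def pick_compression():
    """Return (name, compress, decompress), preferring snappy when its
    native library loads, else zlib, which is available everywhere."""
    try:
        import snappy  # python-snappy; import fails where no native lib exists
        return ("snappy", snappy.compress, snappy.decompress)
    except Exception:
        return ("zlib", zlib.compress, zlib.decompress)


name, compress, decompress = pick_compression()
payload = b"spark block bytes" * 10
assert decompress(compress(payload)) == payload  # round-trips either way
```

The sketch works identically on hosts with and without the native library, which is exactly the property the test suite lacks on aarch64.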
[jira] [Created] (SPARK-13694) QueryPlan.expressions should always include all expressions
Wenchen Fan created SPARK-13694: --- Summary: QueryPlan.expressions should always include all expressions Key: SPARK-13694 URL: https://issues.apache.org/jira/browse/SPARK-13694 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12073) Backpressure causes individual Kafka partitions to lag
[ https://issues.apache.org/jira/browse/SPARK-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-12073. -- Resolution: Fixed Assignee: Jason White Fix Version/s: 2.0.0 > Backpressure causes individual Kafka partitions to lag > -- > > Key: SPARK-12073 > URL: https://issues.apache.org/jira/browse/SPARK-12073 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.0.0 > > > We're seeing a growing lag on (2) individual Kafka partitions, on a topic > with 32 partitions. Our individual batch sessions are completing in 5-7s, > with a batch window of 30s, so there's plenty of room for Streaming to catch > up, but it looks to be intentionally limiting itself. These partitions are > experiencing unbalanced load (higher than most of the others) > What I believe is happening is that maxMessagesPerPartition calculates an > appropriate limit for the message rate from all partitions, and then divides > by the number of partitions to determine how many messages to retrieve per > partition. The problem with this approach is that when one partition is > behind by millions of records (due to random Kafka issues) or is experiencing > heavy load, the number of messages to be retrieved shouldn't be evenly split > among the partitions. In this scenario, if the rate estimator calculates only > 100k total messages can be retrieved, each partition (out of say 32) only > retrieves max 100k/32=3125 messages. > Under some conditions, this results in the backpressure keeping the lagging > partition from recovering. The PIDRateEstimator doesn't increase the number > of messages to retrieve enough to recover, and we stabilize at a point where > these individual partitions slowly grow. 
> I have a PR on our fork in progress to allocate the maxMessagesPerPartition > by weighting the number to be retrieved on the current lag each partition is > currently experiencing.
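The lag-weighted allocation the reporter describes could look roughly like the sketch below. This is illustrative only — the class and method names are hypothetical, not the actual PR's code: each partition's share of the total message budget is made proportional to its current lag, falling back to an even split when no partition is behind.

```java
import java.util.HashMap;
import java.util.Map;

public class LagWeightedAllocation {
    /**
     * Split a total message budget across partitions in proportion to each
     * partition's current lag, instead of dividing evenly as
     * maxMessagesPerPartition does today. Falls back to an even split when
     * no partition is lagging.
     */
    public static Map<Integer, Long> allocate(Map<Integer, Long> lagPerPartition,
                                              long totalBudget) {
        long totalLag = lagPerPartition.values().stream()
                .mapToLong(Long::longValue).sum();
        int n = lagPerPartition.size();
        Map<Integer, Long> result = new HashMap<>();
        for (Map.Entry<Integer, Long> e : lagPerPartition.entrySet()) {
            long share = (totalLag == 0)
                    ? totalBudget / n                        // no lag: even split
                    : totalBudget * e.getValue() / totalLag; // weight by lag
            result.put(e.getKey(), share);
        }
        return result;
    }
}
```

With a 100k-message budget and one partition holding 90% of the lag, that partition gets 90k messages rather than an even 1/n share, which is the recovery behavior the ticket is after.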
[jira] [Updated] (SPARK-12073) Backpressure causes individual Kafka partitions to lag
[ https://issues.apache.org/jira/browse/SPARK-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12073: - Affects Version/s: 1.6.1 1.6.0 > Backpressure causes individual Kafka partitions to lag > -- > > Key: SPARK-12073 > URL: https://issues.apache.org/jira/browse/SPARK-12073 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.0.0 > > > We're seeing a growing lag on (2) individual Kafka partitions, on a topic > with 32 partitions. Our individual batch sessions are completing in 5-7s, > with a batch window of 30s, so there's plenty of room for Streaming to catch > up, but it looks to be intentionally limiting itself. These partitions are > experiencing unbalanced load (higher than most of the others) > What I believe is happening is that maxMessagesPerPartition calculates an > appropriate limit for the message rate from all partitions, and then divides > by the number of partitions to determine how many messages to retrieve per > partition. The problem with this approach is that when one partition is > behind by millions of records (due to random Kafka issues) or is experiencing > heavy load, the number of messages to be retrieved shouldn't be evenly split > among the partitions. In this scenario, if the rate estimator calculates only > 100k total messages can be retrieved, each partition (out of say 32) only > retrieves max 100k/32=3125 messages. > Under some conditions, this results in the backpressure keeping the lagging > partition from recovering. The PIDRateEstimator doesn't increase the number > of messages to retrieve enough to recover, and we stabilize at a point where > these individual partitions slowly grow. > I have a PR on our fork in progress to allocate the maxMessagesPerPartition > by weighting the number to be retrieved on the current lag each partition is > currently experiencing. 
[jira] [Assigned] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13693: Assignee: Apache Spark (was: Shixiong Zhu) > Flaky test: o.a.s.streaming.MapWithStateSuite > - > > Key: SPARK-13693 > URL: https://issues.apache.org/jira/browse/SPARK-13693 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > Fixed the following flaky test: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/ > {code} > sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > {code}
[jira] [Assigned] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13693: Assignee: Shixiong Zhu (was: Apache Spark) > Flaky test: o.a.s.streaming.MapWithStateSuite > - > > Key: SPARK-13693 > URL: https://issues.apache.org/jira/browse/SPARK-13693 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > Fixed the following flaky test: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/ > {code} > sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > {code}
[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181303#comment-15181303 ] Apache Spark commented on SPARK-13693: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11531 > Flaky test: o.a.s.streaming.MapWithStateSuite > - > > Key: SPARK-13693 > URL: https://issues.apache.org/jira/browse/SPARK-13693 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > Fixed the following flaky test: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/ > {code} > sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > at > org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) > {code}
[jira] [Assigned] (SPARK-13692) Fix trivial Coverity/Checkstyle defects
[ https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13692: Assignee: Apache Spark > Fix trivial Coverity/Checkstyle defects > --- > > Key: SPARK-13692 > URL: https://issues.apache.org/jira/browse/SPARK-13692 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Trivial > > This issue fixes the following potential bugs and Java coding style detected > by Coverity and Checkstyle. > * Implement both null and type checking in equals functions. > * Fix wrong type casting logic in SimpleJavaBean2.equals. > * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. > * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. > * Fix coding style: Add '{}' to single `for` statement in mllib examples. > * Remove unused imports in `ColumnarBatch`. > Please note that the last two checkstyle errors exist on newly added commits > after [SPARK-13583].
[jira] [Assigned] (SPARK-13692) Fix trivial Coverity/Checkstyle defects
[ https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13692: Assignee: (was: Apache Spark) > Fix trivial Coverity/Checkstyle defects > --- > > Key: SPARK-13692 > URL: https://issues.apache.org/jira/browse/SPARK-13692 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue fixes the following potential bugs and Java coding style detected > by Coverity and Checkstyle. > * Implement both null and type checking in equals functions. > * Fix wrong type casting logic in SimpleJavaBean2.equals. > * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. > * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. > * Fix coding style: Add '{}' to single `for` statement in mllib examples. > * Remove unused imports in `ColumnarBatch`. > Please note that the last two checkstyle errors exist on newly added commits > after [SPARK-13583].
[jira] [Commented] (SPARK-13692) Fix trivial Coverity/Checkstyle defects
[ https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181298#comment-15181298 ] Apache Spark commented on SPARK-13692: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11530 > Fix trivial Coverity/Checkstyle defects > --- > > Key: SPARK-13692 > URL: https://issues.apache.org/jira/browse/SPARK-13692 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue fixes the following potential bugs and Java coding style detected > by Coverity and Checkstyle. > * Implement both null and type checking in equals functions. > * Fix wrong type casting logic in SimpleJavaBean2.equals. > * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. > * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. > * Fix coding style: Add '{}' to single `for` statement in mllib examples. > * Remove unused imports in `ColumnarBatch`. > Please note that the last two checkstyle errors exist on newly added commits > after [SPARK-13583].
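The first item on SPARK-13692's list — null and type checking in equals functions — follows the standard Java pattern; a minimal sketch (the Point class here is a made-up example, not code from the patch):

```java
import java.util.Objects;

public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        // instanceof covers both checks at once: it is false for null
        // and false for any non-Point type, so no cast can fail below.
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() { return Objects.hash(x, y); }
}
```

Skipping either check is what Coverity flags: a bare cast throws ClassCastException on a foreign type, and dereferencing before the null check throws NullPointerException.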
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181294#comment-15181294 ] Xusen Yin commented on SPARK-13641: --- Or maybe we try to find another way to extract the feature names. > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets features' names from the features column. Usually, the > features column is generated by RFormula. The latter has a VectorAssembler in > it, which leads the output attributes not equal with the original ones. > E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". > But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > adds salts of the column names.
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181130#comment-15181130 ] Xusen Yin commented on SPARK-13641: --- I am not quite familiar with it but I think it's intentional. Refer to the JIRA [SPARK 7198|https://issues.apache.org/jira/browse/SPARK-7198] and bring [~mengxr] here. > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets features' names from the features column. Usually, the > features column is generated by RFormula. The latter has a VectorAssembler in > it, which leads the output attributes not equal with the original ones. > E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". > But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > adds salts of the column names.
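To illustrate the renaming the ticket describes: the assembled attribute names carry an appended category value, e.g. "V1_n" for the base column "V1". A naive workaround is to strip the suffix after the last underscore. The helper below is hypothetical (not SparkR's actual fix), and it is unreliable whenever the original names themselves contain underscores — which is part of why a better extraction path is being asked for.

```java
public class FeatureNames {
    /**
     * Recover a base column name from a VectorAssembler-style attribute
     * name such as "V16_y" by dropping the category suffix after the
     * last underscore. Names without an underscore pass through
     * unchanged. Breaks for base names that contain '_' themselves.
     */
    public static String baseName(String attrName) {
        int i = attrName.lastIndexOf('_');
        return (i < 0) ? attrName : attrName.substring(0, i);
    }
}
```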
[jira] [Updated] (SPARK-13692) Fix trivial Coverity/Checkstyle defects
[ https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13692: -- Description: This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. * Implement both null and type checking in equals functions. * Fix wrong type casting logic in SimpleJavaBean2.equals. * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. * Fix coding style: Add '{}' to single `for` statement in mllib examples. * Remove unused imports in `ColumnarBatch`. Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583]. was: This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. * Implement both null and type checking in equals functions. * Fix wrong type casting logic in SimpleJavaBean2.equals. * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. * Fix coding style: Add '{}' to single `for` statement in mllib examples. Please note that the checkstyle errors exist on newly added commits after [SPARK-13583]. > Fix trivial Coverity/Checkstyle defects > --- > > Key: SPARK-13692 > URL: https://issues.apache.org/jira/browse/SPARK-13692 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue fixes the following potential bugs and Java coding style detected > by Coverity and Checkstyle. > * Implement both null and type checking in equals functions. > * Fix wrong type casting logic in SimpleJavaBean2.equals. > * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. > * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. > * Fix coding style: Add '{}' to single `for` statement in mllib examples. 
> * Remove unused imports in `ColumnarBatch`. > Please note that the last two checkstyle errors exist on newly added commits > after [SPARK-13583].
[jira] [Created] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite
Shixiong Zhu created SPARK-13693: Summary: Flaky test: o.a.s.streaming.MapWithStateSuite Key: SPARK-13693 URL: https://issues.apache.org/jira/browse/SPARK-13693 Project: Spark Issue Type: Test Components: Tests Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fixed the following flaky test: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/ {code} sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934) at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47) at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) at org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45) {code}
[jira] [Created] (SPARK-13692) Fix trivial Coverity/Checkstyle defects
Dongjoon Hyun created SPARK-13692: - Summary: Fix trivial Coverity/Checkstyle defects Key: SPARK-13692 URL: https://issues.apache.org/jira/browse/SPARK-13692 Project: Spark Issue Type: Bug Components: Examples, Spark Core, SQL Reporter: Dongjoon Hyun Priority: Trivial This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. * Implement both null and type checking in equals functions. * Fix wrong type casting logic in SimpleJavaBean2.equals. * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. * Fix coding style: Add '{}' to single `for` statement in mllib examples. Please note that the checkstyle errors exist on newly added commits after [SPARK-13583].
[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Fix Version/s: (was: 1.6.0) > Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java > objects with columns > -- > > Key: SPARK-13605 > URL: https://issues.apache.org/jira/browse/SPARK-13605 > Project: Spark > Issue Type: New Feature > Components: Java API, SQL >Affects Versions: 1.6.0 > Environment: Any >Reporter: Steven Lewis > Labels: easytest, features > > in the current environment the only way to turn a List or JavaRDD into a > DataSet with columns is to use a Encoders.bean(MyBean.class); The current > implementation fails if a Bean property is not a basic type or a Bean. > I would like to see one of the following > 1) Default to JavaSerialization for any Java Object implementing Serializable > when using bean Encoder > 2) Allow an encoder which is a Map and look up entries in > encoding classes - an ideal implementation would look for the class then any > interfaces and then search base classes > The following code illustrates the issue > {code} > /** > * This class is a good Java bean but one field holds an object > * which is not a bean > */ > public class MyBean implements Serializable { > private int m_count; > private String m_Name; > private MyUnBean m_UnBean; > public MyBean(int count, String name, MyUnBean unBean) { > m_count = count; > m_Name = name; > m_UnBean = unBean; > } > public int getCount() {return m_count; } > public void setCount(int count) {m_count = count;} > public String getName() {return m_Name;} > public void setName(String name) {m_Name = name;} > public MyUnBean getUnBean() {return m_UnBean;} > public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;} > } > /** > * This is a Java object which is not a bean > * no getters or setters but is serializable > */ > public class MyUnBean implements Serializable { > public final int count; > public final 
String name; > public MyUnBean(int count, String name) { > this.count = count; > this.name = name; > } > } > ** > * This code creates a list of objects containing MyBean - > * a Java Bean containing one field which is not bean > * It then attempts and fails to use a bean encoder > * to make a DataSet > */ > public class DatasetTest { > public static final Random RND = new Random(); > public static final int LIST_SIZE = 100; > public static String makeName() { > return Integer.toString(RND.nextInt()); > } > public static MyUnBean makeUnBean() { > return new MyUnBean(RND.nextInt(), makeName()); > } > public static MyBean makeBean() { > return new MyBean(RND.nextInt(), makeName(), makeUnBean()); > } > /** > * Make a list of MyBeans > * @return > */ > public static List makeBeanList() { > List holder = new ArrayList(); > for (int i = 0; i < LIST_SIZE; i++) { > holder.add(makeBean()); > } > return holder; > } > public static SQLContext getSqlContext() { > SparkConf sparkConf = new SparkConf(); > sparkConf.setAppName("BeanTest") ; > Option option = sparkConf.getOption("spark.master"); > if (!option.isDefined())// use local over nothing > sparkConf.setMaster("local[*]"); > JavaSparkContext ctx = new JavaSparkContext(sparkConf) ; > return new SQLContext(ctx); > } > public static void main(String[] args) { > SQLContext sqlContext = getSqlContext(); > Encoder evidence = Encoders.bean(MyBean.class); > Encoder evidence2 = > Encoders.javaSerialization(MyUnBean.class); > List holder = makeBeanList(); > // fails at this line with > // Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for com.lordjoe.testing.MyUnBean > Dataset beanSet = sqlContext.createDataset( holder, > evidence); > long count = beanSet.count(); > if(count != LIST_SIZE) > throw new IllegalStateException("bad count"); > } > } > {code}
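The ticket's second proposal — an encoder registry searched by class, then interfaces, then base classes — could be sketched generically like this. EncoderRegistry and its lookup method are hypothetical names for illustration, not a Spark API; String stands in for whatever encoder type would be registered.

```java
import java.util.Map;

public class EncoderRegistry {
    /**
     * Look up a value registered for a class, trying the class itself,
     * then its declared interfaces, then walking up the superclass
     * chain - the search order the ticket proposes. Returns null when
     * nothing in the hierarchy is registered.
     */
    public static <V> V lookup(Map<Class<?>, V> registry, Class<?> cls) {
        for (Class<?> c = cls; c != null; c = c.getSuperclass()) {
            V hit = registry.get(c);
            if (hit != null) return hit;          // exact class wins first
            for (Class<?> iface : c.getInterfaces()) {
                hit = registry.get(iface);        // then any interface
                if (hit != null) return hit;
            }
        }
        return null;                              // then superclasses, via the loop
    }
}
```

Registering a fallback under Serializable would then give every Serializable class an encoder, while an exact-class entry still takes precedence — which covers the ticket's first proposal as a special case.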
[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Target Version/s: 2.0.0 (was: 1.6.0) > Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java > objects with columns > -- > > Key: SPARK-13605 > URL: https://issues.apache.org/jira/browse/SPARK-13605 > Project: Spark > Issue Type: New Feature > Components: Java API, SQL >Affects Versions: 1.6.0 > Environment: Any >Reporter: Steven Lewis > Labels: easytest, features > Fix For: 1.6.0 > > > in the current environment the only way to turn a List or JavaRDD into a > DataSet with columns is to use a Encoders.bean(MyBean.class); The current > implementation fails if a Bean property is not a basic type or a Bean. > I would like to see one of the following > 1) Default to JavaSerialization for any Java Object implementing Serializable > when using bean Encoder > 2) Allow an encoder which is a Map and look up entries in > encoding classes - an ideal implementation would look for the class then any > interfaces and then search base classes > The following code illustrates the issue > /** > * This class is a good Java bean but one field holds an object > * which is not a bean > */ > public class MyBean implements Serializable { > private int m_count; > private String m_Name; > private MyUnBean m_UnBean; > public MyBean(int count, String name, MyUnBean unBean) { > m_count = count; > m_Name = name; > m_UnBean = unBean; > } > public int getCount() {return m_count; } > public void setCount(int count) {m_count = count;} > public String getName() {return m_Name;} > public void setName(String name) {m_Name = name;} > public MyUnBean getUnBean() {return m_UnBean;} > public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;} > } > /** > * This is a Java object which is not a bean > * no getters or setters but is serializable > */ > public class MyUnBean implements Serializable { > public final int count; 
> public final String name; > public MyUnBean(int count, String name) { > this.count = count; > this.name = name; > } > } > ** > * This code creates a list of objects containing MyBean - > * a Java Bean containing one field which is not bean > * It then attempts and fails to use a bean encoder > * to make a DataSet > */ > public class DatasetTest { > public static final Random RND = new Random(); > public static final int LIST_SIZE = 100; > public static String makeName() { > return Integer.toString(RND.nextInt()); > } > public static MyUnBean makeUnBean() { > return new MyUnBean(RND.nextInt(), makeName()); > } > public static MyBean makeBean() { > return new MyBean(RND.nextInt(), makeName(), makeUnBean()); > } > /** > * Make a list of MyBeans > * @return > */ > public static List makeBeanList() { > List holder = new ArrayList(); > for (int i = 0; i < LIST_SIZE; i++) { > holder.add(makeBean()); > } > return holder; > } > public static SQLContext getSqlContext() { > SparkConf sparkConf = new SparkConf(); > sparkConf.setAppName("BeanTest") ; > Option option = sparkConf.getOption("spark.master"); > if (!option.isDefined())// use local over nothing > sparkConf.setMaster("local[*]"); > JavaSparkContext ctx = new JavaSparkContext(sparkConf) ; > return new SQLContext(ctx); > } > public static void main(String[] args) { > SQLContext sqlContext = getSqlContext(); > Encoder evidence = Encoders.bean(MyBean.class); > Encoder evidence2 = > Encoders.javaSerialization(MyUnBean.class); > List holder = makeBeanList(); > // fails at this line with > // Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for com.lordjoe.testing.MyUnBean > Dataset beanSet = sqlContext.createDataset( holder, > evidence); > long count = beanSet.count(); > if(count != LIST_SIZE) > throw new IllegalStateException("bad count"); > } > }
[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Component/s: SQL > Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java > objects with columns > -- > > Key: SPARK-13605 > URL: https://issues.apache.org/jira/browse/SPARK-13605 > Project: Spark > Issue Type: New Feature > Components: Java API, SQL >Affects Versions: 1.6.0 > Environment: Any >Reporter: Steven Lewis > Labels: easytest, features > Fix For: 1.6.0 > > > in the current environment the only way to turn a List or JavaRDD into a > DataSet with columns is to use a Encoders.bean(MyBean.class); The current > implementation fails if a Bean property is not a basic type or a Bean. > I would like to see one of the following > 1) Default to JavaSerialization for any Java Object implementing Serializable > when using bean Encoder > 2) Allow an encoder which is a Map and look up entries in > encoding classes - an ideal implementation would look for the class then any > interfaces and then search base classes > The following code illustrates the issue > /** > * This class is a good Java bean but one field holds an object > * which is not a bean > */ > public class MyBean implements Serializable { > private int m_count; > private String m_Name; > private MyUnBean m_UnBean; > public MyBean(int count, String name, MyUnBean unBean) { > m_count = count; > m_Name = name; > m_UnBean = unBean; > } > public int getCount() {return m_count; } > public void setCount(int count) {m_count = count;} > public String getName() {return m_Name;} > public void setName(String name) {m_Name = name;} > public MyUnBean getUnBean() {return m_UnBean;} > public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;} > } > /** > * This is a Java object which is not a bean > * no getters or setters but is serializable > */ > public class MyUnBean implements Serializable { > public final int count; > public final 
String name; > public MyUnBean(int count, String name) { > this.count = count; > this.name = name; > } > } > ** > * This code creates a list of objects containing MyBean - > * a Java Bean containing one field which is not bean > * It then attempts and fails to use a bean encoder > * to make a DataSet > */ > public class DatasetTest { > public static final Random RND = new Random(); > public static final int LIST_SIZE = 100; > public static String makeName() { > return Integer.toString(RND.nextInt()); > } > public static MyUnBean makeUnBean() { > return new MyUnBean(RND.nextInt(), makeName()); > } > public static MyBean makeBean() { > return new MyBean(RND.nextInt(), makeName(), makeUnBean()); > } > /** > * Make a list of MyBeans > * @return > */ > public static List makeBeanList() { > List holder = new ArrayList(); > for (int i = 0; i < LIST_SIZE; i++) { > holder.add(makeBean()); > } > return holder; > } > public static SQLContext getSqlContext() { > SparkConf sparkConf = new SparkConf(); > sparkConf.setAppName("BeanTest") ; > Option option = sparkConf.getOption("spark.master"); > if (!option.isDefined())// use local over nothing > sparkConf.setMaster("local[*]"); > JavaSparkContext ctx = new JavaSparkContext(sparkConf) ; > return new SQLContext(ctx); > } > public static void main(String[] args) { > SQLContext sqlContext = getSqlContext(); > Encoder evidence = Encoders.bean(MyBean.class); > Encoder evidence2 = > Encoders.javaSerialization(MyUnBean.class); > List holder = makeBeanList(); > // fails at this line with > // Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for com.lordjoe.testing.MyUnBean > Dataset beanSet = sqlContext.createDataset( holder, > evidence); > long count = beanSet.count(); > if(count != LIST_SIZE) > throw new IllegalStateException("bad count"); > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
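The failure reported above comes down to JavaBean introspection: `Encoders.bean` maps getter/setter pairs to columns, and `MyUnBean` has neither. A minimal, self-contained sketch (class body copied from the reporter's code; the `BeanCheck` harness is invented for illustration) shows that the standard `java.beans.Introspector` finds no properties on such a class, which is exactly what a bean-based encoder has to work with:

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;

public class BeanCheck {
    // Mirrors the reporter's MyUnBean: serializable, but no getters or setters.
    public static class MyUnBean implements Serializable {
        public final int count;
        public final String name;
        public MyUnBean(int count, String name) {
            this.count = count;
            this.name = name;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stop at Object so only MyUnBean's own properties are counted.
        BeanInfo info = Introspector.getBeanInfo(MyUnBean.class, Object.class);
        PropertyDescriptor[] props = info.getPropertyDescriptors();
        // Public final fields are not bean properties, so there is nothing
        // for a bean encoder to map to columns.
        System.out.println("bean properties: " + props.length);
    }
}
```

With zero discoverable properties, the encoder has no schema to build, hence the `no encoder found for ... MyUnBean` exception; the feature request is for a fallback (e.g. Java serialization) in that case.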
[jira] [Updated] (SPARK-13605) Bean encoder cannot handle nonbean properties - no way to Encode nonbean Java objects with columns
[ https://issues.apache.org/jira/browse/SPARK-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13605: - Description: in the current environment the only way to turn a List or JavaRDD into a DataSet with columns is to use a Encoders.bean(MyBean.class); The current implementation fails if a Bean property is not a basic type or a Bean. I would like to see one of the following 1) Default to JavaSerialization for any Java Object implementing Serializable when using bean Encoder 2) Allow an encoder which is a Map and look up entries in encoding classes - an ideal implementation would look for the class then any interfaces and then search base classes The following code illustrates the issue {code} /** * This class is a good Java bean but one field holds an object * which is not a bean */ public class MyBean implements Serializable { private int m_count; private String m_Name; private MyUnBean m_UnBean; public MyBean(int count, String name, MyUnBean unBean) { m_count = count; m_Name = name; m_UnBean = unBean; } public int getCount() {return m_count; } public void setCount(int count) {m_count = count;} public String getName() {return m_Name;} public void setName(String name) {m_Name = name;} public MyUnBean getUnBean() {return m_UnBean;} public void setUnBean(MyUnBean unBean) {m_UnBean = unBean;} } /** * This is a Java object which is not a bean * no getters or setters but is serializable */ public class MyUnBean implements Serializable { public final int count; public final String name; public MyUnBean(int count, String name) { this.count = count; this.name = name; } } ** * This code creates a list of objects containing MyBean - * a Java Bean containing one field which is not bean * It then attempts and fails to use a bean encoder * to make a DataSet */ public class DatasetTest { public static final Random RND = new Random(); public static final int LIST_SIZE = 100; public static String makeName() { return 
Integer.toString(RND.nextInt()); } public static MyUnBean makeUnBean() { return new MyUnBean(RND.nextInt(), makeName()); } public static MyBean makeBean() { return new MyBean(RND.nextInt(), makeName(), makeUnBean()); } /** * Make a list of MyBeans * @return */ public static List makeBeanList() { List holder = new ArrayList(); for (int i = 0; i < LIST_SIZE; i++) { holder.add(makeBean()); } return holder; } public static SQLContext getSqlContext() { SparkConf sparkConf = new SparkConf(); sparkConf.setAppName("BeanTest") ; Option option = sparkConf.getOption("spark.master"); if (!option.isDefined())// use local over nothing sparkConf.setMaster("local[*]"); JavaSparkContext ctx = new JavaSparkContext(sparkConf) ; return new SQLContext(ctx); } public static void main(String[] args) { SQLContext sqlContext = getSqlContext(); Encoder evidence = Encoders.bean(MyBean.class); Encoder evidence2 = Encoders.javaSerialization(MyUnBean.class); List holder = makeBeanList(); // fails at this line with // Exception in thread "main" java.lang.UnsupportedOperationException: no encoder found for com.lordjoe.testing.MyUnBean Dataset beanSet = sqlContext.createDataset( holder, evidence); long count = beanSet.count(); if(count != LIST_SIZE) throw new IllegalStateException("bad count"); } } {code} was: in the current environment the only way to turn a List or JavaRDD into a DataSet with columns is to use a Encoders.bean(MyBean.class); The current implementation fails if a Bean property is not a basic type or a Bean. 
I would like to see one of the following 1) Default to JavaSerialization for any Java Object implementing Serializable when using bean Encoder 2) Allow an encoder which is a Map and look up entries in encoding classes - an ideal implementation would look for the class then any interfaces and then search base classes The following code illustrates the issue /** * This class is a good Java bean but one field holds an object * which is not a bean */ public class MyBean implements Serializable { private int m_count; private String m_Name; private MyUnBean m_UnBean; public MyBean(int count, String name, MyUnBean unBean) { m_count = count; m_Name = name; m_UnBean = unBean; } public int getCount() {return m_count; } public void setCount(int count) {m_count = count;} public String getName() {return m_Name;} public void setName(String name) {m_Name = name;} public MyUnBean getUnBean() {return m_UnBean;} public void setUnBean(MyUnBean un
[jira] [Updated] (SPARK-13691) Scala and Python generate inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13691: - Affects Version/s: 1.4.1 1.5.2 1.6.0 > Scala and Python generate inconsistent results > -- > > Key: SPARK-13691 > URL: https://issues.apache.org/jira/browse/SPARK-13691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu > > Here is an example that Scala and Python generate different results > {code} > Scala: > scala> var i = 0 > i: Int = 0 > scala> val rdd = sc.parallelize(1 to 10).map(_ + i) > scala> rdd.collect() > res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) > scala> i += 1 > scala> rdd.collect() > res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11) > Python: > >>> i = 0 > >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i) > >>> rdd.collect() > [1, 2, 3, 4, 5, 6, 7, 8, 9] > >>> i += 1 > >>> rdd.collect() > [1, 2, 3, 4, 5, 6, 7, 8, 9] > {code} > The difference is Scala will capture all variables' values when running a job > every time, but Python just captures variables' values once and always uses > them for all jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
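The two capture semantics described above can be reproduced in plain Python, without Spark (names are illustrative): a closure that reads the variable at call time behaves like the Scala example, while freezing the variable's current value once behaves like the reported PySpark behavior.

```python
i = 0
by_reference = lambda x: x + i                 # re-reads i on every call (Scala-like)
snapshot = (lambda captured: lambda x: x + captured)(i)  # freezes i's value now (PySpark-like)

data = list(range(1, 10))
print([by_reference(x) for x in data])  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print([snapshot(x) for x in data])      # [1, 2, 3, 4, 5, 6, 7, 8, 9]

i += 1
print([by_reference(x) for x in data])  # [2, 3, ..., 10] -- sees the new i
print([snapshot(x) for x in data])      # [1, 2, ..., 9]  -- unchanged
```

In PySpark the closure is effectively serialized once when the RDD is defined, so later jobs see the frozen value, whereas Scala re-captures on each job.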
[jira] [Resolved] (SPARK-13255) Integrate vectorized parquet scan with whole stage codegen.
[ https://issues.apache.org/jira/browse/SPARK-13255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13255. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11435 [https://github.com/apache/spark/pull/11435] > Integrate vectorized parquet scan with whole stage codegen. > --- > > Key: SPARK-13255 > URL: https://issues.apache.org/jira/browse/SPARK-13255 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Nong Li > Fix For: 2.0.0 > > > The generated whole stage codegen is intended to be run over batches of rows. > This task is to integrate ColumnarBatches with whole stage codegen. > The resulting generated code should look something like: > {code} > Iterator input; > void process() { > while (input.hasNext()) { > ColumnarBatch batch = input.next(); > for (Row: batch) { > // Current function > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
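The pseudocode in the issue above can be rendered as a runnable sketch; here a plain `int[]` stands in for a `ColumnarBatch` column (that substitution and the summing "current function" are assumptions for illustration, not Spark's actual generated code):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BatchLoop {
    public static void main(String[] args) {
        // Stand-in for an Iterator<ColumnarBatch>: each batch is one int column.
        List<int[]> batches = Arrays.asList(new int[]{1, 2, 3}, new int[]{4, 5, 6});
        Iterator<int[]> input = batches.iterator();

        long sum = 0;
        // Shape of the generated whole-stage code: pull one batch at a time,
        // then run a tight per-row loop with the current function inlined.
        while (input.hasNext()) {
            int[] batch = input.next();
            for (int row : batch) {
                sum += row; // inlined per-row work
            }
        }
        System.out.println("sum = " + sum);
    }
}
```

The point of the integration is that the inner loop stays tight and vectorization-friendly, rather than paying a virtual call per row.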
[jira] [Commented] (SPARK-13691) Scala and Python generate inconsistent results
[ https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180724#comment-15180724 ] Shixiong Zhu commented on SPARK-13691: -- Ideally, PySpark should always capture all values when running a job like Scala. > Scala and Python generate inconsistent results > -- > > Key: SPARK-13691 > URL: https://issues.apache.org/jira/browse/SPARK-13691 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Shixiong Zhu > > Here is an example that Scala and Python generate different results > {code} > Scala: > scala> var i = 0 > i: Int = 0 > scala> val rdd = sc.parallelize(1 to 10).map(_ + i) > scala> rdd.collect() > res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) > scala> i += 1 > scala> rdd.collect() > res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11) > Python: > >>> i = 0 > >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i) > >>> rdd.collect() > [1, 2, 3, 4, 5, 6, 7, 8, 9] > >>> i += 1 > >>> rdd.collect() > [1, 2, 3, 4, 5, 6, 7, 8, 9] > {code} > The difference is Scala will capture all variables' values when running a job > every time, but Python just captures variables' values once and always uses > them for all jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10086) Flaky StreamingKMeans test in PySpark
[ https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180723#comment-15180723 ] Shixiong Zhu commented on SPARK-10086: -- I opened SPARK-13691 to describe the root issue in PySpark. > Flaky StreamingKMeans test in PySpark > - > > Key: SPARK-10086 > URL: https://issues.apache.org/jira/browse/SPARK-10086 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark, Streaming, Tests >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Critical > Attachments: flakyRepro.py > > > Here's a report on investigating test failures in StreamingKMeans in PySpark. > (See Jenkins links below.) > It is a StreamingKMeans test which trains on a DStream with 2 batches and > then tests on those same 2 batches. It fails here: > [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144] > I recreated the same test, with variants training on: (1) the original 2 > batches, (2) just the first batch, (3) just the second batch, and (4) neither > batch. Here is code which avoids Streaming altogether to identify what > batches were processed. > {code} > from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel > batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]] > batches = [sc.parallelize(batch) for batch in batches] > stkm = StreamingKMeans(decayFactor=0.0, k=2) > stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0]) > # Train > def update(rdd): > stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit) > # Remove one or both of these lines to test skipping batches. 
> update(batches[0]) > update(batches[1]) > # Test > def predict(rdd): > return stkm._model.predict(rdd) > predict(batches[0]).collect() > predict(batches[1]).collect() > {code} > *Results*: > {code} > ### EXPECTED > [0, 1, 1] > > [1, 0, 1] > ### Skip batch 0 > [1, 0, 0] > [0, 1, 0] > ### Skip batch 1 > [0, 1, 1] > [1, 0, 1] > ### Skip both batches (This is what we see in the test > failures.) > [0, 1, 1] > [0, 0, 0] > {code} > Skipping both batches reproduces the failure. There is no randomness in the > StreamingKMeans algorithm (since initial centers are fixed, not randomized). > CC: [~tdas] [~freeman-lab] [~mengxr] > Failure message: > {code} > == > FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest) > Test that prediction happens on the updated model. > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 1147, in test_trainOn_predictOn > self._eventually(condition, catch_assertions=True) > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 123, in _eventually > raise lastValue > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 114, in _eventually > lastValue = condition() > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 1144, in condition > self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]]) > AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]] > First differing element 1: > [0, 0, 0] > [1, 0, 1] > - [[0, 1, 1], [0, 0, 0]] > ? > + [[0, 1, 1], [1, 0, 1]] > ? +++ ^ > -- > Ran 62 tests in 164.188s > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
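The "skip both batches" failure mode above can be reproduced with a simplified stand-in for `StreamingKMeansModel.update` (with `decayFactor=0.0`, each update effectively snaps a center to the mean of its assigned points; this 1-D toy model is an assumption for illustration, not the MLlib implementation):

```python
def assign(point, centers):
    # Index of the nearest center (1-D points, absolute distance).
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))

def update(centers, batch):
    # decayFactor=0.0 analogue: each center becomes the mean of its points.
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for p in batch:
        i = assign(p, centers)
        sums[i] += p
        counts[i] += 1
    return [sums[i] / counts[i] if counts[i] else centers[i]
            for i in range(len(centers))]

batches = [[-0.5, 0.6, 0.8], [0.2, -0.1, 0.3]]
centers = [0.0, 1.0]              # same initial centers as the test

trained = update(update(centers, batches[0]), batches[1])
untrained = centers               # what skipping both updates leaves behind

print([assign(p, trained) for p in batches[0]])    # [0, 1, 1]
print([assign(p, trained) for p in batches[1]])    # [1, 0, 1]
print([assign(p, untrained) for p in batches[1]])  # [0, 0, 0]
```

Training on both batches reproduces the expected `[[0, 1, 1], [1, 0, 1]]`, while predicting with the untouched initial centers yields `[0, 0, 0]` on the second batch, exactly the signature seen in the flaky failures.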
[jira] [Created] (SPARK-13691) Scala and Python generate inconsistent results
Shixiong Zhu created SPARK-13691: Summary: Scala and Python generate inconsistent results Key: SPARK-13691 URL: https://issues.apache.org/jira/browse/SPARK-13691 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu Here is an example that Scala and Python generate different results {code} Scala: scala> var i = 0 i: Int = 0 scala> val rdd = sc.parallelize(1 to 10).map(_ + i) scala> rdd.collect() res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) scala> i += 1 scala> rdd.collect() res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11) Python: >>> i = 0 >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i) >>> rdd.collect() [1, 2, 3, 4, 5, 6, 7, 8, 9] >>> i += 1 >>> rdd.collect() [1, 2, 3, 4, 5, 6, 7, 8, 9] {code} The difference is Scala will capture all variables' values when running a job every time, but Python just captures variables' values once and always uses them for all jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13459) Separate Alive and Dead Executors in Executor Totals Table
[ https://issues.apache.org/jira/browse/SPARK-13459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-13459. --- Resolution: Fixed Fix Version/s: 2.0.0 > Separate Alive and Dead Executors in Executor Totals Table > -- > > Key: SPARK-13459 > URL: https://issues.apache.org/jira/browse/SPARK-13459 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Alex Bozarth >Assignee: Alex Bozarth >Priority: Minor > Fix For: 2.0.0 > > > Now that dead executors are shown in the executors table (SPARK-7729) the > totals table added in SPARK-12716 should be updated to include the separate > totals for alive and dead executors as well as the current total. > (This improvement was originally discussed in the PR for SPARK-12716 while > SPARK-7729 was still in progress.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13690) UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found)
[ https://issues.apache.org/jira/browse/SPARK-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180713#comment-15180713 ] Sean Owen commented on SPARK-13690: --- Hm, I doubt Spark would be considered supported on a platform like this for reasons like this. But I thought snappy-java would fall back to non-native code? > UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is > found) > - > > Key: SPARK-13690 > URL: https://issues.apache.org/jira/browse/SPARK-13690 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 > Environment: $ java -version > java version "1.8.0_73" > Java(TM) SE Runtime Environment (build 1.8.0_73-b02) > Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode) > $ uname -a > Linux spark-on-arm 4.2.0-55598-g45f70e3 #5 SMP Tue Feb 2 10:14:08 CET 2016 > aarch64 aarch64 aarch64 GNU/Linux >Reporter: Santiago M. Mola >Priority: Minor > Labels: arm64, porting > > UnsafeShuffleWriterSuite fails because of missing Snappy native library on > arm64. > {code} > Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 6.437 sec > <<< FAILURE! - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite > mergeSpillsWithFileStreamAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite) > Time elapsed: 0.072 sec <<< ERROR! 
> java.lang.reflect.InvocationTargetException > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) > Caused by: java.lang.IllegalArgumentException: org.xerial.snappy.SnappyError: > [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux > and os.arch=aarch64 > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) > Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no > native library is found for os.name=Linux and os.arch=aarch64 > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) > mergeSpillsWithTransferToAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite) > Time elapsed: 0.041 sec <<< ERROR! 
> java.lang.reflect.InvocationTargetException > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) > Caused by: java.lang.IllegalArgumentException: > java.lang.NoClassDefFoundError: Could not initialize class > org.xerial.snappy.Snappy > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) > Caused by: java.lang.NoClassDefFoundError: Could not initialize class > org.xerial.snappy.Snappy > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) > Running org.apache.spark.JavaAPISuite > Tests run: 90, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 52.526 sec - > in org.apache.spark.JavaAPISuite > Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite > Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.761 sec - > in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite > Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite > Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.967 sec - > in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite > Running org.apache.spark.api.java.OptionalSuite > Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec - > in org.apache.spark.api.java.OptionalSuite > Results : > Tests in error: > > UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy:389->testMergingSpills:337 > » InvocationTarget > > UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy:384->testMergingSpills:337 > » 
InvocationTarget > {code}
[jira] [Updated] (SPARK-13459) Separate Alive and Dead Executors in Executor Totals Table
[ https://issues.apache.org/jira/browse/SPARK-13459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-13459: -- Assignee: Alex Bozarth > Separate Alive and Dead Executors in Executor Totals Table > -- > > Key: SPARK-13459 > URL: https://issues.apache.org/jira/browse/SPARK-13459 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Alex Bozarth >Assignee: Alex Bozarth >Priority: Minor > > Now that dead executors are shown in the executors table (SPARK-7729) the > totals table added in SPARK-12716 should be updated to include the separate > totals for alive and dead executors as well as the current total. > (This improvement was originally discussed in the PR for SPARK-12716 while > SPARK-7729 was still in progress.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13523) Reuse the exchanges in a query
[ https://issues.apache.org/jira/browse/SPARK-13523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra resolved SPARK-13523. -- Resolution: Duplicate > Reuse the exchanges in a query > -- > > Key: SPARK-13523 > URL: https://issues.apache.org/jira/browse/SPARK-13523 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > In exchange, the RDD will be materialized (shuffled or collected), it's a > good point to eliminate common part of a query. > In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange > or BroadcastExchange) could be used multiple times, we should re-use them to > avoid the duplicated work and reduce the memory for broadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
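The reuse idea above — materialize each distinct exchange once and hand the same result to every consumer — can be sketched with a memoized materialization step (the plan key and `materialize` function are illustrative stand-ins, not Spark APIs):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def materialize(exchange_key):
    # Stand-in for shuffling or broadcasting; count calls to prove reuse.
    materialize.calls += 1
    return f"materialized({exchange_key})"
materialize.calls = 0

# Q64-style plan where the same exchange feeds two different joins:
left = materialize("shuffle(store_sales by item_sk)")
right = materialize("shuffle(store_sales by item_sk)")

print(materialize.calls)   # the expensive work ran only once
print(left is right)       # both consumers share one materialized result
```

Keying the cache on the canonicalized exchange plan is what lets duplicated subtrees in a query collapse to a single shuffle or broadcast.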
[jira] [Created] (SPARK-13690) UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found)
Santiago M. Mola created SPARK-13690: Summary: UnsafeShuffleWriterSuite fails on arm64 (SnappyError, no native library is found) Key: SPARK-13690 URL: https://issues.apache.org/jira/browse/SPARK-13690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Environment: $ java -version java version "1.8.0_73" Java(TM) SE Runtime Environment (build 1.8.0_73-b02) Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode) $ uname -a Linux spark-on-arm 4.2.0-55598-g45f70e3 #5 SMP Tue Feb 2 10:14:08 CET 2016 aarch64 aarch64 aarch64 GNU/Linux Reporter: Santiago M. Mola Priority: Minor UnsafeShuffleWriterSuite fails because of missing Snappy native library on arm64. {code} Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 6.437 sec <<< FAILURE! - in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite mergeSpillsWithFileStreamAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite) Time elapsed: 0.072 sec <<< ERROR! java.lang.reflect.InvocationTargetException at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) Caused by: java.lang.IllegalArgumentException: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux and os.arch=aarch64 at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux and os.arch=aarch64 at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy(UnsafeShuffleWriterSuite.java:389) mergeSpillsWithTransferToAndSnappy(org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite) Time elapsed: 0.041 sec <<< ERROR! java.lang.reflect.InvocationTargetException at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) Caused by: java.lang.IllegalArgumentException: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.testMergingSpills(UnsafeShuffleWriterSuite.java:337) at org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy(UnsafeShuffleWriterSuite.java:384) Running org.apache.spark.JavaAPISuite Tests run: 90, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 52.526 sec - in org.apache.spark.JavaAPISuite Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.761 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.967 sec - in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite Running org.apache.spark.api.java.OptionalSuite Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec - in org.apache.spark.api.java.OptionalSuite Results : Tests in error: 
UnsafeShuffleWriterSuite.mergeSpillsWithFileStreamAndSnappy:389->testMergingSpills:337 » InvocationTarget UnsafeShuffleWriterSuite.mergeSpillsWithTransferToAndSnappy:384->testMergingSpills:337 » InvocationTarget {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13689) Move some methods in CatalystQl to a util object
[ https://issues.apache.org/jira/browse/SPARK-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13689: Assignee: Andrew Or (was: Apache Spark) > Move some methods in CatalystQl to a util object > > > Key: SPARK-13689 > URL: https://issues.apache.org/jira/browse/SPARK-13689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > When we add more DDL parsing logic in the future, SparkQl will become very > big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to > parse alter table commands. However, these parser objects will need to access > some helper methods that exist in CatalystQl. The proposal is to move those > methods to an isolated ParserUtils object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13689) Move some methods in CatalystQl to a util object
[ https://issues.apache.org/jira/browse/SPARK-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180673#comment-15180673 ] Apache Spark commented on SPARK-13689: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11529 > Move some methods in CatalystQl to a util object > > > Key: SPARK-13689 > URL: https://issues.apache.org/jira/browse/SPARK-13689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > > When we add more DDL parsing logic in the future, SparkQl will become very > big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to > parse alter table commands. However, these parser objects will need to access > some helper methods that exist in CatalystQl. The proposal is to move those > methods to an isolated ParserUtils object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13689) Move some methods in CatalystQl to a util object
[ https://issues.apache.org/jira/browse/SPARK-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13689: Assignee: Apache Spark (was: Andrew Or) > Move some methods in CatalystQl to a util object > > > Key: SPARK-13689 > URL: https://issues.apache.org/jira/browse/SPARK-13689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > > When we add more DDL parsing logic in the future, SparkQl will become very > big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to > parse alter table commands. However, these parser objects will need to access > some helper methods that exist in CatalystQl. The proposal is to move those > methods to an isolated ParserUtils object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13689) Move some methods in CatalystQl to a util object
Andrew Or created SPARK-13689: - Summary: Move some methods in CatalystQl to a util object Key: SPARK-13689 URL: https://issues.apache.org/jira/browse/SPARK-13689 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Andrew Or Assignee: Andrew Or When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
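The refactoring proposed above — pulling shared helpers out of the big parser class into a stateless util object — can be sketched as follows (the `ParserUtils` name comes from the proposal, but the `unquote` helper and `Demo` harness are invented for illustration):

```java
// Stateless holder for helpers that several parser objects need,
// so none of them has to depend on the large parser class itself.
final class ParserUtils {
    private ParserUtils() {} // no instances; static helpers only

    static String unquote(String s) {
        // Strip one layer of surrounding double quotes, if present.
        return (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\""))
                ? s.substring(1, s.length() - 1)
                : s;
    }
}

public class Demo {
    public static void main(String[] args) {
        // Any helper parser (e.g. one for ALTER TABLE commands) can call
        // the shared utility directly.
        System.out.println(ParserUtils.unquote("\"alter table\""));
    }
}
```

Keeping the util object free of state is what makes it safe to share across the helper parsers without entangling them.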
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180663#comment-15180663 ] Mark Grover commented on SPARK-13670: - I have a Mac, I can do some more testing. > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > There's a particular snippet in spark-class > [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell. > {code} > # The launcher library will print arguments separated by a NULL character, to > allow arguments with > # characters that would be otherwise interpreted by the shell. Read that in a > while loop, populating > # an array that will be used to exec the final command. > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main > "$@") > {code} > The problem is that if the launcher Main fails, this code still > returns success and continues, even though the top level script is marked > {{set -e}}. This is because the launcher.Main is run within a subshell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180643#comment-15180643 ] Marcelo Vanzin commented on SPARK-13670: Here's one that works regardless of the output of Main. It would be nice to have someone try this on MacOS. {code} # The launcher library will print arguments separated by a NULL character, to allow arguments with # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating # an array that will be used to exec the final command. # # The exit code of the launcher is appended to the output, so the parent shell removes it from the # command array and checks the value to see if the launcher succeeded. build_command() { "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" printf "%d\0" $? } CMD=() while IFS= read -d '' -r ARG; do CMD+=("$ARG") done < <(build_command "$@") COUNT=${#CMD[@]} LAST=$((COUNT - 1)) EC=${CMD[$LAST]} if [ $EC != 0 ]; then exit $EC fi CMD=("${CMD[@]:0:$LAST}") exec "${CMD[@]}" {code} > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > There's a particular snippet in spark-class > [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell. > {code} > # The launcher library will print arguments separated by a NULL character, to > allow arguments with > # characters that would be otherwise interpreted by the shell. Read that in a > while loop, populating > # an array that will be used to exec the final command. 
> CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main > "$@") > {code} > The problem is that if the launcher Main fails, this code still > returns success and continues, even though the top level script is marked > {{set -e}}. This is because the launcher.Main is run within a subshell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
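[Editor's note] The NUL-delimited handshake between launcher.Main and spark-class, including the exit-code suffix Marcelo Vanzin proposes above, can be modeled in plain Java. This is a sketch with hypothetical helper names, not Spark's actual org.apache.spark.launcher.Main: the child emits each argument followed by a NUL byte and appends its exit status as a final NUL-terminated token; the parent splits on NUL and peels off the last token before exec'ing the rest.

```java
import java.util.Arrays;

// Sketch of the NUL-delimited command protocol discussed above.
// Hypothetical names; not the real Spark launcher code.
public class LauncherProtocolSketch {

    // Child side: emit each argument NUL-terminated, then the exit code.
    static String emit(int exitCode, String... args) {
        StringBuilder sb = new StringBuilder();
        for (String a : args) sb.append(a).append('\0');
        sb.append(exitCode).append('\0');
        return sb.toString();
    }

    // Parent side: everything but the last NUL-separated token is the
    // command to exec.
    static String[] parseCommand(String payload) {
        String[] tokens = payload.split("\0");
        return Arrays.copyOf(tokens, tokens.length - 1);
    }

    // The last token is the launcher's exit status.
    static int parseExitCode(String payload) {
        String[] tokens = payload.split("\0");
        return Integer.parseInt(tokens[tokens.length - 1]);
    }

    public static void main(String[] unused) {
        String payload = emit(0, "java", "-cp", "app.jar", "Main");
        System.out.println(Arrays.toString(parseCommand(payload))
            + " exitCode=" + parseExitCode(payload));
    }
}
```

NUL as the separator is what lets arguments containing spaces or shell metacharacters survive intact, which is why the shell side reads with `read -d ''` instead of relying on word splitting.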
[jira] [Commented] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
[ https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180636#comment-15180636 ] Apache Spark commented on SPARK-13688: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/11528 > Add option to use dynamic allocation even if spark.executor.instances is set. > - > > Key: SPARK-13688 > URL: https://issues.apache.org/jira/browse/SPARK-13688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > When both spark.dynamicAllocation.enabled and spark.executor.instances are > set, dynamic resource allocation is disabled (see SPARK-9092). This is a > reasonable default, but I think there should be a configuration property to > override it because it isn't obvious to users that dynamic allocation and > number of executors are mutually exclusive. We see users setting > --num-executors because that looks like what they want: a way to get more > executors. > I propose adding a new boolean property, > spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation > the default when both are set and uses --num-executors as the minimum number > of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
[ https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13688: Assignee: (was: Apache Spark) > Add option to use dynamic allocation even if spark.executor.instances is set. > - > > Key: SPARK-13688 > URL: https://issues.apache.org/jira/browse/SPARK-13688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > When both spark.dynamicAllocation.enabled and spark.executor.instances are > set, dynamic resource allocation is disabled (see SPARK-9092). This is a > reasonable default, but I think there should be a configuration property to > override it because it isn't obvious to users that dynamic allocation and > number of executors are mutually exclusive. We see users setting > --num-executors because that looks like what they want: a way to get more > executors. > I propose adding a new boolean property, > spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation > the default when both are set and uses --num-executors as the minimum number > of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
[ https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13688: Assignee: Apache Spark > Add option to use dynamic allocation even if spark.executor.instances is set. > - > > Key: SPARK-13688 > URL: https://issues.apache.org/jira/browse/SPARK-13688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Ryan Blue >Assignee: Apache Spark > > When both spark.dynamicAllocation.enabled and spark.executor.instances are > set, dynamic resource allocation is disabled (see SPARK-9092). This is a > reasonable default, but I think there should be a configuration property to > override it because it isn't obvious to users that dynamic allocation and > number of executors are mutually exclusive. We see users setting > --num-executors because that looks like what they want: a way to get more > executors. > I propose adding a new boolean property, > spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation > the default when both are set and uses --num-executors as the minimum number > of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
[ https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-13688: -- Affects Version/s: 1.6.0 > Add option to use dynamic allocation even if spark.executor.instances is set. > - > > Key: SPARK-13688 > URL: https://issues.apache.org/jira/browse/SPARK-13688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > When both spark.dynamicAllocation.enabled and spark.executor.instances are > set, dynamic resource allocation is disabled (see SPARK-9092). This is a > reasonable default, but I think there should be a configuration property to > override it because it isn't obvious to users that dynamic allocation and > number of executors are mutually exclusive. We see users setting > --num-executors because that looks like what they want: a way to get more > executors. > I propose adding a new boolean property, > spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation > the default when both are set and uses --num-executors as the minimum number > of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13687) Cleanup pyspark temporary files
Damir created SPARK-13687: - Summary: Cleanup pyspark temporary files Key: SPARK-13687 URL: https://issues.apache.org/jira/browse/SPARK-13687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.6.0, 1.5.2 Reporter: Damir Every time parallelize is called, it creates a temporary file for the RDD in the spark.local.dir/spark-uuid/pyspark-uuid/ directory. This directory is deleted when the context is closed, but for long-running applications with a permanently open context the directory grows without bound. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
[ https://issues.apache.org/jira/browse/SPARK-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-13688: -- Component/s: YARN > Add option to use dynamic allocation even if spark.executor.instances is set. > - > > Key: SPARK-13688 > URL: https://issues.apache.org/jira/browse/SPARK-13688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > When both spark.dynamicAllocation.enabled and spark.executor.instances are > set, dynamic resource allocation is disabled (see SPARK-9092). This is a > reasonable default, but I think there should be a configuration property to > override it because it isn't obvious to users that dynamic allocation and > number of executors are mutually exclusive. We see users setting > --num-executors because that looks like what they want: a way to get more > executors. > I propose adding a new boolean property, > spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation > the default when both are set and uses --num-executors as the minimum number > of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13688) Add option to use dynamic allocation even if spark.executor.instances is set.
Ryan Blue created SPARK-13688: - Summary: Add option to use dynamic allocation even if spark.executor.instances is set. Key: SPARK-13688 URL: https://issues.apache.org/jira/browse/SPARK-13688 Project: Spark Issue Type: Bug Reporter: Ryan Blue When both spark.dynamicAllocation.enabled and spark.executor.instances are set, dynamic resource allocation is disabled (see SPARK-9092). This is a reasonable default, but I think there should be a configuration property to override it because it isn't obvious to users that dynamic allocation and number of executors are mutually exclusive. We see users setting --num-executors because that looks like what they want: a way to get more executors. I propose adding a new boolean property, spark.dynamicAllocation.overrideNumExecutors, that makes dynamic allocation the default when both are set and uses --num-executors as the minimum number of executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
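[Editor's note] The precedence rule proposed in SPARK-13688 can be captured as a small decision function. This is a plain-Java model of the behavior described above, not Spark code; `spark.dynamicAllocation.overrideNumExecutors` is the property name proposed in this issue, not an existing Spark configuration:

```java
import java.util.Map;

// Models the proposed interaction of dynamicAllocation.enabled,
// executor.instances, and the new override flag (SPARK-13688 proposal).
public class DynamicAllocationPolicy {

    // Returns true when dynamic allocation should be active.
    static boolean dynamicAllocationActive(Map<String, String> conf) {
        boolean daEnabled = Boolean.parseBoolean(
            conf.getOrDefault("spark.dynamicAllocation.enabled", "false"));
        boolean numExecutorsSet = conf.containsKey("spark.executor.instances");
        boolean override = Boolean.parseBoolean(
            conf.getOrDefault("spark.dynamicAllocation.overrideNumExecutors", "false"));
        // Current behavior (SPARK-9092): an explicit executor count wins and
        // disables dynamic allocation. Proposed: the override flag restores
        // dynamic allocation, with --num-executors reinterpreted as the
        // minimum number of executors.
        return daEnabled && (!numExecutorsSet || override);
    }

    public static void main(String[] args) {
        // Prints false: an explicit count wins when the override is unset.
        System.out.println(dynamicAllocationActive(
            Map.of("spark.dynamicAllocation.enabled", "true",
                   "spark.executor.instances", "10")));
    }
}
```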
[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor
[ https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180575#comment-15180575 ] holdenk commented on SPARK-13684: - It should (coverity doesn't trigger on any of the places where we use AtomicLong), although getting rid of the volatile keyword should also remove the warning (looking at the netty docs http://netty.io/wiki/new-and-noteworthy-in-4.0.html#wiki-h2-34 "A user does not need to define a volatile field to keep the state of a handler.") > Possible unsafe bytesRead increment in StreamInterceptor > > > Key: SPARK-13684 > URL: https://issues.apache.org/jira/browse/SPARK-13684 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > We unsafely increment a volatile (bytesRead) in a call back, if two call > backs are triggered we may under count bytesRead. This issue was found using > coverity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13686) Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13686: Assignee: Apache Spark > Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD > > > Key: SPARK-13686 > URL: https://issues.apache.org/jira/browse/SPARK-13686 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not > have `regParam` as a constructor argument. They just depend on > GradientDescent's default regParam value. > To be consistent with other algorithms, we should add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13686) Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13686: Assignee: (was: Apache Spark) > Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD > > > Key: SPARK-13686 > URL: https://issues.apache.org/jira/browse/SPARK-13686 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Dongjoon Hyun >Priority: Minor > > `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not > have `regParam` as a constructor argument. They just depend on > GradientDescent's default regParam value. > To be consistent with other algorithms, we should add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13686) Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180567#comment-15180567 ] Apache Spark commented on SPARK-13686: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11527 > Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD > > > Key: SPARK-13686 > URL: https://issues.apache.org/jira/browse/SPARK-13686 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Dongjoon Hyun >Priority: Minor > > `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not > have `regParam` as a constructor argument. They just depend on > GradientDescent's default regParam value. > To be consistent with other algorithms, we should add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7179) Add pattern after "show tables" to filter desire tablename
[ https://issues.apache.org/jira/browse/SPARK-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7179: Target Version/s: 2.0.0 > Add pattern after "show tables" to filter desire tablename > -- > > Key: SPARK-7179 > URL: https://issues.apache.org/jira/browse/SPARK-7179 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: baishuo >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13686) Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13686: -- Component/s: MLlib > Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD > > > Key: SPARK-13686 > URL: https://issues.apache.org/jira/browse/SPARK-13686 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Dongjoon Hyun >Priority: Minor > > `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not > have `regParam` as a constructor argument. They just depend on > GradientDescent's default regParam value. > To be consistent with other algorithms, we should add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13686) Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-13686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13686: -- Component/s: Streaming > Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD > > > Key: SPARK-13686 > URL: https://issues.apache.org/jira/browse/SPARK-13686 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Dongjoon Hyun >Priority: Minor > > `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not > have `regParam` as a constructor argument. They just depend on > GradientDescent's default regParam value. > To be consistent with other algorithms, we should add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13686) Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD
Dongjoon Hyun created SPARK-13686: - Summary: Add a constructor parameter `regParam` to (Streaming)LinearRegressionWithSGD Key: SPARK-13686 URL: https://issues.apache.org/jira/browse/SPARK-13686 Project: Spark Issue Type: Bug Reporter: Dongjoon Hyun Priority: Minor `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have `regParam` as a constructor argument. They just depend on GradientDescent's default regParam value. To be consistent with other algorithms, we should add it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor
[ https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180554#comment-15180554 ] Marcelo Vanzin commented on SPARK-13684: If switching to AtomicLong gets rid of the warning, that's fine. > Possible unsafe bytesRead increment in StreamInterceptor > > > Key: SPARK-13684 > URL: https://issues.apache.org/jira/browse/SPARK-13684 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > We unsafely increment a volatile (bytesRead) in a call back, if two call > backs are triggered we may under count bytesRead. This issue was found using > coverity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13685: Assignee: Apache Spark (was: Andrew Or) > Rename catalog.Catalog to ExternalCatalog > - > > Key: SPARK-13685 > URL: https://issues.apache.org/jira/browse/SPARK-13685 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180551#comment-15180551 ] Apache Spark commented on SPARK-13685: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11526 > Rename catalog.Catalog to ExternalCatalog > - > > Key: SPARK-13685 > URL: https://issues.apache.org/jira/browse/SPARK-13685 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-13685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13685: Assignee: Andrew Or (was: Apache Spark) > Rename catalog.Catalog to ExternalCatalog > - > > Key: SPARK-13685 > URL: https://issues.apache.org/jira/browse/SPARK-13685 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor
[ https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180550#comment-15180550 ] holdenk commented on SPARK-13684: - We can use AtomicLong (so no need for using Doubles). > Possible unsafe bytesRead increment in StreamInterceptor > > > Key: SPARK-13684 > URL: https://issues.apache.org/jira/browse/SPARK-13684 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > We unsafely increment a volatile (bytesRead) in a call back, if two call > backs are triggered we may under count bytesRead. This issue was found using > coverity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor
[ https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180549#comment-15180549 ] Marcelo Vanzin commented on SPARK-13684: I don't think it could break unless netty changes the contract significantly. I find that highly unlikely to happen, since there's not much sense in processing different buffers read from the same socket concurrently. > Possible unsafe bytesRead increment in StreamInterceptor > > > Key: SPARK-13684 > URL: https://issues.apache.org/jira/browse/SPARK-13684 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > We unsafely increment a volatile (bytesRead) in a call back, if two call > backs are triggered we may under count bytesRead. This issue was found using > coverity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor
[ https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180544#comment-15180544 ] Sean Owen commented on SPARK-13684: --- The alternative is AtomicDouble I suppose. If that's true, it's not necessary indeed. Is it worth it for future proofing or is this just not something that could break? then it can be dismissed as a false positive. > Possible unsafe bytesRead increment in StreamInterceptor > > > Key: SPARK-13684 > URL: https://issues.apache.org/jira/browse/SPARK-13684 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > We unsafely increment a volatile (bytesRead) in a call back, if two call > backs are triggered we may under count bytesRead. This issue was found using > coverity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
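[Editor's note] The race behind the Coverity warning, and the AtomicLong fix discussed in this thread, can be sketched in plain Java (a standalone illustration, not Spark's StreamInterceptor): `volatile` makes writes visible across threads, but `bytesRead += n` is still a read-modify-write, so two concurrent callbacks can each read the same old value and one increment is lost; `AtomicLong.addAndGet` performs the whole update atomically.

```java
import java.util.concurrent.atomic.AtomicLong;

// Counting bytes from concurrent callbacks with AtomicLong instead of
// a volatile long, as discussed above.
public class ByteCounter {

    private final AtomicLong bytesRead = new AtomicLong();

    // What a callback would do on each chunk of data.
    void onData(int length) {
        bytesRead.addAndGet(length);  // atomic read-modify-write
    }

    long total() {
        return bytesRead.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ByteCounter c = new ByteCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) c.onData(1);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // With `volatile long bytesRead; bytesRead += 1;` this could print
        // less than 400000; with AtomicLong it is always exactly 400000.
        System.out.println(c.total());
    }
}
```

As the thread notes, this only matters if netty can actually invoke the callback concurrently; if a single channel's handlers run serially, the volatile is a false positive, and AtomicLong merely future-proofs the counter.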
[jira] [Updated] (SPARK-13572) HiveContext reads avro Hive tables incorrectly
[ https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Fedor updated SPARK-13572: - Affects Version/s: 1.6.0 > HiveContext reads avro Hive tables incorrectly > --- > > Key: SPARK-13572 > URL: https://issues.apache.org/jira/browse/SPARK-13572 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2, 1.6.0 > Environment: Hive 0.13.1, Spark 1.5.2 >Reporter: Zoltan Fedor > Attachments: logs, table_definition > > > I am using PySpark to read avro-based tables from Hive and while the avro > tables can be read, some of the columns are incorrectly read - showing value > "None" instead of the actual value. > >>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest > >>> where year=2016 and month=2 and day=29 limit 3""") > >>> results_df.take(3) > [Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)] > Observe the "None" values at most of the fields. Surprisingly not all fields, > only some of them are showing "None" instead of the real values. The table > definition does not show anything specific about these columns. 
> Running the same query in Hive: > c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where > year=2016 and month=2 and day=29 limit 3; > [wide beeline result table, truncated in the original message; it returns the same rows with actual values rather than NULLs, e.g. kafkaoffsetgeneration=11.0, kafkapartition=0.0, kafkaoffset=3.83399394E8, uuid=EF0D03C409681B98646F316CA1088973, mid=174f53fb-ca9b-d3f9-64e1-7631bf906817, product=est, utctime=2016-01-13T06:58:19, statcode=8, statvalue=3.0 SP11 (8.110.7601.18923), displayname=MSXML 3.0 Version, category=PC Information, source_filename=ops-20160228_23_35_01.gz, year=2016, month=2, day=29]
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180535#comment-15180535 ] Marcelo Vanzin commented on SPARK-13670: bq. And, it's guaranteed that Main would never return that output? I'm reasonably sure it won't. I just tried spark-shell, pyspark and sparkR and they all work (since they all generate multiple command line arguments). But if worried about that, you could always append the exit code to the output of {{build_command}} by always executing the printf. Then just remove the exit code in the parent shell, something like: {code} COUNT=${#CMD[@]} EC=${CMD[$((COUNT - 1))]} CMD=("${CMD[@]:0:$((COUNT - 1))}") {code} Not 100% sure about the syntax, but seems to work in a quick test. > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > There's a particular snippet in spark-class > [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell. > {code} > # The launcher library will print arguments separated by a NULL character, to > allow arguments with > # characters that would be otherwise interpreted by the shell. Read that in a > while loop, populating > # an array that will be used to exec the final command. > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main > "$@") > {code} > The problem is that if the launcher Main fails, this code still > returns success and continues, even though the top level script is marked > {{set -e}}. This is because the launcher.Main is run within a subshell. 
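The strip-the-exit-code idea above can be sketched end to end in plain bash. This is an illustrative sketch, not Spark's actual script: {{fake_launcher}} is a hypothetical stand-in for the real `"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main` invocation.

```shell
#!/usr/bin/env bash
# Hedged sketch of the "always append the exit code" idea discussed above.
# fake_launcher is a hypothetical stand-in for the real launcher.

fake_launcher() {
  # Emit NUL-separated arguments, the way the real launcher does.
  printf '%s\0' echo hello world
}

build_command() {
  fake_launcher "$@"
  # Always append the exit code as one more NUL-separated token.
  printf '%d\0' $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

# Strip the exit code back off: bash arrays are zero-indexed, so the
# last element lives at index COUNT-1.
COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
EC=${CMD[$LAST]}
CMD=("${CMD[@]:0:$LAST}")

if [ "$EC" -ne 0 ]; then
  exit "$EC"
fi
"${CMD[@]}"   # runs: echo hello world
```

If the launcher failed before printing any arguments, {{CMD}} would hold only the nonzero exit code, and the script would propagate it instead of exec'ing a half-built command.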
[jira] [Created] (SPARK-13685) Rename catalog.Catalog to ExternalCatalog
Andrew Or created SPARK-13685: - Summary: Rename catalog.Catalog to ExternalCatalog Key: SPARK-13685 URL: https://issues.apache.org/jira/browse/SPARK-13685 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Andrew Or Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180516#comment-15180516 ] Mark Grover commented on SPARK-13670:
And, it's guaranteed that Main would never return that output? I feel like we may be overloading too much here, but again, I don't have any non-ugly ideas either. We may just have to choose our poison.
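The underlying behavior being debated can be reproduced in isolation. This is an illustrative repro, not Spark code: under {{set -e}}, a failure inside a process substitution is invisible to the parent shell, because the failing command runs in a subshell whose exit status is never checked.

```shell
#!/usr/bin/env bash
# Illustrative repro of the bug (not Spark code): under `set -e`, a failure
# inside a process substitution does not stop the parent script.
set -e

failing_launcher() {
  # Hypothetical stand-in for org.apache.spark.launcher.Main failing.
  return 1
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(failing_launcher)

# Control still reaches this point: CMD is empty and the launcher's
# failure was silently swallowed by the subshell.
REACHED_AFTER_FAILURE=yes
```

This is why spark-class must check the launcher's status explicitly; {{set -e}} alone cannot see it.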
[jira] [Commented] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor
[ https://issues.apache.org/jira/browse/SPARK-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180515#comment-15180515 ] Marcelo Vanzin commented on SPARK-13684: It's not a real bug; netty guarantees that, on the read pipeline, a single thread is running the handlers. The volatile is there for paranoia, in case two back-to-back handler invocations happen on different threads. > Possible unsafe bytesRead increment in StreamInterceptor > > > Key: SPARK-13684 > URL: https://issues.apache.org/jira/browse/SPARK-13684 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: holdenk >Priority: Trivial > > We unsafely increment a volatile (bytesRead) in a call back, if two call > backs are triggered we may under count bytesRead. This issue was found using > coverity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13684) Possible unsafe bytesRead increment in StreamInterceptor
holdenk created SPARK-13684: --- Summary: Possible unsafe bytesRead increment in StreamInterceptor Key: SPARK-13684 URL: https://issues.apache.org/jira/browse/SPARK-13684 Project: Spark Issue Type: Bug Components: Spark Core Reporter: holdenk Priority: Trivial We unsafely increment a volatile (bytesRead) in a call back, if two call backs are triggered we may under count bytesRead. This issue was found using coverity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180489#comment-15180489 ] Marcelo Vanzin commented on SPARK-13670:
Here's something a little ugly that works, though:
{code}
build_command() {
  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  EC=$?
  if [ $EC -ne 0 ]; then
    printf "%d\0" $EC
  fi
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

if [ ${#CMD[@]} -eq 1 ]; then
  exit ${CMD[0]}
fi

exec "${CMD[@]}"
{code}
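The one-element-array convention above can be exercised with a fake launcher. Everything here is illustrative: {{fake_launcher}} is a hypothetical stand-in for `"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main`, and the sketch records the code in a variable instead of exiting so the effect is visible.

```shell
#!/usr/bin/env bash
# Exercises the snippet above with a hypothetical fake_launcher in place
# of the real launcher invocation.

fake_launcher() {
  # Simulate the launcher failing before emitting any arguments.
  echo 'Exception in thread "main" java.lang.IllegalArgumentException' >&2
  return 1
}

build_command() {
  fake_launcher "$@"
  EC=$?
  if [ $EC -ne 0 ]; then
    printf '%d\0' $EC
  fi
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

# A one-element array can only be the appended exit code, so propagate it
# instead of exec'ing. (Recorded here rather than `exit`, for clarity.)
if [ ${#CMD[@]} -eq 1 ]; then
  LAUNCHER_EXIT_CODE=${CMD[0]}
else
  LAUNCHER_EXIT_CODE=0
fi
```

This also makes Mark's concern concrete: a successful launcher that printed exactly one argument would be indistinguishable from a failure code, which is why the follow-up suggestion appends the exit code unconditionally.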
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180479#comment-15180479 ] Marcelo Vanzin commented on SPARK-13670:
Not a fan... you have to write the file somewhere, clean it up, worry about who can read it, all sorts of stuff. Better to avoid it.
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180463#comment-15180463 ] Mark Grover commented on SPARK-13670:
Thanks for looking at this. What do you think of #2 - writing the output to a file?
[jira] [Created] (SPARK-13683) Finalize the public interface for OutputWriter[Factory]
Michael Armbrust created SPARK-13683: Summary: Finalize the public interface for OutputWriter[Factory] Key: SPARK-13683 URL: https://issues.apache.org/jira/browse/SPARK-13683 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust We need to at least remove bucketing. I would also like to remove {{Job}} and the configuration stuff as well if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout
[ https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180456#comment-15180456 ] Anthony Brew commented on SPARK-12675:
I'm hitting a similar bug (same cast exceptions)...
{noformat}
java.io.IOException: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.HashMap$SerializationProxy to field org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type scala.collection.immutable.Map in instance of org.apache.spark.executor.TaskMetrics
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
    at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
    at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
{noformat}
> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
> Reporter: Alexandru Rosianu
> Priority: Minor
>
> I'm trying to fit a Spark ML pipeline but my executor dies.
> Here's the script which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then it starts fitting; the executor dies, the driver doesn't receive heartbeats from the executor, and it times out, then the script exits. It doesn't get past that line.
> This is the exception that kills the executor: > {code} > java.io.IOException: java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.HashMap$SerializationProxy to field > org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type > scala.collection.immutable.Map in instance of > org.apache.spark.executor.TaskMetrics > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) > at > org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractI
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180455#comment-15180455 ] Marcelo Vanzin commented on SPARK-13670:
Actually, scrap that; it breaks things when the spark-shell actually runs... back to the drawing board.
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180444#comment-15180444 ] Marcelo Vanzin commented on SPARK-13670:
Note that this will probably leave a bash process running somewhere alongside the Spark JVM, so it would probably need tweaks to avoid that... bash is fun.
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180427#comment-15180427 ] Marcelo Vanzin commented on SPARK-13670:
After some fun playing with arcane bash syntax, here's something that worked for me:
{code}
run_command() {
  CMD=()
  while IFS='' read -d '' -r ARG; do
    echo "line: $ARG"
    CMD+=("$ARG")
  done

  if [ ${#CMD[@]} -gt 0 ]; then
    exec "${CMD[@]}"
  fi
}

set -o pipefail
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" | run_command
{code}
Example:
{noformat}
$ ./bin/spark-shell
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Exception in thread "main" java.lang.IllegalArgumentException: Testing, testing, testing...
        at org.apache.spark.launcher.Main.main(Main.java:93)
$ echo $?
1
{noformat}
[jira] [Created] (SPARK-13682) Finalize the public API for FileFormat
Michael Armbrust created SPARK-13682: Summary: Finalize the public API for FileFormat Key: SPARK-13682 URL: https://issues.apache.org/jira/browse/SPARK-13682 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust The current file format interface needs to be cleaned up before its acceptable for public consumption: - Have a version that takes Row and does a conversion, hide the internal API. - Remove bucketing - Remove RDD and the broadcastedConf - Remove SQLContext (maybe include SparkSession?) - Pass a better conf object -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13681) Reimplement CommitFailureTestRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-13681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13681: - Description: This test case got broken by [#11509|https://github.com/apache/spark/pull/11509]. We should reimplement it as a format. > Reimplement CommitFailureTestRelationSuite > -- > > Key: SPARK-13681 > URL: https://issues.apache.org/jira/browse/SPARK-13681 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Priority: Blocker > > This test case got broken by > [#11509|https://github.com/apache/spark/pull/11509]. We should reimplement > it as a format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13681) Reimplement CommitFailureTestRelationSuite
Michael Armbrust created SPARK-13681: Summary: Reimplement CommitFailureTestRelationSuite Key: SPARK-13681 URL: https://issues.apache.org/jira/browse/SPARK-13681 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package
[ https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-13633. --- Resolution: Fixed Fix Version/s: 2.0.0 > Move parser classes to o.a.s.sql.catalyst.parser package > > > Key: SPARK-13633 > URL: https://issues.apache.org/jira/browse/SPARK-13633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13494) Cannot sort on a column which is of type "array"
[ https://issues.apache.org/jira/browse/SPARK-13494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180289#comment-15180289 ] Xiao Li commented on SPARK-13494:
Can you try a newer version? A lot of issues have been fixed in each release.
> Cannot sort on a column which is of type "array"
> 
>
> Key: SPARK-13494
> URL: https://issues.apache.org/jira/browse/SPARK-13494
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Yael Aharon
>
> Executing the following SQL results in an error if columnName refers to a column of type array
> SELECT * FROM myTable ORDER BY columnName ASC LIMIT 50
> The error is
> org.apache.spark.sql.AnalysisException: cannot resolve 'columnName ASC' due to data type mismatch: cannot sort data type array
[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yael Aharon updated SPARK-13680: Attachment: setup.hql > Java UDAF with more than one intermediate argument returns wrong results > > > Key: SPARK-13680 > URL: https://issues.apache.org/jira/browse/SPARK-13680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.2 >Reporter: Yael Aharon > Attachments: data.csv, setup.hql > > > I am trying to incorporate the Java UDAF from > https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java > into an SQL query. > I registered the UDAF like this: > sqlContext.udf().register("myavg", new MyDoubleAvg()); > My SQL query is: > SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, > AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS > `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS > `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS > `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS > `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS > `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS > `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS > `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, > AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, > SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, > MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS > `count_all`, count(nulli) AS `count_nulli` FROM mytable > As soon as I add the UDAF myavg to the SQL, all the results become incorrect. > When I remove the call to the UDAF, the results are correct. 
> I was able to work around the issue by modifying the bufferSchema of the UDAF to use an array, along with the corresponding update and merge methods.
[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180243#comment-15180243 ] Joseph K. Bradley commented on SPARK-13434:
There are a few options here:
* Temp fix: Reduce the number of executors, as you suggested.
* Long-term for this RF implementation: Implement local training for deep trees. Spilling the current tree to disk would help, but I'd guess that local training would have a bigger impact.
* Long-term fix via a separate RF implementation: I've been working for a long time on a column-partitioned implementation which will be better for tasks like yours with many features & deep trees. It's making progress but not yet ready to merge into Spark.
> Reduce Spark RandomForest memory footprint
> --
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.6.0
> Environment: Linux
> Reporter: Ewan Higgs
> Labels: decisiontree, mllib, randomforest
> Attachments: heap-usage.log, rf-heap-usage.png
>
> The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking game on github (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the RandomForest training on largish datasets on machines with 64G memory and the following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell --driver-memory 30G --executor-memory 30G}} and have a heap profile from a single machine by running {{jmap -histo:live }}.
> I took a sample every 5 seconds and at the peak it looks like this (some class names were lost in the original paste and are left blank):
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128
>   16:        268887       34431568
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:           110           2640  org.apache.spark.mllib.regression.LabeledPoint
>   19:           110           2640  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864
>   21:           100           2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:           100           2400  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864
>   25:         17023       14380352
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}
[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180241#comment-15180241 ] Joseph K. Bradley commented on SPARK-13048:
I'd say the best fix would be to add an option to LDA to not delete the last checkpoint. I'd prefer to expose this as a Param in the spark.ml API, but it could be added to the spark.mllib API as well if necessary.
[~holdenk] I agree we need to figure out how to handle/control caching and checkpointing within Pipelines, but that will have to wait until after 2.0.
[~jvstein] We try to minimize the public API. Although I agree with you about opening up APIs in principle, it has proven dangerous in practice. Even when we mark things DeveloperApi, many users still use those APIs, making it difficult to change them in the future.
> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
> Reporter: Jeff Stein
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved environment. The model consistently failed in either the {{collect at LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the model). In both cases, a FileNotFoundException is thrown attempting to access a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature > change. An alternative simple fix is to leave the last checkpoint around and > expect the user to clean the checkpoint directory themselves. > {noformat} > java.io.FileNotFoundException: File does not exist: > /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071 > {noformat} > Relevant code is included below. > LDAOptimizer.scala: > {noformat} > override private[clustering] def getLDAModel(iterationTimes: > Array[Double]): LDAModel = { > require(graph != null, "graph is null, EMLDAOptimizer not initialized.") > this.graphCheckpointer.deleteAllCheckpoints() > // The constructor's default arguments assume gammaShape = 100 to ensure > equivalence in > // LDAModel.toLocal conversion > new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, > this.vocabSize, > Vectors.dense(Array.fill(this.k)(this.docConcentration)), > this.topicConcentration, > iterationTimes) > } > {noformat} > PeriodicCheckpointer.scala > {noformat} > /** >* Call this at the end to delete any remaining checkpoint files. >*/ > def deleteAllCheckpoints(): Unit = { > while (checkpointQueue.nonEmpty) { > removeCheckpointFile() > } > } > /** >* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files. >* This prints a warning but does not fail if the files cannot be removed. >*/ > private def removeCheckpointFile(): Unit = { > val old = checkpointQueue.dequeue() > // Since the old checkpoint is not deleted by Spark, we manually delete > it. 
> val fs = FileSystem.get(sc.hadoopConfiguration) > getCheckpointFiles(old).foreach { checkpointFile => > try { > fs.delete(new Path(checkpointFile), true) > } catch { > case e: Exception => > logWarning("PeriodicCheckpointer could not remove old checkpoint > file: " + > checkpointFile) > } > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
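The "leave the last checkpoint around" option discussed in this thread could be sketched roughly as follows. This is a hypothetical sketch, not Spark's actual API: the {{CheckpointerSketch}} class and the {{keepLast}} parameter are assumptions, with {{checkpointQueue}} and {{removeCheckpointFile}} modeled on the {{PeriodicCheckpointer}} snippet quoted above.

```scala
import scala.collection.mutable

// Hypothetical sketch of "leave the last checkpoint around": delete all but
// the most recent checkpoint, so a DistributedLDAModel that still depends on
// it can survive late worker failures. Names and keepLast are assumptions,
// not Spark's actual PeriodicCheckpointer API.
class CheckpointerSketch {
  private val checkpointQueue = mutable.Queue[String]()

  def addCheckpoint(path: String): Unit = checkpointQueue.enqueue(path)

  // Real code would also delete the dequeued path from the Hadoop FileSystem.
  private def removeCheckpointFile(): Unit = {
    checkpointQueue.dequeue()
    ()
  }

  // keepLast = true retains the newest checkpoint instead of deleting them all.
  def deleteAllCheckpoints(keepLast: Boolean = false): Unit = {
    val retain = if (keepLast) 1 else 0
    while (checkpointQueue.size > retain) removeCheckpointFile()
  }

  def remaining: List[String] = checkpointQueue.toList
}
```

With {{keepLast = true}}, the model's last checkpoint file survives {{getLDAModel}}, at the cost of expecting the user to clean the checkpoint directory themselves, as the issue suggests.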
[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark
[ https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180234#comment-15180234 ] Sean Owen commented on SPARK-13230: --- Thanks, that's a great analysis. It sounds like we might need to close this as a Scala problem, and offer a workaround. For example, it's obviously possible to write a little function that accomplishes the same thing, and which I hope doesn't depend on serializing the same internal representation. (PS: JIRA does not use markdown. Use pairs of curly braces to {{format as code}}.) > HashMap.merged not working properly with Spark > -- > > Key: SPARK-13230 > URL: https://issues.apache.org/jira/browse/SPARK-13230 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0 >Reporter: Alin Treznai > > Using HashMap.merged with Spark fails with NullPointerException. > {noformat} > import org.apache.spark.{SparkConf, SparkContext} > import scala.collection.immutable.HashMap > object MergeTest { > def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => > HashMap[String, Long] = { > case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) } > } > def main(args: Array[String]) = { > val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> > 3L),HashMap("A" -> 2L, "C" -> 4L)) > val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]") > val sc = new SparkContext(conf) > val result = sc.parallelize(input).reduce(mergeFn) > println(s"Result=$result") > sc.stop() > } > } > {noformat} > Error message: > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) > at MergeTest$.main(MergeTest.scala:21) > at MergeTest.main(MergeTest.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > Caused by: java.lang.NullPointerException > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148) > at > scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200) > at > scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322) > at > scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463) > at scala.collection.immutable.HashMap.merged(HashMap.scala:117) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017) > at > 
org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(
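The workaround Sean Owen suggests (a small function that accomplishes what {{merged}} does without depending on HashMap's internal representation) could be sketched like this. {{MergeWorkaround}} and {{mergeCounts}} are hypothetical names; the idea is to fold one map's entries into the other with {{updated}}/{{getOrElse}}, so no code path touches the {{HashMap1.kv}} field that is left null after deserialization.

```scala
import scala.collection.immutable.HashMap

object MergeWorkaround {
  // Hypothetical replacement for m1.merged(m2)(...): fold m2's entries into
  // m1, summing values for shared keys. Only updated/getOrElse are used, so
  // the merge never depends on the HashMap1.kv field that fails to survive
  // serialization in Scala 2.11.
  def mergeCounts(m1: HashMap[String, Long],
                  m2: HashMap[String, Long]): HashMap[String, Long] =
    m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
```

With the input from the original report, {{sc.parallelize(input).reduce(MergeWorkaround.mergeCounts)}} should produce {{Map(A -> 5, B -> 3, C -> 4)}} without ever calling {{merged}}.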
[jira] [Comment Edited] (SPARK-13230) HashMap.merged not working properly with Spark
[ https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180216#comment-15180216 ] Łukasz Gieroń edited comment on SPARK-13230 at 3/4/16 5:36 PM: --- The issue here is a bug in the Scala library, in the deserialization of `HashMap1` objects. When they get deserialized, the internal `kv` field does not get deserialized (it is left `null`), which causes a `NullPointerException` in `merged`. I've fixed this in the Scala library, and it resolves the issue. I'm going to open a bug against the Scala library and submit a pull request for it, and link that ticket here (if it's possible to link between Jiras). PS. Not sure why Jira doesn't recognize my backticks markdown. was (Author: lgieron): The issue here is a bug in the Scala library, in the deserialization of `HashMap1` objects. When they get deserialized, the internal `kv` field does not get deserialized (it is left `null`), which causes a `NullPointerException` in `merged`. I've fixed this in the Scala library, and it resolves the issue. I'm going to open a bug against the Scala library and submit a pull request for it, and link that ticket here (if it's possible to link between Jiras). > HashMap.merged not working properly with Spark > -- > > Key: SPARK-13230 > URL: https://issues.apache.org/jira/browse/SPARK-13230 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0 >Reporter: Alin Treznai > > Using HashMap.merged with Spark fails with NullPointerException. 
> {noformat} > import org.apache.spark.{SparkConf, SparkContext} > import scala.collection.immutable.HashMap > object MergeTest { > def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => > HashMap[String, Long] = { > case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) } > } > def main(args: Array[String]) = { > val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> > 3L),HashMap("A" -> 2L, "C" -> 4L)) > val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]") > val sc = new SparkContext(conf) > val result = sc.parallelize(input).reduce(mergeFn) > println(s"Result=$result") > sc.stop() > } > } > {noformat} > Error message: > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) > at MergeTest$.main(MergeTest.scala:21) > at MergeTest.main(MergeTest.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > Caused by: java.lang.NullPointerException > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148) > at > scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200) > at > scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322) > at > scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463) > at scala.collection.immutable.HashMap.merged(HashMap.scala:117) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020) > at > org.apache