[jira] [Created] (SPARK-18433) Improve `DataSource.scala` to be more case-insensitive
Dongjoon Hyun created SPARK-18433: - Summary: Improve `DataSource.scala` to be more case-insensitive Key: SPARK-18433 URL: https://issues.apache.org/jira/browse/SPARK-18433 Project: Spark Issue Type: Improvement Components: SQL Reporter: Dongjoon Hyun Priority: Minor Currently, `DataSource.scala` uses `CaseInsensitiveMap` in only part of its code path. This issue aims to make `DataSource` use `CaseInsensitiveMap` throughout, except when passing options to other modules (`InMemoryFileIndex` and `InsertIntoHadoopFsRelationCommand`), which create new case-sensitive `HadoopConf`s internally by calling `newHadoopConfWithOptions(options)`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
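For readers unfamiliar with the class, the behaviour `CaseInsensitiveMap` provides can be sketched in a few lines of Python (an illustration only; Spark's actual Scala `CaseInsensitiveMap` differs in detail):

```python
class CaseInsensitiveMap(dict):
    """Minimal sketch of an option map with case-insensitive string keys.

    Illustrative only; not Spark's Scala implementation.
    """

    def __init__(self, options=None):
        super().__init__()
        for key, value in (options or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        # Normalize keys on insertion so lookups ignore case.
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def __contains__(self, key):
        return super().__contains__(key.lower())


opts = CaseInsensitiveMap({"Path": "/tmp/data", "inferSchema": "true"})
assert opts["path"] == "/tmp/data"  # lookup ignores the original casing
assert "INFERSCHEMA" in opts
```

The point of the issue is that option handling inside `DataSource` should go through such a map consistently, while the options handed to the named modules stay untouched.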
[jira] [Created] (SPARK-18432) Fix HDFS block size in programming guide
Noritaka Sekiyama created SPARK-18432: - Summary: Fix HDFS block size in programming guide Key: SPARK-18432 URL: https://issues.apache.org/jira/browse/SPARK-18432 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.0.1 Reporter: Noritaka Sekiyama Priority: Minor http://spark.apache.org/docs/latest/programming-guide.html "By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS)" The default block size in HDFS is currently 128MB; it was already increased in Hadoop 2.2.0, the oldest Hadoop version Spark supports. https://issues.apache.org/jira/browse/HDFS-4053 Since this explanation is confusing, I'd like to fix the value from 64MB to 128MB.
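The relevance of the block size is easy to see with a back-of-the-envelope calculation: by default Spark creates roughly one partition per HDFS block. A hypothetical sketch of that arithmetic (not Spark's actual partitioning code):

```python
import math

def approx_partitions(file_size_bytes: int, block_size_bytes: int) -> int:
    """Roughly one partition per HDFS block (illustrative rule of thumb)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

ONE_GB = 1024 ** 3
# With the old 64MB default a 1GB file maps to 16 blocks/partitions;
# with the 128MB default of Hadoop 2.2.0+ it maps to 8.
assert approx_partitions(ONE_GB, 64 * 1024 * 1024) == 16
assert approx_partitions(ONE_GB, 128 * 1024 * 1024) == 8
```

So quoting the wrong block size in the guide halves or doubles a reader's expectation of the default parallelism.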
[jira] [Created] (SPARK-18431) Hard coded value in org.apache.spark.streaming.kinesis.KinesisReceiver
Shushant Arora created SPARK-18431: -- Summary: Hard coded value in org.apache.spark.streaming.kinesis.KinesisReceiver Key: SPARK-18431 URL: https://issues.apache.org/jira/browse/SPARK-18431 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.0.1 Reporter: Shushant Arora There is a hardcoded value of taskBackoffTimeMillis as 500 in the onStart method of org.apache.spark.streaming.kinesis.KinesisReceiver. Instead of a hardcoded value, it should be configurable.
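The requested change amounts to reading the backoff from configuration and falling back to the current constant. A minimal sketch of that lookup, with a hypothetical property name (not an existing Spark key):

```python
DEFAULT_TASK_BACKOFF_MS = 500  # the value currently hardcoded in onStart

def task_backoff_ms(conf: dict) -> int:
    """Return the configured backoff, falling back to the hardcoded default.

    The key name below is invented for illustration, not a real Spark property.
    """
    return int(conf.get("spark.streaming.kinesis.taskBackoffTimeMillis",
                        DEFAULT_TASK_BACKOFF_MS))

assert task_backoff_ms({}) == 500
assert task_backoff_ms({"spark.streaming.kinesis.taskBackoffTimeMillis": "250"}) == 250
```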
[jira] [Commented] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.
[ https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662955#comment-15662955 ] Apache Spark commented on SPARK-18430: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15878 > Returned Message Null when Hitting an Invocation Exception of Function Lookup. > -- > > Key: SPARK-18430 > URL: https://issues.apache.org/jira/browse/SPARK-18430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > When the exception is an invocation exception during function lookup, we > return a useless/confusing error message: > For example, > {code} > df.selectExpr("format_string()") > {code} > or > {code} > df.selectExpr("concat_ws()") > {code} > Below is the error message we got: > {code} > null; line 1 pos 0 > org.apache.spark.sql.AnalysisException: null; line 1 pos 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.
[ https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18430: Assignee: Xiao Li (was: Apache Spark) > Returned Message Null when Hitting an Invocation Exception of Function Lookup. > -- > > Key: SPARK-18430 > URL: https://issues.apache.org/jira/browse/SPARK-18430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > When the exception is an invocation exception during function lookup, we > return a useless/confusing error message: > For example, > {code} > df.selectExpr("format_string()") > {code} > or > {code} > df.selectExpr("concat_ws()") > {code} > Below is the error message we got: > {code} > null; line 1 pos 0 > org.apache.spark.sql.AnalysisException: null; line 1 pos 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.
[ https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18430: Assignee: Apache Spark (was: Xiao Li) > Returned Message Null when Hitting an Invocation Exception of Function Lookup. > -- > > Key: SPARK-18430 > URL: https://issues.apache.org/jira/browse/SPARK-18430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > When the exception is an invocation exception during function lookup, we > return a useless/confusing error message: > For example, > {code} > df.selectExpr("format_string()") > {code} > or > {code} > df.selectExpr("concat_ws()") > {code} > Below is the error message we got: > {code} > null; line 1 pos 0 > org.apache.spark.sql.AnalysisException: null; line 1 pos 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.
[ https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18430: Description: When the exception is an invocation exception during function lookup, we return a useless/confusing error message: For example, {code} df.selectExpr("format_string()") {code} or {code} df.selectExpr("concat_ws()") {code} Below is the error message we got: {code} null; line 1 pos 0 org.apache.spark.sql.AnalysisException: null; line 1 pos 0 {code} was: When the exception is an invocation exception during function lookup, we return a useless/confusing error message: For example, {code} df.selectExpr("format_string()") {code} or {code} df.selectExpr("concat_ws()") {code} {code} null; line 1 pos 0 org.apache.spark.sql.AnalysisException: null; line 1 pos 0 {code} > Returned Message Null when Hitting an Invocation Exception of Function Lookup. > -- > > Key: SPARK-18430 > URL: https://issues.apache.org/jira/browse/SPARK-18430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > When the exception is an invocation exception during function lookup, we > return a useless/confusing error message: > For example, > {code} > df.selectExpr("format_string()") > {code} > or > {code} > df.selectExpr("concat_ws()") > {code} > Below is the error message we got: > {code} > null; line 1 pos 0 > org.apache.spark.sql.AnalysisException: null; line 1 pos 0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.
Xiao Li created SPARK-18430: --- Summary: Returned Message Null when Hitting an Invocation Exception of Function Lookup. Key: SPARK-18430 URL: https://issues.apache.org/jira/browse/SPARK-18430 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 2.1.0 Reporter: Xiao Li Assignee: Xiao Li When the exception is an invocation exception during function lookup, we return a useless/confusing error message: For example, {code} df.selectExpr("format_string()") {code} or {code} df.selectExpr("concat_ws()") {code} {code} null; line 1 pos 0 org.apache.spark.sql.AnalysisException: null; line 1 pos 0 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
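The pattern behind the bug, a reflective wrapper exception whose own message is null hiding the real cause, can be illustrated in Python by walking the cause chain for a meaningful message (a sketch of the fix idea, not Spark's code):

```python
def root_error_message(exc: BaseException) -> str:
    """Walk the cause chain and return the deepest non-empty message."""
    message = str(exc) or None
    cause = exc.__cause__
    while cause is not None:
        if str(cause):
            message = str(cause)
        cause = cause.__cause__
    return message or "unknown error"

# Simulate an invocation-style wrapper with an empty message of its own
# around the real, informative error.
try:
    try:
        raise ValueError("format_string requires at least one argument")
    except ValueError as inner:
        raise RuntimeError() from inner  # wrapper; str(...) is empty
except RuntimeError as wrapped:
    assert root_error_message(wrapped) == "format_string requires at least one argument"
```

Reporting only the wrapper's message is exactly what yields `null; line 1 pos 0` in the AnalysisException above.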
[jira] [Assigned] (SPARK-18429) implement a new Aggregate for CountMinSketch
[ https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18429: Assignee: (was: Apache Spark) > implement a new Aggregate for CountMinSketch > > > Key: SPARK-18429 > URL: https://issues.apache.org/jira/browse/SPARK-18429 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Zhenhua Wang > > Implement a new Aggregate to generate count min sketch, which is a wrapper of > CountMinSketch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18429) implement a new Aggregate for CountMinSketch
[ https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662906#comment-15662906 ] Apache Spark commented on SPARK-18429: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/15877 > implement a new Aggregate for CountMinSketch > > > Key: SPARK-18429 > URL: https://issues.apache.org/jira/browse/SPARK-18429 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Zhenhua Wang > > Implement a new Aggregate to generate count min sketch, which is a wrapper of > CountMinSketch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18429) implement a new Aggregate for CountMinSketch
[ https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18429: Assignee: Apache Spark > implement a new Aggregate for CountMinSketch > > > Key: SPARK-18429 > URL: https://issues.apache.org/jira/browse/SPARK-18429 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Zhenhua Wang >Assignee: Apache Spark > > Implement a new Aggregate to generate count min sketch, which is a wrapper of > CountMinSketch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18429) implement a new Aggregate for CountMinSketch
Zhenhua Wang created SPARK-18429: Summary: implement a new Aggregate for CountMinSketch Key: SPARK-18429 URL: https://issues.apache.org/jira/browse/SPARK-18429 Project: Spark Issue Type: New Feature Components: SQL Reporter: Zhenhua Wang Implement a new Aggregate to generate count min sketch, which is a wrapper of CountMinSketch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
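For context, a count-min sketch approximates per-item frequencies with a small 2-D counter table and never underestimates a count. A tiny illustrative Python version (unrelated to Spark's CountMinSketch internals):

```python
import hashlib

class CountMinSketch:
    """Tiny illustrative count-min sketch; Spark's version differs in detail."""

    def __init__(self, depth: int = 4, width: int = 64):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One deterministic hash per row, derived by salting with the row id.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Taking the minimum over rows bounds the overestimate from collisions.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["spark", "spark", "sql"]:
    cms.add(word)
assert cms.estimate("spark") >= 2  # a CMS never underestimates
assert cms.estimate("sql") >= 1
```

The proposed Aggregate would fold rows into such a sketch and merge sketches across partitions.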
[jira] [Commented] (SPARK-11496) Parallel implementation of personalized pagerank
[ https://issues.apache.org/jira/browse/SPARK-11496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662890#comment-15662890 ] Apache Spark commented on SPARK-11496: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/15876 > Parallel implementation of personalized pagerank > > > Key: SPARK-11496 > URL: https://issues.apache.org/jira/browse/SPARK-11496 > Project: Spark > Issue Type: New Feature > Components: GraphX >Affects Versions: 2.1.0 >Reporter: Yves Raimond >Assignee: Yves Raimond >Priority: Minor > Fix For: 2.1.0 > > > The current implementation of personalized pagerank only supports one source > node. Most applications of personalized pagerank require to run the > propagation for multiple source nodes. However code such as: > {code} > sourceVertices.map { sourceVertex => > graph.staticPersonalizedPageRank(sourceVertex, 10) } > {code} > Will be very slow, as it needs to run 10 iterations * sourceVertices.size > propagation steps. > It would be good to offer an alternative API that runs personalized pagerank > over a list of source vertices in parallel, so that it only needs to run 10 > propagation steps in the example above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
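The batching idea above can be sketched with plain power iteration: keep one rank vector per source and advance all of them inside the same iteration loop, so k sources need no more propagation rounds than one (illustrative Python, not the GraphX implementation):

```python
def personalized_pagerank(adj, sources, iters=10, alpha=0.15):
    """Advance one rank vector per source inside a single iteration loop.

    adj maps node -> list of out-neighbors; returns {source: {node: score}}.
    Illustrative sketch only.
    """
    nodes = list(adj)
    ranks = {s: {n: (1.0 if n == s else 0.0) for n in nodes} for s in sources}
    for _ in range(iters):
        for s in sources:
            # Restart mass returns to the personalization source only.
            new = {n: (alpha if n == s else 0.0) for n in nodes}
            for n in nodes:
                if adj[n]:
                    share = (1 - alpha) * ranks[s][n] / len(adj[n])
                    for m in adj[n]:
                        new[m] += share
            ranks[s] = new
    return ranks

adj = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = personalized_pagerank(adj, ["a", "b"])
for s in ("a", "b"):
    assert abs(sum(ranks[s].values()) - 1.0) < 1e-6  # mass conserved
assert ranks["a"]["a"] > ranks["a"]["c"]  # mass concentrates near the source
```

In GraphX terms, the analogous change is carrying a vector of scores per vertex so one Pregel run serves all sources.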
[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18427: - Component/s: MLlib > Update docs of mllib.KMeans > - > > Key: SPARK-18427 > URL: https://issues.apache.org/jira/browse/SPARK-18427 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Reporter: zhengruifeng >Priority: Minor > > 1,Remove {{runs}} from docs of {{mllib.KMeans}}, Since {{runs}} is now > disabled > 2,Add notes about {{k}} that fewer than {{k}} clusters to be returned, > according to comments in sources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18428) Update docs for Graph.op
[ https://issues.apache.org/jira/browse/SPARK-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18428: Assignee: (was: Apache Spark) > Update docs for Graph.op > > > Key: SPARK-18428 > URL: https://issues.apache.org/jira/browse/SPARK-18428 > Project: Spark > Issue Type: Improvement > Components: Documentation, GraphX >Reporter: zhengruifeng >Priority: Minor > > Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing > APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18428) Update docs for Graph.op
[ https://issues.apache.org/jira/browse/SPARK-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662856#comment-15662856 ] Apache Spark commented on SPARK-18428: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/15875 > Update docs for Graph.op > > > Key: SPARK-18428 > URL: https://issues.apache.org/jira/browse/SPARK-18428 > Project: Spark > Issue Type: Improvement > Components: Documentation, GraphX >Reporter: zhengruifeng >Priority: Minor > > Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing > APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18428) Update docs for Graph.op
[ https://issues.apache.org/jira/browse/SPARK-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18428: Assignee: Apache Spark > Update docs for Graph.op > > > Key: SPARK-18428 > URL: https://issues.apache.org/jira/browse/SPARK-18428 > Project: Spark > Issue Type: Improvement > Components: Documentation, GraphX >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing > APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18428) Update docs for Graph.op
zhengruifeng created SPARK-18428: Summary: Update docs for Graph.op Key: SPARK-18428 URL: https://issues.apache.org/jira/browse/SPARK-18428 Project: Spark Issue Type: Improvement Components: Documentation, GraphX Reporter: zhengruifeng Priority: Minor Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662832#comment-15662832 ] Hyukjin Kwon commented on SPARK-18420: -- It seems {code} [ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method. {code} is missing here. > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: coneyliu >Priority: Minor > > Small fix, fix the compile errors caused by checkstyle. > Before: > ``` > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] > (imports) UnusedImports: Unused import - > org.apache.commons.crypto.cipher.CryptoCipherFactory. > [ERROR] > src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] > (modifier) RedundantModifier: Redundant 'public' modifier. > [ERROR] > src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] > (sizes) LineLength: Line is longer than 100 characters (found 113). > [ERROR] > src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] > (sizes) LineLength: Line is longer than 100 characters (found 110). > src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] > (sizes) LineLength: Line is longer than 100 characters (found 103). > [ERROR] > src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] > (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors. > [ERROR] > src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] > (regexp) RegexpSingleline: No trailing whitespace allowed.
> ``` > After: > `mvn install` > `lint-java` > Checkstyle checks passed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662784#comment-15662784 ] Dongjoon Hyun commented on SPARK-18413: --- Yep. Thank you for the review. I left a comment for that. After collecting more opinions, I'll update the PR accordingly. > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm trying to save a Spark SQL result to Oracle. > I found that Spark creates a JDBC connection for each partition. > If the SQL creates too many partitions, the database can't hold so many > connections and returns an exception. > In the situation above it is 200, because of the "group by" and > "spark.sql.shuffle.partitions". > The Spark source code in JdbcUtils is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > Maybe we can add a property for df.repartition(num).foreachPartition?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662759#comment-15662759 ] lichenglin commented on SPARK-18413: I'm sorry, my network is too poor to download the dependencies from the Maven repo to build Spark. I have made a comment on your PR, please check whether it is right. Thanks > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm trying to save a Spark SQL result to Oracle. > I found that Spark creates a JDBC connection for each partition. > If the SQL creates too many partitions, the database can't hold so many > connections and returns an exception. > In the situation above it is 200, because of the "group by" and > "spark.sql.shuffle.partitions". > The Spark source code in JdbcUtils is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > Maybe we can add a property for df.repartition(num).foreachPartition?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
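The essence of the proposal is capping write-side parallelism, and hence concurrent JDBC connections, before `foreachPartition` runs. A minimal sketch of the sizing rule, where the max-connections property itself is hypothetical, not an existing Spark option:

```python
def write_partitions(num_partitions, max_connections=None):
    """Cap the number of write-side partitions (and hence JDBC connections).

    Mirrors the idea of df.repartition(n).foreachPartition(...); the
    max-connections property is invented for illustration.
    """
    if max_connections is None:
        return num_partitions  # no cap configured: keep current behaviour
    return min(num_partitions, max(1, max_connections))

# 200 shuffle partitions, but the database tolerates only ~20 sessions:
assert write_partitions(200, 20) == 20
assert write_partitions(8, None) == 8
```

With the cap in place, the ORA-12519 failure mode above (more sessions requested than the listener can hand out) is avoided at the cost of less write parallelism.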
[jira] [Assigned] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18408: Assignee: Apache Spark > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni >Assignee: Apache Spark > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18408: Assignee: (was: Apache Spark) > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662754#comment-15662754 ] Apache Spark commented on SPARK-18408: -- User 'Yunni' has created a pull request for this issue: https://github.com/apache/spark/pull/15874 > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18412) SparkR spark.randomForest classification throws exception when training on libsvm data
[ https://issues.apache.org/jira/browse/SPARK-18412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-18412. - Resolution: Fixed Fix Version/s: 2.1.0 > SparkR spark.randomForest classification throws exception when training on > libsvm data > -- > > Key: SPARK-18412 > URL: https://issues.apache.org/jira/browse/SPARK-18412 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > {{spark.randomForest}} classification throws exception when training on > libsvm data. It can be reproduced as following: > {code} > df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source > = "libsvm") > model <- spark.randomForest(df, label ~ features, "classification") > {code} > The exception is: > {code} > Error in handleErrors(returnStatus, conn) : > java.lang.IllegalArgumentException: requirement failed: If label column > already exists, forceIndexLabel can not be set with true. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.feature.RFormula.transformSchema(RFormula.scala:205) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70) > at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:136) > at > org.apache.spark.ml.r.RandomForestClassifierWrapper$.fit(RandomForestClassificationWrapper.scala:86) > at > org.apache.spark.ml.r.RandomForestClassifierWrapper.fit(RandomForestClassificationWrapper.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:172) > {code} > This error occurs because the label column of the R formula already exists, so we > can not force-index the label. However, we must index the label for > classification algorithms, so we need to rename RFormula.labelCol to a > new value and then index the original label. > This issue also appears in other algorithms: spark.naiveBayes, spark.glm (only > for binomial family) and spark.gbt (only for classification).
[jira] [Updated] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18408: --- Description: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to {{Array of Vector}} instead of {{Vectors}} - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} as the dimension of {{Vector}} - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} to {{MinHashLSH}} - Make randUnitVectors/randCoefficients private - Make Multi-Probe NN Search and {{hashDistance}} private for future discussion was: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to {{Array of Vector}} instead of {{Vectors}} - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} as the dimension of {{Vector}} - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} to {{MinHashLSH}} - Make randUnitVectors/randCoefficients private > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
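The proposed output shape, an Array of `numHashTables` Vectors each holding `numHashFunctions` hash values, can be illustrated with a toy MinHash-style signature in Python (made-up coefficients; not the MinHashLSH implementation):

```python
def minhash_signature(elems, num_hash_tables=3, num_hash_functions=2,
                      prime=2038074743):
    """Return num_hash_tables 'vectors', each of num_hash_functions values.

    Coefficients are fixed toy values; a real MinHash draws them randomly.
    """
    tables = []
    for t in range(num_hash_tables):
        vector = []
        for f in range(num_hash_functions):
            # Deterministic toy coefficients for hash h(x) = (a*x + b) mod p.
            a = 2 * (t * num_hash_functions + f) + 1
            b = t + f + 1
            vector.append(min((a * e + b) % prime for e in elems))
        tables.append(vector)
    return tables

sig = minhash_signature({1, 4, 7})
assert len(sig) == 3                   # numHashTables entries in the Array
assert all(len(v) == 2 for v in sig)   # each Vector has numHashFunctions values
```

Under the old schema these same values were flattened into single Vectors, which is why the split into `numHashTables` x `numHashFunctions` is listed first.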
[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18427: - Description: 1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now disabled. 2. Add a note about {{k}}: fewer than {{k}} clusters may be returned, according to comments in the sources. was: 1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now disabled. 2. Add notes for {{k}} according to comments in the sources. > Update docs of mllib.KMeans > - > > Key: SPARK-18427 > URL: https://issues.apache.org/jira/browse/SPARK-18427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: zhengruifeng >Priority: Minor > > 1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now > disabled. > 2. Add a note about {{k}}: fewer than {{k}} clusters may be returned, > according to comments in the sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18427: - Description: 1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now disabled. 2. Add notes for {{k}} according to comments in the sources. was: Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should also be updated. > Update docs of mllib.KMeans > - > > Key: SPARK-18427 > URL: https://issues.apache.org/jira/browse/SPARK-18427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: zhengruifeng >Priority: Minor > > 1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now > disabled. > 2. Add notes for {{k}} according to comments in the sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-18427: - Summary: Update docs of mllib.KMeans (was: Remove 'runs' from docs of mllib.KMeans) > Update docs of mllib.KMeans > - > > Key: SPARK-18427 > URL: https://issues.apache.org/jira/browse/SPARK-18427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: zhengruifeng >Priority: Minor > > Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should > also be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18427: Assignee: Apache Spark > Remove 'runs' from docs of mllib.KMeans > - > > Key: SPARK-18427 > URL: https://issues.apache.org/jira/browse/SPARK-18427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should > also be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662634#comment-15662634 ] Apache Spark commented on SPARK-18427: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/15873 > Remove 'runs' from docs of mllib.KMeans > - > > Key: SPARK-18427 > URL: https://issues.apache.org/jira/browse/SPARK-18427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: zhengruifeng >Priority: Minor > > Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should > also be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans
[ https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18427: Assignee: (was: Apache Spark) > Remove 'runs' from docs of mllib.KMeans > - > > Key: SPARK-18427 > URL: https://issues.apache.org/jira/browse/SPARK-18427 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: zhengruifeng >Priority: Minor > > Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should > also be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans
zhengruifeng created SPARK-18427: Summary: Remove 'runs' from docs of mllib.KMeans Key: SPARK-18427 URL: https://issues.apache.org/jira/browse/SPARK-18427 Project: Spark Issue Type: Improvement Components: Documentation Reporter: zhengruifeng Priority: Minor Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should also be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18426: Fix Version/s: 2.0.3 > Python Documentation Fix for Structured Streaming Programming Guide > --- > > Key: SPARK-18426 > URL: https://issues.apache.org/jira/browse/SPARK-18426 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Denny Lee >Assignee: Denny Lee >Priority: Minor > Labels: documentation > Fix For: 2.0.3, 2.1.0 > > > When running python example in Structured Streaming Guide, get the error: > spark = SparkSession\ > TypeError: 'Builder' object is not callable > This is fixed by changing .builder() to .builder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
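The root cause of the TypeError above is that {{builder}} is an attribute on {{SparkSession}} that holds a builder object, not a method. A stdlib sketch of that relationship (the names mirror PySpark's, but this is a mock for illustration, not the real class):

```python
# Minimal stand-in for pyspark's SparkSession/Builder relationship
# (names mirror pyspark; this is a sketch, not the real implementation).
class Builder:
    def appName(self, name):
        self.name = name
        return self

    def getOrCreate(self):
        return f"session:{self.name}"

class SparkSession:
    builder = Builder()   # a class attribute, NOT a method

# Correct: access the attribute directly, as the doc fix suggests.
session = SparkSession.builder.appName("demo").getOrCreate()
assert session == "session:demo"

# Incorrect: '.builder()' tries to call the Builder instance itself, which
# raises TypeError: 'Builder' object is not callable.
try:
    SparkSession.builder().appName("demo")
except TypeError as e:
    assert "not callable" in str(e)
```

So changing {{.builder()}} to {{.builder}} is not cosmetic; the parenthesized form invokes a call on an object that defines no `__call__`.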
[jira] [Resolved] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18426. - Resolution: Fixed Assignee: Denny Lee Fix Version/s: (was: 2.0.2) 2.1.0 2.0.3 > Python Documentation Fix for Structured Streaming Programming Guide > --- > > Key: SPARK-18426 > URL: https://issues.apache.org/jira/browse/SPARK-18426 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Denny Lee >Assignee: Denny Lee >Priority: Minor > Labels: documentation > Fix For: 2.0.3, 2.1.0 > > > When running python example in Structured Streaming Guide, get the error: > spark = SparkSession\ > TypeError: 'Builder' object is not callable > This is fixed by changing .builder() to .builder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18426: Fix Version/s: (was: 2.0.3) > Python Documentation Fix for Structured Streaming Programming Guide > --- > > Key: SPARK-18426 > URL: https://issues.apache.org/jira/browse/SPARK-18426 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Denny Lee >Assignee: Denny Lee >Priority: Minor > Labels: documentation > Fix For: 2.1.0 > > > When running python example in Structured Streaming Guide, get the error: > spark = SparkSession\ > TypeError: 'Builder' object is not callable > This is fixed by changing .builder() to .builder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17836) Use cross validation to determine the number of clusters for EM or KMeans algorithms
[ https://issues.apache.org/jira/browse/SPARK-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662142#comment-15662142 ] Aditya edited comment on SPARK-17836 at 11/13/16 9:39 PM: -- I want to work on this issue. Is it fine? was (Author: aditya1702): [-Sean Owen] I want to work on this issue. Is it fine? > Use cross validation to determine the number of clusters for EM or KMeans > algorithms > > > Key: SPARK-17836 > URL: https://issues.apache.org/jira/browse/SPARK-17836 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Lei Wang >Priority: Minor > > Sometimes it's not easy for users to determine number of clusters. > It would be very useful If spark ml can support this. > There are several methods to do this according to wiki > https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set > Weka uses cross validation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17836) Use cross validation to determine the number of clusters for EM or KMeans algorithms
[ https://issues.apache.org/jira/browse/SPARK-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662142#comment-15662142 ] Aditya edited comment on SPARK-17836 at 11/13/16 9:39 PM: -- [-Sean Owen] I want to work on this issue. Is it fine? was (Author: aditya1702): Sean Owen I want to work on this issue. Is it fine? > Use cross validation to determine the number of clusters for EM or KMeans > algorithms > > > Key: SPARK-17836 > URL: https://issues.apache.org/jira/browse/SPARK-17836 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Lei Wang >Priority: Minor > > Sometimes it's not easy for users to determine number of clusters. > It would be very useful If spark ml can support this. > There are several methods to do this according to wiki > https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set > Weka uses cross validation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17836) Use cross validation to determine the number of clusters for EM or KMeans algorithms
[ https://issues.apache.org/jira/browse/SPARK-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662142#comment-15662142 ] Aditya commented on SPARK-17836: Sean Owen I want to work on this issue. Is it fine? > Use cross validation to determine the number of clusters for EM or KMeans > algorithms > > > Key: SPARK-17836 > URL: https://issues.apache.org/jira/browse/SPARK-17836 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Lei Wang >Priority: Minor > > Sometimes it's not easy for users to determine the number of clusters. > It would be very useful if Spark ML could support this. > There are several methods to do this, according to the wiki: > https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set > Weka uses cross validation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
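The held-out-cost selection loop the issue asks for can be sketched with a toy 1-D k-means. This illustrates the model-selection idea only; it is not MLlib's algorithm nor a proposed API, and the data and thresholds are invented for the example:

```python
import random

def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means, just enough to demonstrate model selection."""
    centers = sorted(random.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            clusters[min(range(k), key=lambda c: abs(p - centers[c]))].append(p)
        # Recompute centers; keep the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def cost(points, centers):
    """WSSSE-style cost: squared distance to the nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

random.seed(0)
# Three well-separated 1-D clusters around 0, 5 and 10.
data = [random.gauss(mu, 0.1) for mu in (0.0, 5.0, 10.0) for _ in range(30)]
random.shuffle(data)
train, held_out = data[:60], data[60:]

# Cross-validation-style selection: fit on train, score on held-out data,
# then pick the k where the held-out cost stops improving sharply.
scores = {k: cost(held_out, kmeans_1d(train, k)) for k in range(1, 6)}
assert scores[3] < scores[1]   # three centers fit three clusters far better
```

The same loop structure would apply to EM, with held-out log-likelihood in place of the squared-error cost.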
[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-18424: -- Assignee: Bill Chambers > Improve Date Parsing Functionality > -- > > Key: SPARK-18424 > URL: https://issues.apache.org/jira/browse/SPARK-18424 > Project: Spark > Issue Type: Improvement >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Minor > > I've found it quite cumbersome to work with dates thus far in Spark; it can > be hard to reason about the time format and what type you're working with. For > instance, > say that I have a date in the format > {code} > 2017-20-12 > // Y-D-M > {code} > In order to parse that into a Date, I have to perform several conversions: > {code} > to_date( > unix_timestamp(col("date"), dateFormat) > .cast("timestamp")) >.alias("date") > {code} > I propose simplifying this by keeping the existing to_date function but adding > one that accepts a format for that date. I also propose a to_timestamp > function that also supports a format, > so that you can avoid the above conversion entirely. > It's also worth mentioning that many other databases support this. For > instance, MySQL has the STR_TO_DATE function, and Netezza supports the > to_timestamp semantic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
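For comparison, Python's stdlib `strptime` already works the way the proposal describes: the caller supplies the format, and parsing happens in one step. (Spark's actual format strings would follow java.text.SimpleDateFormat patterns, not these `%` codes; `to_date` here is a stand-in for the proposed overload, not an existing Spark signature.)

```python
from datetime import datetime

# What a format-accepting to_date would do, sketched with stdlib strptime.
def to_date(value, fmt):
    return datetime.strptime(value, fmt).date()

# The Y-D-M example from the issue parses in a single call,
# with no unix_timestamp/cast round-trip.
d = to_date("2017-20-12", "%Y-%d-%m")
assert (d.year, d.month, d.day) == (2017, 12, 20)
```

The one-step form also makes the result type obvious: a date in, a date out, with the format named at the call site.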
[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662092#comment-15662092 ] yuhao yang commented on SPARK-18356: Checking and caching the training data is quite common in MLlib algorithms. Some algorithms (LR, ANN) persist the RDD data if the parent DataFrames are not cached (using a variable handlePersistence). We can refer to that for the implementation. > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Priority: Minor > Labels: easyfix > > Hello, > I'm a newbie in Spark, but I think I found a small problem that can affect > Spark KMeans performance. > Before starting to explain the problem, I want to explain the warning that I > faced. > I tried to use Spark KMeans with DataFrames to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning: > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched the Spark source code to find the source of this problem, and > realized there are two classes responsible for this warning: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala) > > When my DataFrame is cached, the fit method transforms my DataFrame into an > internal RDD which is not cached.
> DataFrame -> RDD -> run training KMeans algo(RDD) > -> The first class (ml package) is responsible for converting the DataFrame into an > RDD and then calling the KMeans algorithm > -> The second class (mllib package) implements the KMeans algorithm, and here > Spark verifies whether the RDD is cached; if not, a warning is generated. > So, the solution to this problem is to cache the RDD before running the KMeans > algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All we need is to add two lines: > cache the RDD just after the DataFrame transformation, then uncache it after > the training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
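The handlePersistence pattern mentioned in the comment above can be sketched as follows (assumed shape only; the real LR/ANN Scala code differs in detail): persist the derived data only when the caller has not cached the input, and release it once training finishes.

```python
# Sketch of the 'handlePersistence' pattern: cache the derived dataset only
# when the caller hasn't cached the input, and release it after training.
class Dataset:
    def __init__(self, rows, cached=False):
        self.rows = rows
        self.cached = cached

    def persist(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False
        return self

def fit(df):
    rdd = Dataset(df.rows)            # stand-in for the DataFrame -> RDD conversion
    handle_persistence = not df.cached
    if handle_persistence:
        rdd.persist()                 # avoid recomputing the input every iteration
    try:
        model = sum(rdd.rows)         # placeholder for the k-means iterations
    finally:
        if handle_persistence:
            rdd.unpersist()           # never leak a cache the caller didn't ask for
    return model

assert fit(Dataset([1, 2, 3])) == 6
```

The key design point is that the caller's own caching decision is respected: if the input DataFrame is already cached, the algorithm neither re-caches nor uncaches anything.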
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662015#comment-15662015 ] yuhao yang commented on SPARK-18374: With the default behavior of the _Tokenizer_ and _RegexTokenizer_, I think it's more reasonable to directly include words like _won't_ and _haven't_ in the stop words lists, as shown in the list at http://www.ranks.nl/stopwords. More specifically, if a user is using the default _Tokenizer_ or _RegexTokenizer_ in spark.ml without customization, then _weren_ and _wasn_ in the current stop words list are useless, whereas _weren't_ and _wasn't_ can be helpful. The default behavior of ml transformers should be consistent and effective. > Incorrect words in StopWords/english.txt > > > Key: SPARK-18374 > URL: https://issues.apache.org/jira/browse/SPARK-18374 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: nirav patel > > I was just double-checking english.txt for the list of stopwords, as I felt it was > taking out valid tokens like 'won'. I think the issue is that the english.txt list is > missing the apostrophe character and all characters after the apostrophe. So "won't" > became "won" in that list; "wouldn't" is "wouldn". > Here are some incorrect tokens in this list: > won > wouldn > ma > mightn > mustn > needn > shan > shouldn > wasn > weren > I think the ideal list should have both styles, i.e. both won't and wont should be > part of english.txt, as some tokenizers might remove special characters. But > 'won' obviously shouldn't be in this list. > Here's the list of Snowball English stop words: > http://snowball.tartarus.org/algorithms/english/stop.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
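The argument can be made concrete with a rough stand-in for the default whitespace-based tokenization (this mimics, but is not, Spark's RegexTokenizer): apostrophes survive inside tokens, so truncated entries like _weren_ never match a whole token.

```python
import re

stop_truncated = {"weren", "wasn"}     # entries as they appear in english.txt
stop_full = {"weren't", "wasn't"}      # entries the comment argues for

def tokenize(text):
    # Rough stand-in for the default tokenizer behavior: lowercase and split
    # on whitespace; apostrophes survive inside tokens.
    return re.findall(r"\S+", text.lower())

tokens = tokenize("We weren't ready")
# The truncated entries never match whole tokens, so nothing is removed...
assert [t for t in tokens if t not in stop_truncated] == ["we", "weren't", "ready"]
# ...whereas the apostrophe-preserving entries filter as intended.
assert [t for t in tokens if t not in stop_full] == ["we", "ready"]
```

A tokenizer configured to strip punctuation would produce _werent_ or split off _t_, which is why the bug report suggests carrying both spellings in the list.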
[jira] [Assigned] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18426: Assignee: Apache Spark > Python Documentation Fix for Structured Streaming Programming Guide > --- > > Key: SPARK-18426 > URL: https://issues.apache.org/jira/browse/SPARK-18426 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Denny Lee >Assignee: Apache Spark >Priority: Minor > Labels: documentation > Fix For: 2.0.2 > > > When running python example in Structured Streaming Guide, get the error: > spark = SparkSession\ > TypeError: 'Builder' object is not callable > This is fixed by changing .builder() to .builder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661961#comment-15661961 ] Apache Spark commented on SPARK-18426: -- User 'dennyglee' has created a pull request for this issue: https://github.com/apache/spark/pull/15872 > Python Documentation Fix for Structured Streaming Programming Guide > --- > > Key: SPARK-18426 > URL: https://issues.apache.org/jira/browse/SPARK-18426 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Denny Lee >Priority: Minor > Labels: documentation > Fix For: 2.0.2 > > > When running python example in Structured Streaming Guide, get the error: > spark = SparkSession\ > TypeError: 'Builder' object is not callable > This is fixed by changing .builder() to .builder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18426: Assignee: (was: Apache Spark) > Python Documentation Fix for Structured Streaming Programming Guide > --- > > Key: SPARK-18426 > URL: https://issues.apache.org/jira/browse/SPARK-18426 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Denny Lee >Priority: Minor > Labels: documentation > Fix For: 2.0.2 > > > When running python example in Structured Streaming Guide, get the error: > spark = SparkSession\ > TypeError: 'Builder' object is not callable > This is fixed by changing .builder() to .builder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide
Denny Lee created SPARK-18426: - Summary: Python Documentation Fix for Structured Streaming Programming Guide Key: SPARK-18426 URL: https://issues.apache.org/jira/browse/SPARK-18426 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.0.1 Reporter: Denny Lee Priority: Minor Fix For: 2.0.2 When running the Python example in the Structured Streaming Guide, you get the error: spark = SparkSession\ TypeError: 'Builder' object is not callable This is fixed by changing .builder() to .builder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661918#comment-15661918 ] koert kuipers commented on SPARK-15798: --- It turns out the operations needed for this are already mostly available in Dataset. The one big limitation is that the secondary sort does not seem to get pushed into the shuffle in Spark SQL (but it is done efficiently, with spilling to disk etc.). See this conversation: https://www.mail-archive.com/user@spark.apache.org/msg58844.html I added support for Dataset secondary sort to spark-sorted; see here: https://github.com/tresata/spark-sorted I would also like to add support for DataFrame, but to do so I would need operations to convert a Row to UDF inputs and back, which in Spark SQL are available (Encoder, ScalaReflection, etc.) but support InternalRow only, while in a 3rd-party library I need to work with normal Rows, since InternalRows are never exposed (for example, in Dataset[Row].mapPartitions I have Rows but not InternalRows). > Secondary sort in Dataset/DataFrame > --- > > Key: SPARK-15798 > URL: https://issues.apache.org/jira/browse/SPARK-15798 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: koert kuipers > > Secondary sort for Spark RDDs was discussed in > https://issues.apache.org/jira/browse/SPARK-3655 > Since the RDD API allows for easy extensions outside the core library, this > was implemented separately here: > https://github.com/tresata/spark-sorted > However, it seems to me that with Dataset an implementation of such a feature > in a 3rd-party library is not really an option. > Dataset already has methods that suggest a secondary sort is present, such as > in KeyValueGroupedDataset: > {noformat} > def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): > Dataset[U] > {noformat} > This operation pushes all the data to the reducer, something you would only > want to do if you need the elements in a particular order.
> How about as an API sortBy methods in KeyValueGroupedDataset and > RelationalGroupedDataset? > {noformat} > dataFrame.groupBy("a").sortBy("b").fold(...) > {noformat} > (yes i know RelationalGroupedDataset doesnt have a fold yet... but it should > :)) > {noformat} > dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
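The ordering contract behind the proposed groupBy(...).sortBy(...) API can be sketched in memory (no shuffle, no spilling — just the semantics the grouped function would observe):

```python
from itertools import groupby
from operator import itemgetter

# In-memory sketch of groupBy(key).sortBy(secondary) semantics: within each
# key group, values arrive ordered by the secondary key, which is what a
# flatMapGroups-style function would see under the proposed API.
records = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]

def grouped_sorted(rows):
    # Sort once by (key, secondary), then group by key: each group's values
    # stream in secondary-key order.
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {k: [v for _, v in g] for k, g in groupby(ordered, key=itemgetter(0))}

groups = grouped_sorted(records)
assert groups == {"a": [1, 2, 3], "b": [1, 2]}
```

The efficiency question in the comment is exactly whether this (key, secondary) sort can ride along with the shuffle's own sort instead of being a separate pass.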
[jira] [Assigned] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17116: Assignee: Apache Spark > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Assignee: Apache Spark >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxIter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17116: Assignee: (was: Apache Spark) > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxIter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time
[ https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661887#comment-15661887 ] Apache Spark commented on SPARK-17116: -- User 'aditya1702' has created a pull request for this issue: https://github.com/apache/spark/pull/15871 > Allow params to be a {string, value} dict at fit time > - > > Key: SPARK-17116 > URL: https://issues.apache.org/jira/browse/SPARK-17116 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Manoj Kumar >Priority: Minor > > Currently, it is possible to override the default params set at constructor > time by supplying a ParamMap which is essentially a (Param: value) dict. > Looking at the codebase, it should be trivial to extend this to a (string, > value) representation. > {code} > # This hints that the maxiter param of the lr instance is modified in-place > lr = LogisticRegression(maxIter=10, regParam=0.01) > lr.fit(dataset, {lr.maxIter: 20}) > # This seems more natural. > lr.fit(dataset, {"maxIter": 20}) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
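The requested change boils down to resolving string keys against the estimator's Param objects before applying the overrides. A sketch with hypothetical classes (the real pyspark Params machinery differs in detail; `Estimator`, `_resolve`, and `_params` here are invented for illustration):

```python
# Sketch of resolving {"maxIter": 20}-style dicts at fit time.
class Param:
    def __init__(self, name):
        self.name = name

class Estimator:
    def __init__(self, **defaults):
        self._params = {n: Param(n) for n in defaults}
        self._values = dict(defaults)

    def _resolve(self, key):
        # Accept either a Param object or its string name.
        if isinstance(key, Param):
            return key
        return self._params[key]   # raises KeyError for unknown param names

    def fit(self, dataset, params=None):
        values = dict(self._values)
        for key, value in (params or {}).items():
            values[self._resolve(key).name] = value
        return values              # stand-in for a fitted model

lr = Estimator(maxIter=10, regParam=0.01)
# Both spellings resolve to the same override without mutating lr itself.
assert lr.fit([], {lr._params["maxIter"]: 20}) == lr.fit([], {"maxIter": 20})
```

Resolving strings through the param registry (rather than setting attributes directly) preserves the existing validation path and keeps unknown names failing loudly.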
[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class
[ https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658492#comment-15658492 ] Aniket Bhatnagar edited comment on SPARK-18251 at 11/13/16 5:17 PM: Hi [~jayadevan.m] Which version of Scala and Spark did you use? I can reproduce this on Spark 2.0.1 and Scala 2.11.8. I have created a sample project with all the dependencies to easily reproduce this: https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug To reproduce the bug, simply check out the project and run the command sbt run. Thanks, Aniket was (Author: aniket): Hi [~jayadevan.m] Which version of scala spark did you use? I can reproduce this on spark 2.0.1 and scala 2.11.8. I have created a sample project with all the dependencies to easily reproduce this: https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug To reproduce the bug, simple checkout the project and run the command sbt run. Thanks, Aniket > DataSet API | RuntimeException: Null value appeared in non-nullable field > when holding Option Case Class > > > Key: SPARK-18251 > URL: https://issues.apache.org/jira/browse/SPARK-18251 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: OS X >Reporter: Aniket Bhatnagar > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding a non-nullable field. For > instance, if we have the following case class: > case class DataRow(id: Int, value: String) > Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot > hold Empty.
If it does so, the following exception is thrown: > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: > Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: > Null value appeared in non-nullable field: > - field (class: "scala.Int", name: "id") > - option value class: "DataSetOptBug.DataRow" > - root class: "scala.Option" > If the schema is inferred from a Scala tuple/case class, or a Java bean, > please try to use scala.Option[_] or other nullable types (e.g. > java.lang.Integer instead of int/scala.Int). > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The bug can be reproduce by using the program: > https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
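A minimal reproduction of the failure described above can be sketched as follows (assuming Spark 2.0.1 on the classpath; the object and dataset names are illustrative, chosen to match the class names in the stack trace):

```scala
import org.apache.spark.sql.SparkSession

object DataSetOptBug {
  case class DataRow(id: Int, value: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-18251-repro")
      .getOrCreate()
    import spark.implicits._

    // Some(DataRow) elements encode fine; the None element triggers
    // "Null value appeared in non-nullable field" at runtime because
    // DataRow.id is a non-nullable scala.Int.
    val ds = spark.createDataset(Seq[Option[DataRow]](Some(DataRow(1, "a")), None))
    ds.collect() // expected to throw the SparkException shown above
  }
}
```

As the exception message itself suggests, declaring the field as `id: Option[Int]` (or `java.lang.Integer`) avoids the crash, at the cost of changing the case class.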
[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd
[ https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661769#comment-15661769 ] Dongjoon Hyun commented on SPARK-18413: --- Hi, [~lichenglingl]. Although the code is simple, I cannot find a proper unit testing method for this. So, it took a long time. Could you try the PR for your use case? > Add a property to control the number of partitions when save a jdbc rdd > --- > > Key: SPARK-18413 > URL: https://issues.apache.org/jira/browse/SPARK-18413 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.0.1 >Reporter: lichenglin > > {code} > CREATE or replace TEMPORARY VIEW resultview > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB", > dbtable "result", > user "HIVE", > password "HIVE" > ); > --set spark.sql.shuffle.partitions=200 > insert overwrite table resultview select g,count(1) as count from > tnet.DT_LIVE_INFO group by g > {code} > I'm trying to save a Spark SQL result to Oracle. > And I found Spark will create a jdbc connection for each partition. > If the SQL creates too many partitions, the database can't hold so many > connections and returns an exception. 
> In the above situation it is 200 because of the "group by" and > "spark.sql.shuffle.partitions". > The Spark source code in JdbcUtils is > {code} > def saveTable( > df: DataFrame, > url: String, > table: String, > properties: Properties) { > val dialect = JdbcDialects.get(url) > val nullTypes: Array[Int] = df.schema.fields.map { field => > getJdbcType(field.dataType, dialect).jdbcNullType > } > val rddSchema = df.schema > val getConnection: () => Connection = createConnectionFactory(url, > properties) > val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, > "1000").toInt > df.foreachPartition { iterator => > savePartition(getConnection, table, iterator, rddSchema, nullTypes, > batchSize, dialect) > } > } > {code} > Maybe we can add a property for df.repartition(num).foreachPartition? > In fact I got an exception "ORA-12519, TNS:no appropriate service handler > found"
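Until such a property exists, the idea suggested above can be approximated from user code by repartitioning before the write, so that at most one JDBC connection per partition is opened. A sketch (the wrapper and its `maxConnections` parameter are hypothetical, not an existing Spark option):

```scala
import java.util.Properties
import org.apache.spark.sql.DataFrame

// Cap the number of concurrent JDBC connections by capping the number
// of partitions before the write. Each partition writes over its own
// connection, so at most maxConnections are open at once.
def saveWithConnectionCap(df: DataFrame, url: String, table: String,
                          props: Properties, maxConnections: Int): Unit = {
  val capped =
    if (df.rdd.getNumPartitions > maxConnections) df.repartition(maxConnections)
    else df
  capped.write.mode("append").jdbc(url, table, props)
}
```

The extra shuffle from `repartition` is the price paid for not exhausting the database's connection limit (the ORA-12519 error mentioned above).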
[jira] [Commented] (SPARK-18421) Dynamic disk allocation
[ https://issues.apache.org/jira/browse/SPARK-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661760#comment-15661760 ] Aniket Bhatnagar commented on SPARK-18421: -- I agree that Spark doesn't manage storage, and therefore running an agent and dynamically adding storage to a host is outside the scope. However, what's in scope for Spark is the ability to use added storage without forcing a restart of the executor process. Specifically, spark.local.dirs needs to be a dynamic property. For example, spark.local.dirs could be configured as a glob pattern (something like /mnt*) and whenever a new disk is added & mounted (as /mnt), Spark's shuffle service should be able to use the locally added disk. Additionally, there may be a task to rebalance shuffle blocks once a disk is added so that all local dirs are once again used equally. I don't think detection of a newly mounted directory, rebalancing of blocks, etc. is cloud specific, as all of this can be done using Java's IO/NIO API. This feature would, however, be mostly useful for users running Spark in the cloud. Currently, users are expected to guess their shuffle storage footprint and mount right-sized disks accordingly. If the guess is wrong, the job fails, wasting a lot of time. > Dynamic disk allocation > --- > > Key: SPARK-18421 > URL: https://issues.apache.org/jira/browse/SPARK-18421 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aniket Bhatnagar >Priority: Minor > > Dynamic allocation feature allows you to add executors and scale computation > power. This is great, however, I feel like we also need a way to dynamically > scale storage. Currently, if the disk is not able to hold the spilled/shuffle > data, the job is aborted (in yarn, the node manager kills the container) > causing frustration and loss of time. 
In deployments like AWS EMR, it is > possible to run an agent that adds disks on the fly if it sees that the disks > are running out of space, and it would be great if Spark could immediately > start using the added disks just as it does when new executors are added.
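The glob idea floated above can be illustrated with plain Java NIO, with no Spark APIs involved; a periodic rescan like this is a sketch of the proposal, not an existing Spark mechanism:

```scala
import java.io.File
import java.nio.file.{FileSystems, Paths}

// Resolve a glob such as "/mnt*" to the directories currently mounted.
// Re-running this on a schedule would let a newly mounted disk (e.g.
// /mnt3) be picked up as an additional local dir without restarting
// the executor process.
def resolveLocalDirs(glob: String): Seq[File] = {
  val matcher = FileSystems.getDefault.getPathMatcher(s"glob:$glob")
  val root = new File("/")
  Option(root.listFiles()).getOrElse(Array.empty[File])
    .filter(f => f.isDirectory && matcher.matches(Paths.get(f.getPath)))
    .toSeq
}
```

Detecting the new directory is the easy part; as noted above, the harder pieces are making spark.local.dirs re-evaluable at runtime and rebalancing existing shuffle blocks across the enlarged set of dirs.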
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661640#comment-15661640 ] Ran Haim edited comment on SPARK-17436 at 11/13/16 3:37 PM: Hi, I only got a chance to work on it now. I saw that the whole class tree got changed - I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter. The problem is I cannot seem to run mvn clean install... A lot of tests fail (not relevant to my change, and they happen without it) - and I do want to make sure there are relevant tests (though I did not find any). Any ideas? Also I cannot create a pull request, I get 403. Ran, > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, the data writer can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method, when too many files are opened (configurable), it > starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). 
> I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using the partition key as key and an arraylist as value, and then > just walk through all the keys and write them in the original order - this will > probably be faster as there is no need for ordering.
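The proposal above (replace the sorter with per-partition buffers that preserve arrival order) can be sketched in plain Scala; the generic key extractor here is a stand-in for the real partition-key logic inside the writer, not actual Spark internals:

```scala
import scala.collection.mutable

// Toy model of the proposed fix: group rows by partition key while
// preserving the order in which rows arrived within each partition,
// instead of re-sorting them (which destroys any secondary sort).
// Returns (key, rows) pairs in first-seen key order.
def groupPreservingOrder[K, R](rows: Iterator[R], key: R => K): Seq[(K, Seq[R])] = {
  val buffers = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[R]]
  rows.foreach { row =>
    buffers.getOrElseUpdate(key(row), mutable.ArrayBuffer.empty[R]) += row
  }
  buffers.toSeq.map { case (k, v) => (k, v.toSeq) }
}
```

The trade-off versus the external sorter is memory: buffering all rows per open partition in heap maps avoids the re-sort but loses the sorter's ability to spill to disk.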
[jira] [Updated] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] coneyliu updated SPARK-18420: - Component/s: Build > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: coneyliu >Priority: Minor > > Small fix: fix the compile errors caused by checkstyle. > Before: > ``` > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] > (imports) UnusedImports: Unused import - > org.apache.commons.crypto.cipher.CryptoCipherFactory. > [ERROR] > src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] > (modifier) RedundantModifier: Redundant 'public' modifier. > [ERROR] > src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] > (sizes) LineLength: Line is longer than 100 characters (found 113). > [ERROR] > src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] > (sizes) LineLength: Line is longer than 100 characters (found 110). > src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] > (sizes) LineLength: Line is longer than 100 characters (found 103). > [ERROR] > src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] > (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors. > [ERROR] > src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] > (regexp) RegexpSingleline: No trailing whitespace allowed. > ``` > After: > `mvn install` > `lint-java` > Checkstyle checks passed
[jira] [Updated] (SPARK-18420) Fix the compile errors caused by checkstyle
[ https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] coneyliu updated SPARK-18420: - Description: Small fix, fix the compile errors caused by checkstyle. Before: ``` Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory. [ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier. [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113). [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110). src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103). [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors. [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed. ``` After: `mvn install` `lint-java` Checkstyle checks passed > Fix the compile errors caused by checkstyle > --- > > Key: SPARK-18420 > URL: https://issues.apache.org/jira/browse/SPARK-18420 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: coneyliu >Priority: Minor > > Small fix, fix the compile errors caused by checkstyle. 
> Before: > ``` > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] > (imports) UnusedImports: Unused import - > org.apache.commons.crypto.cipher.CryptoCipherFactory. > [ERROR] > src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] > (modifier) RedundantModifier: Redundant 'public' modifier. > [ERROR] > src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] > (sizes) LineLength: Line is longer than 100 characters (found 113). > [ERROR] > src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] > (sizes) LineLength: Line is longer than 100 characters (found 110). > src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] > (sizes) LineLength: Line is longer than 100 characters (found 103). > [ERROR] > src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] > (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors. > [ERROR] > src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] > (regexp) RegexpSingleline: No trailing whitespace allowed. > ``` > After: > `mvn install` > `lint-java` > Checkstyle checks passed
[jira] [Commented] (SPARK-18421) Dynamic disk allocation
[ https://issues.apache.org/jira/browse/SPARK-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661520#comment-15661520 ] Sean Owen commented on SPARK-18421: --- Spark doesn't manage storage at all. I don't think this could be in scope, therefore, especially because it could only apply to the cloud and would be cloud-specific. > Dynamic disk allocation > --- > > Key: SPARK-18421 > URL: https://issues.apache.org/jira/browse/SPARK-18421 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aniket Bhatnagar >Priority: Minor > > Dynamic allocation feature allows you to add executors and scale computation > power. This is great, however, I feel like we also need a way to dynamically > scale storage. Currently, if the disk is not able to hold the spilled/shuffle > data, the job is aborted (in yarn, the node manager kills the container) > causing frustration and loss of time. In deployments like AWS EMR, it is > possible to run an agent that adds disks on the fly if it sees that the disks > are running out of space, and it would be great if Spark could immediately > start using the added disks just as it does when new executors are added.
[jira] [Reopened] (SPARK-18363) Connected component for large graph result is wrong
[ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-18363: --- > Connected component for large graph result is wrong > --- > > Key: SPARK-18363 > URL: https://issues.apache.org/jira/browse/SPARK-18363 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.1 >Reporter: Philip Adetiloye > > The clustering done by the GraphX connected component doesn't seem to work > correctly with a large number of nodes. > It only works correctly on a small graph.
[jira] [Resolved] (SPARK-18363) Connected component for large graph result is wrong
[ https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18363. --- Resolution: Not A Problem > Connected component for large graph result is wrong > --- > > Key: SPARK-18363 > URL: https://issues.apache.org/jira/browse/SPARK-18363 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.1 >Reporter: Philip Adetiloye > > The clustering done by the GraphX connected component doesn't seem to work > correctly with a large number of nodes. > It only works correctly on a small graph.