[jira] [Updated] (SPARK-20152) Time zone is not respected while parsing csv for timeStampFormat "MM-dd-yyyy'T'HH:mm:ss.SSSZZ"
[ https://issues.apache.org/jira/browse/SPARK-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Navya Krishnappa updated SPARK-20152: - Description: When reading the below mentioned time value by specifying the "timestampFormat": "MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored.

Source File:
TimeColumn
03-21-2017T03:30:02Z

Source code 1:
Dataset dataset = getSqlContext().read()
    .option(DAWBConstant.PARSER_LIB, "commons")
    .option(INFER_SCHEMA, "true")
    .option(DAWBConstant.DELIMITER, ",")
    .option(DAWBConstant.QUOTE, "\"")
    .option(DAWBConstant.ESCAPE, "\\")
    .option("timestampFormat", "MM-dd-yyyy'T'HH:mm:ss.SSSZZ")
    .option(DAWBConstant.MODE, Mode.PERMISSIVE)
    .csv(sourceFile);

Result: TimeColumn [StringType] with value "03-21-2017T03:30:02Z"; the expected result is that TimeColumn should be of TimestampType and the time zone should be considered for manipulation.

Source code 2:
Dataset dataset = getSqlContext().read()
    .option(DAWBConstant.PARSER_LIB, "commons")
    .option(INFER_SCHEMA, "true")
    .option(DAWBConstant.DELIMITER, ",")
    .option(DAWBConstant.QUOTE, "\"")
    .option(DAWBConstant.ESCAPE, "\\")
    .option("timestampFormat", "MM-dd-yyyy'T'HH:mm:ss")
    .option(DAWBConstant.MODE, Mode.PERMISSIVE)
    .csv(sourceFile);

Result: TimeColumn [TimestampType] with value "2017-04-22 03:30:02.0"; the expected result is that TimeColumn should consider the time zone for manipulation.

was: When reading the below mentioned time value by specifying the "timestampFormat": "MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored.
Sample data: TimeColumn 03-21-2017T03:30:02Z Result: TimeColumn [ StringType] and value is "03-21-2017T03:30:02Z" Expected Result: TimeColumn should be of "TimestampType" > Time zone is not respected while parsing csv for timeStampFormat > "MM-dd-yyyy'T'HH:mm:ss.SSSZZ" > -- > > Key: SPARK-20152 > URL: https://issues.apache.org/jira/browse/SPARK-20152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Navya Krishnappa > > When reading the below mentioned time value by specifying the > "timestampFormat": "MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored. > Source File: > TimeColumn > 03-21-2017T03:30:02Z > Source code1: > Dataset dataset = getSqlContext().read() > .option(DAWBConstant.PARSER_LIB, "commons") > .option(INFER_SCHEMA, "true") > .option(DAWBConstant.DELIMITER, ",") > .option(DAWBConstant.QUOTE, "\"") > .option(DAWBConstant.ESCAPE, "\\") > .option("timestampFormat", "MM-dd-yyyy'T'HH:mm:ss.SSSZZ") > .option(DAWBConstant.MODE, Mode.PERMISSIVE) > .csv(sourceFile); > Result: TimeColumn [ StringType] and value is "03-21-2017T03:30:02Z", but > the expected result is that TimeColumn should be of "TimestampType" and should > consider the time zone for manipulation > Source code2: > Dataset dataset = getSqlContext().read() > .option(DAWBConstant.PARSER_LIB, "commons") > .option(INFER_SCHEMA, "true") > .option(DAWBConstant.DELIMITER, ",") > .option(DAWBConstant.QUOTE, "\"") > .option(DAWBConstant.ESCAPE, "\\") > .option("timestampFormat", "MM-dd-yyyy'T'HH:mm:ss") > .option(DAWBConstant.MODE, Mode.PERMISSIVE) > .csv(sourceFile); > Result: TimeColumn [ TimestampType] and value is "2017-04-22 03:30:02.0", but > the expected result is that TimeColumn should consider the time zone for manipulation -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
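As an editorial aside to the report above: the expected zone handling can be illustrated outside Spark. A minimal sketch with Python's standard library (not Spark's CSV parser; Python's pattern letters differ from Java's), showing that a format which consumes the trailing zone designator yields a zone-aware timestamp:

```python
from datetime import datetime

# Parse the sample value from the report with a zone-aware pattern.
# %z accepts the literal "Z" designator (Python 3.7+), playing the role
# of the "ZZ" section in the Spark timestampFormat under discussion.
ts = datetime.strptime("03-21-2017T03:30:02Z", "%m-%d-%YT%H:%M:%S%z")

# The zone is respected: the parsed value is offset-aware UTC.
assert ts.tzinfo is not None
assert ts.utcoffset().total_seconds() == 0
```

Dropping the `%z` (like dropping `.SSSZZ` in "Source code2") leaves the `Z` unconsumed, which is exactly the situation where a parser either rejects the value or silently ignores the zone.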
[jira] [Created] (SPARK-20152) Time zone is not respected while parsing csv for timeStampFormat "MM-dd-yyyy'T'HH:mm:ss.SSSZZ"
Navya Krishnappa created SPARK-20152: Summary: Time zone is not respected while parsing csv for timeStampFormat "MM-dd-yyyy'T'HH:mm:ss.SSSZZ" Key: SPARK-20152 URL: https://issues.apache.org/jira/browse/SPARK-20152 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Navya Krishnappa When reading the below mentioned time value by specifying the "timestampFormat": "MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored. Sample data: TimeColumn 03-21-2017T03:30:02Z Result: TimeColumn [ StringType] and value is "03-21-2017T03:30:02Z" Expected Result: TimeColumn should be of "TimestampType" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20151) Account for partition pruning in scan metadataTime metrics
[ https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20151: Assignee: Reynold Xin (was: Apache Spark) > Account for partition pruning in scan metadataTime metrics > -- > > Key: SPARK-20151 > URL: https://issues.apache.org/jira/browse/SPARK-20151 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > After SPARK-20136, we report metadata timing metrics in the scan operator. > However, that timing metric doesn't include one of the most important parts of > the metadata work, which is partition pruning. This patch adds that time measurement > to the scan metrics. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20151) Account for partition pruning in scan metadataTime metrics
[ https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948422#comment-15948422 ] Apache Spark commented on SPARK-20151: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17476 > Account for partition pruning in scan metadataTime metrics > -- > > Key: SPARK-20151 > URL: https://issues.apache.org/jira/browse/SPARK-20151 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > After SPARK-20136, we report metadata timing metrics in the scan operator. > However, that timing metric doesn't include one of the most important parts of > the metadata work, which is partition pruning. This patch adds that time measurement > to the scan metrics. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20151) Account for partition pruning in scan metadataTime metrics
[ https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20151: Assignee: Apache Spark (was: Reynold Xin) > Account for partition pruning in scan metadataTime metrics > -- > > Key: SPARK-20151 > URL: https://issues.apache.org/jira/browse/SPARK-20151 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > > After SPARK-20136, we report metadata timing metrics in the scan operator. > However, that timing metric doesn't include one of the most important parts of > the metadata work, which is partition pruning. This patch adds that time measurement > to the scan metrics. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20151) Account for partition pruning in scan metadataTime metrics
[ https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-20151: Summary: Account for partition pruning in scan metadataTime metrics (was: Take partition pruning timing into account in scan metadataTime metrics) > Account for partition pruning in scan metadataTime metrics > -- > > Key: SPARK-20151 > URL: https://issues.apache.org/jira/browse/SPARK-20151 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > After SPARK-20136, we report metadata timing metrics in the scan operator. > However, that timing metric doesn't include one of the most important parts of > the metadata work, which is partition pruning. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20151) Take partition pruning timing into account in scan metadataTime metrics
Reynold Xin created SPARK-20151: --- Summary: Take partition pruning timing into account in scan metadataTime metrics Key: SPARK-20151 URL: https://issues.apache.org/jira/browse/SPARK-20151 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin After SPARK-20136, we report metadata timing metrics in the scan operator. However, that timing metric doesn't include one of the most important parts of the metadata work, which is partition pruning. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20151) Account for partition pruning in scan metadataTime metrics
[ https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-20151: Description: After SPARK-20136, we report metadata timing metrics in the scan operator. However, that timing metric doesn't include one of the most important parts of the metadata work, which is partition pruning. This patch adds that time measurement to the scan metrics. was: After SPARK-20136, we report metadata timing metrics in the scan operator. However, that timing metric doesn't include one of the most important parts of the metadata work, which is partition pruning. > Account for partition pruning in scan metadataTime metrics > -- > > Key: SPARK-20151 > URL: https://issues.apache.org/jira/browse/SPARK-20151 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > After SPARK-20136, we report metadata timing metrics in the scan operator. > However, that timing metric doesn't include one of the most important parts of > the metadata work, which is partition pruning. This patch adds that time measurement > to the scan metrics. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
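The change being tracked above — folding partition-pruning time into the same metadataTime metric as the rest of the metadata work — can be sketched generically. A toy Python illustration (the metric name comes from the issue; the timing helper and the fake "pruning" work are ours, not Spark's internals):

```python
import time

def timed(metrics, name, fn):
    # Accumulate the elapsed wall time of a metadata operation into a named metric.
    start = time.perf_counter()
    result = fn()
    metrics[name] += time.perf_counter() - start
    return result

metrics = {"metadataTime": 0.0}

# Before the change: only listing-style work was counted.
all_partitions = timed(metrics, "metadataTime", lambda: list(range(1000)))
# After the change: the pruning step is measured into the same metric.
pruned = timed(metrics, "metadataTime",
               lambda: [p for p in all_partitions if p % 10 == 0])

assert len(pruned) == 100
assert metrics["metadataTime"] > 0.0
```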
[jira] [Resolved] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages
[ https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20148. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.2.0 > Extend the file commit interface to allow subscribing to task commit messages > - > > Key: SPARK-20148 > URL: https://issues.apache.org/jira/browse/SPARK-20148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.2.0 > > > The internal FileCommitProtocol interface returns all task commit messages in > bulk to the implementation when a job finishes. However, it is sometimes > useful to access those messages before the job completes, so that the driver > gets incremental progress updates before the job finishes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
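The extension described above can be sketched as a plain observer pattern: keep the existing bulk result at job commit, but also let the driver subscribe to task commit messages as they arrive. A hypothetical Python sketch (class and method names are illustrative, not Spark's actual FileCommitProtocol API):

```python
class CommitProtocol:
    """Toy model: bulk messages at job end, plus incremental subscription."""

    def __init__(self):
        self._messages = []
        self._subscribers = []

    def subscribe(self, callback):
        # Driver registers a callback for incremental progress updates.
        self._subscribers.append(callback)

    def on_task_commit(self, message):
        self._messages.append(message)
        for cb in self._subscribers:
            cb(message)  # fires before the job finishes

    def commit_job(self):
        return list(self._messages)  # the existing bulk interface, unchanged

progress = []
p = CommitProtocol()
p.subscribe(progress.append)
p.on_task_commit("task-0 wrote part-00000")
p.on_task_commit("task-1 wrote part-00001")

assert progress == ["task-0 wrote part-00000", "task-1 wrote part-00001"]
assert p.commit_job() == progress
```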
[jira] [Updated] (SPARK-20150) Add permsize statistics for worker memory which may be very useful for the memory usage assessment
[ https://issues.apache.org/jira/browse/SPARK-20150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinhua Fu updated SPARK-20150: -- Summary: Add permsize statistics for worker memory which may be very useful for the memory usage assessment (was: Can the spark add a mechanism for permsize statistics which may be very useful for the memory usage assessment) > Add permsize statistics for worker memory which may be very useful for the > memory usage assessment > -- > > Key: SPARK-20150 > URL: https://issues.apache.org/jira/browse/SPARK-20150 > Project: Spark > Issue Type: Wish > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Jinhua Fu > > It seems worker memory is only assigned to the executor heap, which is usually not > enough for estimating the whole cluster's memory usage, especially when memory > becomes a bottleneck of the cluster. In many cases, we found an executor's real > memory usage was much larger than its heap size, which forced us to check > every application's real memory expenditure. > This can be improved by adding a mechanism for Non-Heap (permsize) > statistics, shown only as extra memory usage, which has no effect on the > current worker memory allocation and statistics. The permsize can be obtained > easily from the executor Java options. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
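The last point in the wish above — deriving the perm size from the executor Java options — can be sketched directly. A minimal Python helper (`-XX:MaxPermSize` is the real HotSpot flag; the parsing function itself is our illustration, not Spark code):

```python
import re

def perm_size(java_opts):
    """Extract -XX:MaxPermSize from a Java options string, in bytes."""
    m = re.search(r"-XX:MaxPermSize=(\d+)([kKmMgG]?)", java_opts)
    if not m:
        return None  # option absent: the JVM default would apply
    n, unit = int(m.group(1)), m.group(2).lower()
    return n * {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}[unit]

assert perm_size("-Xmx4g -XX:MaxPermSize=256m") == 268435456
assert perm_size("-Xmx4g") is None
```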
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; it's not backwards compatible with earlier versions
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948335#comment-15948335 ] Chico Qi commented on SPARK-14492: -- I had the same issue when I upgraded to Spark 2.1.0; my Hive version is 1.1.0-cdh5.7.0.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/03/30 16:07:30 WARN spark.SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
  at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
  at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
  at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
  at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
  at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
  ... 63 more
Caused by: java.lang.reflect.InvocationTargetException: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
  ... 71 more
Caused by: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
  at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:194)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:269)
  at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:65)
  ... 76 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^

> Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; it's not > backwards compatible with earlier versions > --- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this
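For readers hitting the same NoSuchFieldError, the direction usually suggested is to point Spark at a metastore client matching the deployed Hive version via its documented spark.sql.hive.metastore.* options; a hedged sketch only — whether the specific version value is accepted by a given Spark release must be checked against that release's documentation:

```shell
# Illustrative invocation, not a verified fix: the config keys are Spark's
# documented spark.sql.hive.metastore.* options; the version value below is
# the commenter's Hive version and may not be supported by every release.
spark-shell \
  --conf spark.sql.hive.metastore.version=1.1.0 \
  --conf spark.sql.hive.metastore.jars=maven
```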
[jira] [Created] (SPARK-20150) Can the spark add a mechanism for permsize statistics which may be very useful for the memory usage assessment
Jinhua Fu created SPARK-20150: - Summary: Can the spark add a mechanism for permsize statistics which may be very useful for the memory usage assessment Key: SPARK-20150 URL: https://issues.apache.org/jira/browse/SPARK-20150 Project: Spark Issue Type: Wish Components: Web UI Affects Versions: 2.0.2 Reporter: Jinhua Fu It seems worker memory is only assigned to the executor heap, which is usually not enough for estimating the whole cluster's memory usage, especially when memory becomes a bottleneck of the cluster. In many cases, we found an executor's real memory usage was much larger than its heap size, which forced us to check every application's real memory expenditure. This can be improved by adding a mechanism for Non-Heap (permsize) statistics, shown only as extra memory usage, which has no effect on the current worker memory allocation and statistics. The permsize can be obtained easily from the executor Java options. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20136) Add num files and metadata operation timing to scan metrics
[ https://issues.apache.org/jira/browse/SPARK-20136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20136. - Resolution: Fixed Fix Version/s: 2.2.0 > Add num files and metadata operation timing to scan metrics > --- > > Key: SPARK-20136 > URL: https://issues.apache.org/jira/browse/SPARK-20136 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > We currently do not explicitly include metadata operation timing and the number > of files in data source metrics. Those would be useful to include for > performance profiling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema
[ https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20146. - Resolution: Fixed Assignee: Bo Meng Fix Version/s: 2.2.0 > Column comment information is missing for Thrift Server's TableSchema > - > > Key: SPARK-20146 > URL: https://issues.apache.org/jira/browse/SPARK-20146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bo Meng >Assignee: Bo Meng >Priority: Minor > Fix For: 2.2.0 > > > I found this issue while doing some tests against Thrift Server. > The column comment information was missing while querying the TableSchema. > Currently, all the comments are ignored. > I will post a fix shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node
[ https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-20104: - Issue Type: Sub-task (was: Bug) Parent: SPARK-16026 > Don't estimate IsNull or IsNotNull predicates for non-leaf node > --- > > Key: SPARK-20104 > URL: https://issues.apache.org/jira/browse/SPARK-20104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > > At the current stage, we don't have advanced statistics such as sketches or > histograms. As a result, some operators can't estimate `nullCount` accurately. > E.g. left outer join estimation does not accurately update `nullCount` > currently. So for IsNull and IsNotNull predicates, we only estimate them when > the child is a leaf node, whose `nullCount` is accurate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
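The rule described above — trust `nullCount` only when the child is a leaf node — can be sketched with a toy model (illustrative Python, not Spark's estimation code):

```python
class Node:
    """Toy plan node: leaves carry an accurate nullCount, non-leaves may not."""
    def __init__(self, null_count=None, children=()):
        self.null_count = null_count
        self.children = children

    def is_leaf(self):
        return not self.children

def estimate_is_null(node, row_count):
    # Only estimate the IsNull selectivity when nullCount is trustworthy,
    # i.e. when the child is a leaf; otherwise skip the estimation entirely.
    if node.is_leaf() and node.null_count is not None:
        return node.null_count / row_count
    return None

leaf = Node(null_count=20)
joined = Node(children=(leaf,))  # e.g. output of a left outer join

assert estimate_is_null(leaf, 100) == 0.2
assert estimate_is_null(joined, 100) is None
```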
[jira] [Commented] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder
[ https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948248#comment-15948248 ] Hyukjin Kwon commented on SPARK-18692: -- Thank you for asking this. Let me give a shot after testing/double-checking it. > Test Java 8 unidoc build on Jenkins master builder > -- > > Key: SPARK-18692 > URL: https://issues.apache.org/jira/browse/SPARK-18692 > Project: Spark > Issue Type: Test > Components: Build, Documentation >Reporter: Joseph K. Bradley > Labels: jenkins > > [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break. It > would be great to add this build to the Spark master builder on Jenkins to > make it easier to identify PRs which break doc builds. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15427) Spark SQL doesn't support field case sensitive when load data use Phoenix
[ https://issues.apache.org/jira/browse/SPARK-15427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15427. -- Resolution: Not A Problem {{SELECT * FROM $table WHERE 1=0}} now seems changeable via a dialect after SPARK-17614, so I am resolving this. Please reopen it if I misunderstood. I am also resolving this because the related code path seems to have changed radically since then. > Spark SQL doesn't support field case sensitive when load data use Phoenix > - > > Key: SPARK-15427 > URL: https://issues.apache.org/jira/browse/SPARK-15427 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: deng > Labels: easyfix, features, newbie > > I use sparkSql to load data from Apache Phoenix. > SQLContext sqlContext = new SQLContext(sc); > Map<String, String> options = new HashMap<String, String>(); > options.put("driver", driver); > options.put("url", PhoenixUtil.p.getProperty("phoenixURL")); > options.put("dbtable", "(select "value","name" from "user")"); > DataFrame jdbcDF = sqlContext.load("jdbc", options); > It always throws an exception, like "can't find field VALUE". > I tracked the code and found Spark will use: > val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE > 1=0").executeQuery() > to get the fields. But the field name has already been uppercased, like "value" to VALUE. > So it will always throw "can't find field VALUE"; > It didn't consider the case when data is loaded from a source in which fields > are case sensitive. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
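The quoting behavior at issue can be modeled abstractly. A toy Python sketch (helper names are ours, not Spark's JdbcDialect API): stores like Phoenix fold unquoted identifiers to upper case, so a dialect that quotes identifiers in its probe query preserves the original field names:

```python
def resolve_identifier(name):
    """Toy model of a case-sensitive store's identifier resolution."""
    if name.startswith('"') and name.endswith('"'):
        return name[1:-1]   # quoted: case preserved verbatim
    return name.upper()     # unquoted: folded, e.g. value -> VALUE

def quote_identifier(name):
    return '"' + name + '"'

# An unquoted probe loses the original lower-case field name...
assert resolve_identifier("value") == "VALUE"
# ...while quoting keeps it intact, avoiding "can't find field VALUE".
assert resolve_identifier(quote_identifier("value")) == "value"
```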
[jira] [Created] (SPARK-20149) Audit PySpark code base for 2.6 specific work arounds
holdenk created SPARK-20149: --- Summary: Audit PySpark code base for 2.6 specific work arounds Key: SPARK-20149 URL: https://issues.apache.org/jira/browse/SPARK-20149 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.2.0 Reporter: holdenk We should determine what the areas in PySpark are that have specific 2.6 work arounds and create issues for them. The audit can be started during 2.2.0, but cleaning up all the 2.6 specific code is likely too much to try and get in so the actual fixing should probably be considered for 2.2.1 or 2.3 (unless 2.2.0 is delayed). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
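One concrete shape such an audit would look for, as an illustration (a hypothetical example, not taken from the PySpark code base): dict comprehensions only arrived in Python 2.7, so 2.6-compatible code spells them via the dict() constructor over a generator:

```python
pairs = [("a", 1), ("b", 2)]

# 2.6-compatible spelling: dict() over a generator expression.
legacy = dict((k, v * 2) for k, v in pairs)

# 2.7+ replacement once the 2.6 workaround is removed.
modern = {k: v * 2 for k, v in pairs}

assert legacy == modern == {"a": 2, "b": 4}
```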
[jira] [Commented] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948142#comment-15948142 ] Joseph K. Bradley commented on SPARK-14657: --- I'm going to remove the target version, but please retarget if we can reactivate this. > RFormula output wrong features when formula w/o intercept > - > > Key: SPARK-14657 > URL: https://issues.apache.org/jira/browse/SPARK-14657 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > SparkR::glm outputs different features compared with R glm when fit w/o > intercept and having string/category features. Take the following example: > SparkR outputs three features compared with four features for native R. > SparkR::glm > {quote} > training <- suppressWarnings(createDataFrame(sqlContext, iris)) > model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training) > summary(model) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal_Length 0.67468 0.0093013 72.536 0 > Species_versicolor -1.2349 0.07269 -16.989 0 > Species_virginica -1.4708 0.077397 -19.003 0 > {quote} > stats::glm > {quote} > summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris)) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal.Length 0.3499 0.0463 7.557 4.19e-12 *** > Speciessetosa 1.6765 0.2354 7.123 4.46e-11 *** > Speciesversicolor 0.6931 0.2779 2.494 0.0137 * > Speciesvirginica 0.6690 0.3078 2.174 0.0313 * > {quote} > The encoder for string/category features is different. R did not drop any > category but SparkR dropped one. > I searched online and tested some other cases, and found that when we fit an R glm model (or > other models powered by R formula) w/o intercept on a dataset including > string/category features, one of the categories in the first category feature > is used as the reference category, and we do not drop any category for that > feature. 
> I think we should keep consistent semantics between Spark RFormula and R > formula. > cc [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
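The encoding difference described in the issue above can be illustrated with a toy one-hot encoder (illustrative Python, not RFormula's implementation): when one level serves as the reference category it gets no column of its own, whereas R fitting without an intercept keeps a column for every level of the first factor:

```python
def one_hot(levels, value, reference=None):
    # Encode `value` over the category levels, optionally dropping one
    # level as the reference category (which then gets no column).
    cols = [c for c in levels if c != reference]
    return [1 if value == c else 0 for c in cols]

levels = ["setosa", "versicolor", "virginica"]

# SparkR-style: the reference level is represented only implicitly...
assert one_hot(levels, "setosa", reference="setosa") == [0, 0]
# ...R-style without an intercept: every level keeps its own column.
assert one_hot(levels, "setosa", reference=None) == [1, 0, 0]
```

This is why the report shows three SparkR features against four from native R for the same formula.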
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14657: -- Target Version/s: (was: 2.2.0) > RFormula output wrong features when formula w/o intercept > - > > Key: SPARK-14657 > URL: https://issues.apache.org/jira/browse/SPARK-14657 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > SparkR::glm outputs different features compared with R glm when fit w/o > intercept and having string/category features. Take the following example: > SparkR outputs three features compared with four features for native R. > SparkR::glm > {quote} > training <- suppressWarnings(createDataFrame(sqlContext, iris)) > model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training) > summary(model) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal_Length 0.67468 0.0093013 72.536 0 > Species_versicolor -1.2349 0.07269 -16.989 0 > Species_virginica -1.4708 0.077397 -19.003 0 > {quote} > stats::glm > {quote} > summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris)) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal.Length 0.3499 0.0463 7.557 4.19e-12 *** > Speciessetosa 1.6765 0.2354 7.123 4.46e-11 *** > Speciesversicolor 0.6931 0.2779 2.494 0.0137 * > Speciesvirginica 0.6690 0.3078 2.174 0.0313 * > {quote} > The encoder for string/category features is different. R did not drop any > category but SparkR dropped one. > I searched online and tested some other cases, and found that when we fit an R glm model (or > other models powered by R formula) w/o intercept on a dataset including > string/category features, one of the categories in the first category feature > is used as the reference category, and we do not drop any category for that > feature. > I think we should keep consistent semantics between Spark RFormula and R > formula. 
> cc [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14657: -- Shepherd: (was: Xiangrui Meng) > RFormula output wrong features when formula w/o intercept > - > > Key: SPARK-14657 > URL: https://issues.apache.org/jira/browse/SPARK-14657 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > SparkR::glm outputs different features compared with R glm when fit w/o > intercept and having string/category features. Take the following example: > SparkR outputs three features compared with four features for native R. > SparkR::glm > {quote} > training <- suppressWarnings(createDataFrame(sqlContext, iris)) > model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training) > summary(model) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal_Length 0.67468 0.0093013 72.536 0 > Species_versicolor -1.2349 0.07269 -16.989 0 > Species_virginica -1.4708 0.077397 -19.003 0 > {quote} > stats::glm > {quote} > summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris)) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal.Length 0.3499 0.0463 7.557 4.19e-12 *** > Speciessetosa 1.6765 0.2354 7.123 4.46e-11 *** > Speciesversicolor 0.6931 0.2779 2.494 0.0137 * > Speciesvirginica 0.6690 0.3078 2.174 0.0313 * > {quote} > The encoder for string/category features is different. R did not drop any > category but SparkR dropped one. > I searched online and tested some other cases, and found that when we fit an R glm model (or > other models powered by R formula) w/o intercept on a dataset including > string/category features, one of the categories in the first category feature > is used as the reference category, and we do not drop any category for that > feature. > I think we should keep consistent semantics between Spark RFormula and R > formula. 
> cc [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
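The encoding mismatch described above comes down to two dummy-coding conventions. A minimal Python sketch (illustrative only, not SparkR's or R's actual implementation) of dropping a reference level versus keeping every level of the factor:

```python
def dummy_code(levels, drop_reference):
    """Map each categorical level to a 0/1 indicator vector.

    drop_reference=True mimics the usual fit with an intercept, where the
    first level becomes the implicit reference category (all-zero vector).
    drop_reference=False mimics R's no-intercept fit, where every level of
    the first factor keeps its own column, so one more coefficient appears.
    """
    kept = levels[1:] if drop_reference else levels
    return {lvl: [1 if lvl == k else 0 for k in kept] for lvl in levels}

species = ["setosa", "versicolor", "virginica"]

# With a reference level: 2 columns, "setosa" encodes to all zeros.
with_ref = dummy_code(species, drop_reference=True)
# Without dropping: 3 columns, matching R's 4 coefficients (1 numeric + 3 levels).
without_ref = dummy_code(species, drop_reference=False)

print(with_ref["setosa"])     # [0, 0]
print(without_ref["setosa"])  # [1, 0, 0]
```

This is why native R reports a `Speciessetosa` coefficient in the no-intercept fit while SparkR (which always drops a level) reports one fewer feature.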
[jira] [Commented] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages
[ https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948085#comment-15948085 ] Apache Spark commented on SPARK-20148: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/17475 > Extend the file commit interface to allow subscribing to task commit messages > - > > Key: SPARK-20148 > URL: https://issues.apache.org/jira/browse/SPARK-20148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Eric Liang >Priority: Minor > > The internal FileCommitProtocol interface returns all task commit messages in > bulk to the implementation when a job finishes. However, it is sometimes > useful to access those messages before the job completes, so that the driver > gets incremental progress updates before the job finishes.
[jira] [Assigned] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages
[ https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20148: Assignee: (was: Apache Spark) > Extend the file commit interface to allow subscribing to task commit messages > - > > Key: SPARK-20148 > URL: https://issues.apache.org/jira/browse/SPARK-20148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Eric Liang >Priority: Minor > > The internal FileCommitProtocol interface returns all task commit messages in > bulk to the implementation when a job finishes. However, it is sometimes > useful to access those messages before the job completes, so that the driver > gets incremental progress updates before the job finishes.
[jira] [Assigned] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages
[ https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20148: Assignee: Apache Spark > Extend the file commit interface to allow subscribing to task commit messages > - > > Key: SPARK-20148 > URL: https://issues.apache.org/jira/browse/SPARK-20148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > > The internal FileCommitProtocol interface returns all task commit messages in > bulk to the implementation when a job finishes. However, it is sometimes > useful to access those messages before the job completes, so that the driver > gets incremental progress updates before the job finishes.
[jira] [Created] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages
Eric Liang created SPARK-20148: -- Summary: Extend the file commit interface to allow subscribing to task commit messages Key: SPARK-20148 URL: https://issues.apache.org/jira/browse/SPARK-20148 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Eric Liang Priority: Minor The internal FileCommitProtocol interface returns all task commit messages in bulk to the implementation when a job finishes. However, it is sometimes useful to access those messages before the job completes, so that the driver gets incremental progress updates before the job finishes.
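The proposed change can be pictured with a small observer sketch (plain Python, names hypothetical; this is not the actual FileCommitProtocol API): instead of handing back all task commit messages in bulk when the job finishes, the driver notifies a subscriber as each task commits.

```python
class CommitProtocol:
    """Toy stand-in for a file commit protocol (hypothetical API)."""

    def __init__(self, on_task_commit=None):
        self.messages = []
        self.on_task_commit = on_task_commit  # optional subscriber callback

    def task_committed(self, message):
        self.messages.append(message)
        if self.on_task_commit is not None:
            # Incremental notification: the subscriber sees the message now,
            # before the job completes.
            self.on_task_commit(message)

    def job_finished(self):
        # The existing behavior: all messages handed over in bulk at the end.
        return list(self.messages)


progress = []
proto = CommitProtocol(on_task_commit=progress.append)
for task_id in range(3):
    proto.task_committed(f"task-{task_id} committed")

print(progress)              # driver saw each message as it arrived
print(proto.job_finished())  # bulk hand-off still available at job end
```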
[jira] [Updated] (SPARK-18958) SparkR should support toJSON on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-18958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18958: -- Fix Version/s: 2.2.0 > SparkR should support toJSON on DataFrame > - > > Key: SPARK-18958 > URL: https://issues.apache.org/jira/browse/SPARK-18958 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.2.0 > > > It makes it easier to interoperate with other components (esp. since R does not > have JSON support built in)
[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
[ https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3723: - Target Version/s: (was: 2.2.0) > DecisionTree, RandomForest: Add more instrumentation > > > Key: SPARK-3723 > URL: https://issues.apache.org/jira/browse/SPARK-3723 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Some simple instrumentation would help advanced users understand performance > and check whether parameters (such as maxMemoryInMB) need to be tuned. > Most important instrumentation (simple): > * min, avg, max nodes per group > * number of groups (passes over data) > More advanced instrumentation: > * For each tree (or averaged over trees), training set accuracy after > training each level. This would be useful for visualizing learning behavior > (to convince oneself that model selection was being done correctly).
[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
[ https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3723: - Component/s: (was: MLlib) ML > DecisionTree, RandomForest: Add more instrumentation > > > Key: SPARK-3723 > URL: https://issues.apache.org/jira/browse/SPARK-3723 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Some simple instrumentation would help advanced users understand performance > and check whether parameters (such as maxMemoryInMB) need to be tuned. > Most important instrumentation (simple): > * min, avg, max nodes per group > * number of groups (passes over data) > More advanced instrumentation: > * For each tree (or averaged over trees), training set accuracy after > training each level. This would be useful for visualizing learning behavior > (to convince oneself that model selection was being done correctly).
[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
[ https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3723: - Shepherd: Joseph K. Bradley > DecisionTree, RandomForest: Add more instrumentation > > > Key: SPARK-3723 > URL: https://issues.apache.org/jira/browse/SPARK-3723 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Some simple instrumentation would help advanced users understand performance > and check whether parameters (such as maxMemoryInMB) need to be tuned. > Most important instrumentation (simple): > * min, avg, max nodes per group > * number of groups (passes over data) > More advanced instrumentation: > * For each tree (or averaged over trees), training set accuracy after > training each level. This would be useful for visualizing learning behavior > (to convince oneself that model selection was being done correctly).
[jira] [Commented] (SPARK-18570) Consider supporting other R formula operators
[ https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948056#comment-15948056 ] Joseph K. Bradley commented on SPARK-18570: --- Is this still targeted for 2.2, or shall we retarget it? > Consider supporting other R formula operators > - > > Key: SPARK-18570 > URL: https://issues.apache.org/jira/browse/SPARK-18570 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung >Priority: Minor > > Such as > {code} > ∗ > X∗Y include these variables and the interactions between them > ^ > (X + Z + W)^3 include these variables and all interactions up to three-way > | > X | Z conditioning: include x given z > {code} > Others include %in% and ` (backtick) > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
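As a rough illustration of what the `^` operator above expands to, here is a small Python sketch (not R's actual formula parser) that enumerates `(X + Z + W)^3`: all main effects plus interactions up to three-way, written with R's `:` interaction notation.

```python
from itertools import combinations

def expand_power(variables, degree):
    """Expand (v1 + v2 + ...)^degree into interaction terms up to `degree`-way.

    Mirrors the semantics described for R's formula operator `^`: k-way
    interaction terms for every k from 1 (main effects) up to `degree`.
    """
    terms = []
    for k in range(1, degree + 1):
        for combo in combinations(variables, k):
            terms.append(":".join(combo))
    return terms

# Main effects X, Z, W; two-way X:Z, X:W, Z:W; three-way X:Z:W.
print(expand_power(["X", "Z", "W"], 3))
```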
[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948055#comment-15948055 ] Joseph K. Bradley commented on SPARK-3181: -- Is this still active, and should it be targeted at 2.2? > Add Robust Regression Algorithm with Huber Estimator > > > Key: SPARK-3181 > URL: https://issues.apache.org/jira/browse/SPARK-3181 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Fan Jiang >Assignee: Yanbo Liang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least squares estimates assume the errors are normally distributed and > can behave badly when the errors are heavy-tailed. In practice we encounter > various types of data. We need to include robust regression to employ a > fitting criterion that is not as vulnerable as least squares. > In 1973, Huber introduced M-estimation for regression, which stands for > "maximum likelihood type". The method is resistant to outliers in the > response variable and has been widely used. > The new feature for MLlib will contain 3 new files > /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala > /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala > /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala > and one new class HuberRobustGradient in > /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
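The Huber criterion mentioned above is quadratic for small residuals and linear for large ones, which is what makes it less vulnerable to outliers than least squares. A minimal Python version (the delta value is the conventional tuning constant, chosen here for illustration):

```python
def huber_loss(residual, delta=1.345):
    """Huber's M-estimation loss: quadratic near zero, linear in the tails.

    delta=1.345 is the commonly cited tuning constant giving ~95%
    efficiency under normal errors; any positive delta works.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a          # least-squares behavior for small residuals
    return delta * (a - 0.5 * delta)  # linear growth for outliers

# Small residuals are penalized exactly like least squares...
print(huber_loss(0.5))  # 0.125
# ...but a large outlier grows only linearly, not quadratically.
print(huber_loss(10.0), 0.5 * 10.0 ** 2)
```

The two branches meet continuously at `|residual| == delta`, so the loss is smooth enough for gradient-based optimization, which is why it can be plugged into a gradient class as the issue proposes.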
[jira] [Updated] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector
[ https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14659: -- Target Version/s: (was: 2.2.0) > OneHotEncoder support drop first category alphabetically in the encoded > vector > --- > > Key: SPARK-14659 > URL: https://issues.apache.org/jira/browse/SPARK-14659 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > R formula drops the first category alphabetically when encoding string/category > features. Spark RFormula uses OneHotEncoder to encode string/category features > into vectors, but only supports "dropLast" by string/category frequencies. > This will cause SparkR to produce different models than native R.
[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector
[ https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948049#comment-15948049 ] Joseph K. Bradley commented on SPARK-14659: --- [~actuaryzhang] I'm sorry I haven't had time to check on this; there have just been too many other things. I'll remove the target version until someone can shepherd it. > OneHotEncoder support drop first category alphabetically in the encoded > vector > --- > > Key: SPARK-14659 > URL: https://issues.apache.org/jira/browse/SPARK-14659 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > R formula drops the first category alphabetically when encoding string/category > features. Spark RFormula uses OneHotEncoder to encode string/category features > into vectors, but only supports "dropLast" by string/category frequencies. > This will cause SparkR to produce different models than native R.
[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18822: -- Target Version/s: (was: 2.2.0) > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML models, such as OneVsRest, are harder to represent in a single-call > R API. Having an advanced API or a Pipeline API like this could help expose > them to our users.
[jira] [Commented] (SPARK-18822) Support ML Pipeline in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948045#comment-15948045 ] Joseph K. Bradley commented on SPARK-18822: --- Since 2.2 will be cut soon (I presume), I'm going to untarget this. Felix, please retarget if you like. > Support ML Pipeline in SparkR > - > > Key: SPARK-18822 > URL: https://issues.apache.org/jira/browse/SPARK-18822 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Felix Cheung > > From Joseph Bradley: > " > Supporting Pipelines and advanced use cases: There really needs to be more > design discussion around SparkR. Felix Cheung would you be interested in > leading some discussion? I'm envisioning something similar to what was done a > while back for Pipelines in Scala/Java/Python, where we consider several use > cases of MLlib: fitting a single model, creating and tuning a complex > Pipeline, and working with multiple languages. That should help inform what > APIs should look like in Spark R. > " > Certain ML models, such as OneVsRest, are harder to represent in a single-call > R API. Having an advanced API or a Pipeline API like this could help expose > them to our users.
[jira] [Updated] (SPARK-20103) Spark structured streaming from kafka - last message processed again after resume from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20103: - Fix Version/s: 2.2.0 > Spark structured streaming from kafka - last message processed again after > resume from checkpoint > > > Key: SPARK-20103 > URL: https://issues.apache.org/jira/browse/SPARK-20103 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 > Environment: Linux, Spark 2.10 >Reporter: Rajesh Mutha > Labels: spark, streaming > Fix For: 2.2.0 > > > When the application starts after a failure or a graceful shutdown, it is > consistently processing the last message of the previous batch even though it > was already processed correctly without failure. > We are making sure database writes are idempotent using a Postgres 9.6 feature. > Is this the default behavior of Spark? I added a code snippet with 2 > streaming queries. One of the queries is idempotent; since query2 is not > idempotent, we are seeing duplicate entries in the table. 
> {code} > object StructuredStreaming { > def main(args: Array[String]): Unit = { > val db_url = > "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password" > val spark = SparkSession > .builder > .appName("StructuredKafkaReader") > .master("local[*]") > .getOrCreate() > spark.conf.set("spark.sql.streaming.checkpointLocation", > "/tmp/checkpoint_research/") > import spark.implicits._ > val server = "10.205.82.113:9092" > val topic = "checkpoint" > val subscribeType="subscribe" > val lines = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", server) > .option(subscribeType, topic) > .load().selectExpr("CAST(value AS STRING)").as[String] > lines.printSchema() > import org.apache.spark.sql.ForeachWriter > val writer = new ForeachWriter[String] { >def open(partitionId: Long, version: Long): Boolean = { > println("After db props"); true >} >def process(value: String) = { > val conn = DriverManager.getConnection(db_url) > try{ >conn.createStatement().executeUpdate("INSERT INTO > PUBLIC.checkpoint1 VALUES ('"+value+"')") > } > finally { >conn.close() > } > } >def close(errorOrNull: Throwable) = {} > } > import scala.concurrent.duration._ > val query1 = lines.writeStream > .outputMode("append") > .queryName("checkpoint1") > .trigger(ProcessingTime(30.seconds)) > .foreach(writer) > .start() > val writer2 = new ForeachWriter[String] { > def open(partitionId: Long, version: Long): Boolean = { > println("After db props"); true > } > def process(value: String) = { > val conn = DriverManager.getConnection(db_url) > try{ > conn.createStatement().executeUpdate("INSERT INTO > PUBLIC.checkpoint2 VALUES ('"+value+"')") > } > finally { > conn.close() > } >} > def close(errorOrNull: Throwable) = {} > } > import scala.concurrent.duration._ > val query2 = lines.writeStream > .outputMode("append") > .queryName("checkpoint2") > .trigger(ProcessingTime(30.seconds)) > .foreach(writer2) > .start() > 
query2.awaitTermination() > query1.awaitTermination() > }} > {code}
[jira] [Commented] (SPARK-20103) Spark structured streaming from kafka - last message processed again after resume from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948035#comment-15948035 ] Michael Armbrust commented on SPARK-20103: -- It is fixed in 2.2 but by [SPARK-19876]. > Spark structured streaming from kafka - last message processed again after > resume from checkpoint > > > Key: SPARK-20103 > URL: https://issues.apache.org/jira/browse/SPARK-20103 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 > Environment: Linux, Spark 2.10 >Reporter: Rajesh Mutha > Labels: spark, streaming > Fix For: 2.2.0 > > > When the application starts after a failure or a graceful shutdown, it is > consistently processing the last message of the previous batch even though it > was already processed correctly without failure. > We are making sure database writes are idempotent using a Postgres 9.6 feature. > Is this the default behavior of Spark? I added a code snippet with 2 > streaming queries. One of the queries is idempotent; since query2 is not > idempotent, we are seeing duplicate entries in the table. 
> {code} > object StructuredStreaming { > def main(args: Array[String]): Unit = { > val db_url = > "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password" > val spark = SparkSession > .builder > .appName("StructuredKafkaReader") > .master("local[*]") > .getOrCreate() > spark.conf.set("spark.sql.streaming.checkpointLocation", > "/tmp/checkpoint_research/") > import spark.implicits._ > val server = "10.205.82.113:9092" > val topic = "checkpoint" > val subscribeType="subscribe" > val lines = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", server) > .option(subscribeType, topic) > .load().selectExpr("CAST(value AS STRING)").as[String] > lines.printSchema() > import org.apache.spark.sql.ForeachWriter > val writer = new ForeachWriter[String] { >def open(partitionId: Long, version: Long): Boolean = { > println("After db props"); true >} >def process(value: String) = { > val conn = DriverManager.getConnection(db_url) > try{ >conn.createStatement().executeUpdate("INSERT INTO > PUBLIC.checkpoint1 VALUES ('"+value+"')") > } > finally { >conn.close() > } > } >def close(errorOrNull: Throwable) = {} > } > import scala.concurrent.duration._ > val query1 = lines.writeStream > .outputMode("append") > .queryName("checkpoint1") > .trigger(ProcessingTime(30.seconds)) > .foreach(writer) > .start() > val writer2 = new ForeachWriter[String] { > def open(partitionId: Long, version: Long): Boolean = { > println("After db props"); true > } > def process(value: String) = { > val conn = DriverManager.getConnection(db_url) > try{ > conn.createStatement().executeUpdate("INSERT INTO > PUBLIC.checkpoint2 VALUES ('"+value+"')") > } > finally { > conn.close() > } >} > def close(errorOrNull: Throwable) = {} > } > import scala.concurrent.duration._ > val query2 = lines.writeStream > .outputMode("append") > .queryName("checkpoint2") > .trigger(ProcessingTime(30.seconds)) > .foreach(writer2) > .start() > 
query2.awaitTermination() > query1.awaitTermination() > }} > {code}
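The duplicate-on-restart behavior above is exactly why streaming sinks are expected to be idempotent. A small Python sketch (an in-memory stand-in for the Postgres tables, not Spark's actual recovery logic) showing how keying each write by a stable identifier makes the reprocessed last message harmless, while plain appends produce the duplicates seen with query2:

```python
class IdempotentSink:
    """Stand-in for a table with a uniqueness constraint on the message key."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        # In the spirit of INSERT ... ON CONFLICT DO NOTHING: first write wins.
        self.rows.setdefault(key, value)


class AppendOnlySink:
    """Stand-in for a plain INSERT sink: replayed messages become duplicates."""

    def __init__(self):
        self.rows = []

    def write(self, key, value):
        self.rows.append((key, value))


batch = [("offset-7", "a"), ("offset-8", "b")]
# After a restart from the checkpoint, the last message is processed again.
replayed = [("offset-8", "b"), ("offset-9", "c")]

good, bad = IdempotentSink(), AppendOnlySink()
for sink in (good, bad):
    for key, value in batch + replayed:
        sink.write(key, value)

print(len(good.rows))  # 3 distinct keys: the replayed message is deduplicated
print(len(bad.rows))   # 4 writes: a duplicate entry, as observed with query2
```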
[jira] [Updated] (SPARK-20103) Spark structured streaming from kafka - last message processed again after resume from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20103: - Description: When the application starts after a failure or a graceful shutdown, it is consistently processing the last message of the previous batch even though it was already processed correctly without failure. We are making sure database writes are idempotent using postgres 9.6 feature. Is this the default behavior of spark? I added a code snippet with 2 streaming queries. One of the query is idempotent; since query2 is not idempotent, we are seeing duplicate entries in table. {code} object StructuredStreaming { def main(args: Array[String]): Unit = { val db_url = "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password" val spark = SparkSession .builder .appName("StructuredKafkaReader") .master("local[*]") .getOrCreate() spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint_research/") import spark.implicits._ val server = "10.205.82.113:9092" val topic = "checkpoint" val subscribeType="subscribe" val lines = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", server) .option(subscribeType, topic) .load().selectExpr("CAST(value AS STRING)").as[String] lines.printSchema() import org.apache.spark.sql.ForeachWriter val writer = new ForeachWriter[String] { def open(partitionId: Long, version: Long): Boolean = { println("After db props"); true } def process(value: String) = { val conn = DriverManager.getConnection(db_url) try{ conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint1 VALUES ('"+value+"')") } finally { conn.close() } } def close(errorOrNull: Throwable) = {} } import scala.concurrent.duration._ val query1 = lines.writeStream .outputMode("append") .queryName("checkpoint1") .trigger(ProcessingTime(30.seconds)) .foreach(writer) .start() val writer2 = new ForeachWriter[String] { 
def open(partitionId: Long, version: Long): Boolean = { println("After db props"); true } def process(value: String) = { val conn = DriverManager.getConnection(db_url) try{ conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint2 VALUES ('"+value+"')") } finally { conn.close() } } def close(errorOrNull: Throwable) = {} } import scala.concurrent.duration._ val query2 = lines.writeStream .outputMode("append") .queryName("checkpoint2") .trigger(ProcessingTime(30.seconds)) .foreach(writer2) .start() query2.awaitTermination() query1.awaitTermination() }} {code} was: When the application starts after a failure or a graceful shutdown, it is consistently processing the last message of the previous batch even though it was already processed correctly without failure. We are making sure database writes are idempotent using postgres 9.6 feature. Is this the default behavior of spark? I added a code snippet with 2 streaming queries. One of the query is idempotent; since query2 is not idempotent, we are seeing duplicate entries in table. 
--- object StructuredStreaming { def main(args: Array[String]): Unit = { val db_url = "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password" val spark = SparkSession .builder .appName("StructuredKafkaReader") .master("local[*]") .getOrCreate() spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint_research/") import spark.implicits._ val server = "10.205.82.113:9092" val topic = "checkpoint" val subscribeType="subscribe" val lines = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", server) .option(subscribeType, topic) .load().selectExpr("CAST(value AS STRING)").as[String] lines.printSchema() import org.apache.spark.sql.ForeachWriter val writer = new ForeachWriter[String] { def open(partitionId: Long, version: Long): Boolean = { println("After db props"); true } def process(value: String) = { val conn = DriverManager.getConnection(db_url) try{ conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint1 VALUES ('"+value+"')") } finally { conn.close() } } def close(errorOrNull: Throwable) = {} } import scala.concurrent.duration._ val query1 =
[jira] [Updated] (SPARK-20103) Spark structured streaming from kafka - last message processed again after resume from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20103: - Docs Text: (was: object StructuredStreaming { def main(args: Array[String]): Unit = { val db_url = "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password" val spark = SparkSession .builder .appName("StructuredKafkaReader") .master("local[*]") .getOrCreate() spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint_research/") import spark.implicits._ val server = "10.205.82.113:9092" val topic = "checkpoint" val subscribeType="subscribe" val lines = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", server) .option(subscribeType, topic) .load().selectExpr("CAST(value AS STRING)").as[String] lines.printSchema() import org.apache.spark.sql.ForeachWriter val writer = new ForeachWriter[String] { def open(partitionId: Long, version: Long): Boolean = { println("After db props"); true } def process(value: String) = { val conn = DriverManager.getConnection(db_url) try{ conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint1 VALUES ('"+value+"')") } finally { conn.close() } } def close(errorOrNull: Throwable) = {} } import scala.concurrent.duration._ val query1 = lines.writeStream .outputMode("append") .queryName("checkpoint1") .trigger(ProcessingTime(30.seconds)) .foreach(writer) .start() val writer2 = new ForeachWriter[String] { def open(partitionId: Long, version: Long): Boolean = { println("After db props"); true } def process(value: String) = { val conn = DriverManager.getConnection(db_url) try{ conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint2 VALUES ('"+value+"')") } finally { conn.close() } } def close(errorOrNull: Throwable) = {} } import scala.concurrent.duration._ val query2 = lines.writeStream .outputMode("append") .queryName("checkpoint2") 
.trigger(ProcessingTime(30.seconds)) .foreach(writer2) .start() query2.awaitTermination() query1.awaitTermination() }}) > Spark structured steaming from kafka - last message processed again after > resume from checkpoint > > > Key: SPARK-20103 > URL: https://issues.apache.org/jira/browse/SPARK-20103 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 > Environment: Linux, Spark 2.10 >Reporter: Rajesh Mutha > Labels: spark, streaming > > When the application starts after a failure or a graceful shutdown, it is > consistently processing the last message of the previous batch even though it > was already processed correctly without failure. > We are making sure database writes are idempotent using postgres 9.6 feature. > Is this the default behavior of spark? I added a code snippet with 2 > streaming queries. One of the query is idempotent; since query2 is not > idempotent, we are seeing duplicate entries in table. > {code} > object StructuredStreaming { > def main(args: Array[String]): Unit = { > val db_url = > "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password" > val spark = SparkSession > .builder > .appName("StructuredKafkaReader") > .master("local[*]") > .getOrCreate() > spark.conf.set("spark.sql.streaming.checkpointLocation", > "/tmp/checkpoint_research/") > import spark.implicits._ > val server = "10.205.82.113:9092" > val topic = "checkpoint" > val subscribeType="subscribe" > val lines = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", server) > .option(subscribeType, topic) > .load().selectExpr("CAST(value AS STRING)").as[String] > lines.printSchema() > import org.apache.spark.sql.ForeachWriter > val writer = new ForeachWriter[String] { >def open(partitionId: Long, version: Long): Boolean = { > println("After db props"); true >} >def process(value: String) = { > val conn = DriverManager.getConnection(db_url) > try{ 
>conn.createStatement().executeUpdate("INSERT INTO > PUBLIC.checkpoint1 VALUES ('"+value+"')") > } > finally { >
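The writer above builds its INSERT by string concatenation and the table has no uniqueness constraint, so a message replayed after checkpoint recovery lands twice. A minimal sketch of an idempotent sink, using Python's sqlite3 as a stand-in for Postgres (the table name `checkpoint1` is taken from the snippet; in Postgres 9.5+ the equivalent clause is `INSERT ... ON CONFLICT DO NOTHING`), with a parameterized statement instead of concatenation:

```python
# Sketch, not the reporter's actual code: a unique key plus a
# conflict-ignoring insert makes the sink safe under at-least-once replay.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkpoint1 (value TEXT PRIMARY KEY)")

def process(value):
    # Parameterized, conflict-ignoring insert: calling this twice with the
    # same message (a replay) leaves exactly one row behind.
    conn.execute("INSERT OR IGNORE INTO checkpoint1 VALUES (?)", (value,))
    conn.commit()

process("msg-1")
process("msg-1")  # replayed after recovery: no duplicate row
count = conn.execute("SELECT COUNT(*) FROM checkpoint1").fetchone()[0]
assert count == 1
```

With this shape, reprocessing the last message of the previous batch is harmless, which is the property the reporter relies on for query1.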
[jira] [Resolved] (SPARK-20120) spark-sql CLI support silent mode
[ https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20120. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 2.2.0 > spark-sql CLI support silent mode > - > > Key: SPARK-20120 > URL: https://issues.apache.org/jira/browse/SPARK-20120 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang > Fix For: 2.2.0 > > > It is similar to Hive silent mode, which just shows the query result. See: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20147) Cloning SessionState does not clone streaming query listeners
[ https://issues.apache.org/jira/browse/SPARK-20147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-20147: - Description: Cloning session should clone StreamingQueryListeners registered on the StreamingQueryListenerBus. Similar to SPARK-20048, https://github.com/apache/spark/pull/17379 was:Cloning session should clone StreamingQueryListeners registered on the StreamingQueryListenerBus. > Cloning SessionState does not clone streaming query listeners > - > > Key: SPARK-20147 > URL: https://issues.apache.org/jira/browse/SPARK-20147 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Kunal Khamar > > Cloning session should clone StreamingQueryListeners registered on the > StreamingQueryListenerBus. > Similar to SPARK-20048, https://github.com/apache/spark/pull/17379
[jira] [Updated] (SPARK-20147) Cloning SessionState does not clone streaming query listeners
[ https://issues.apache.org/jira/browse/SPARK-20147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-20147: - Description: Cloning session should clone StreamingQueryListeners registered on the StreamingQueryListenerBus. > Cloning SessionState does not clone streaming query listeners > - > > Key: SPARK-20147 > URL: https://issues.apache.org/jira/browse/SPARK-20147 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Kunal Khamar > > Cloning session should clone StreamingQueryListeners registered on the > StreamingQueryListenerBus.
[jira] [Created] (SPARK-20147) Cloning SessionState does not clone streaming query listeners
Kunal Khamar created SPARK-20147: Summary: Cloning SessionState does not clone streaming query listeners Key: SPARK-20147 URL: https://issues.apache.org/jira/browse/SPARK-20147 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Kunal Khamar
[jira] [Commented] (SPARK-19088) Optimize sequence type deserialization codegen
[ https://issues.apache.org/jira/browse/SPARK-19088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947972#comment-15947972 ] Apache Spark commented on SPARK-19088: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/17473 > Optimize sequence type deserialization codegen > -- > > Key: SPARK-19088 > URL: https://issues.apache.org/jira/browse/SPARK-19088 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michal Šenkýř >Assignee: Michal Šenkýř >Priority: Minor > Labels: performance > Fix For: 2.2.0 > > > Sequence type deserialization codegen added in [PR > #16240|https://github.com/apache/spark/pull/16240] should use a proper > builder instead of a conversion (using {{to}}) to avoid an additional pass. > This will require an additional {{MapObjects}}-like operation that will use > the provided builder instead of building an array.
[jira] [Updated] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-20144: --- Summary: spark.read.parquet no longer maintains ordering of the data (was: spark.read.parquet no long maintains the ordering the the data) > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > that when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was produced from. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so I am not sure if this is an issue with > 2.1.
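Since the reader is free to combine and reorder file splits, the robust fix on the workflow side is to persist an explicit ordering column and sort on it after reading. A plain-Python sketch of the idea (illustrative only, not Spark API; the `idx` column name is hypothetical):

```python
# Rows carry an explicit index; a reader that merges partitions may return
# them in any order, but the order is recoverable by sorting on the index.
import random

rows = [{"idx": i, "value": v} for i, v in enumerate(["a", "b", "c", "d", "e"])]

# Simulate the reader handing partitions back in an arbitrary order.
shuffled = rows[:]
random.shuffle(shuffled)

# Restore the intended order by sorting on the explicit ordering column.
restored = sorted(shuffled, key=lambda r: r["idx"])
assert [r["value"] for r in restored] == ["a", "b", "c", "d", "e"]
```

The same pattern in Spark would be writing a monotonically increasing index column alongside the data and calling `orderBy` on it after `spark.read.parquet`, rather than assuming file order survives the scan.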
[jira] [Commented] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
[ https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947781#comment-15947781 ] sam elamin commented on SPARK-20145: if no one is picking this up, I'd love to take it > "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't > > > Key: SPARK-20145 > URL: https://issues.apache.org/jira/browse/SPARK-20145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski > > Executed at clean tip of the master branch, with all default settings: > scala> spark.sql("SELECT * FROM range(1)") > res1: org.apache.spark.sql.DataFrame = [id: bigint] > scala> spark.sql("SELECT * FROM RANGE(1)") > org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a > table-valued function; line 1 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > ... > I believe it should be case insensitive?
[jira] [Resolved] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs
[ https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20009. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.2.0 > Use user-friendly DDL formats for defining a schema in user-facing APIs > > > Key: SPARK-20009 > URL: https://issues.apache.org/jira/browse/SPARK-20009 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 2.2.0 > > > In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the > DDL parser to convert a DDL string into a schema. Then, we can use DDL > formats in some existing APIs, e.g., functions.from_json > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.
[jira] [Updated] (SPARK-20009) Use user-friendly DDL formats for defining a schema in functions.from_json
[ https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20009: Summary: Use user-friendly DDL formats for defining a schema in functions.from_json (was: Use user-friendly DDL formats for defining a schema in user-facing APIs) > Use user-friendly DDL formats for defining a schema in functions.from_json > --- > > Key: SPARK-20009 > URL: https://issues.apache.org/jira/browse/SPARK-20009 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 2.2.0 > > > In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the > DDL parser to convert a DDL string into a schema. Then, we can use DDL > formats in some existing APIs, e.g., functions.from_json > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.
[jira] [Resolved] (SPARK-20048) Cloning SessionState does not clone query execution listeners
[ https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20048. --- Resolution: Fixed Assignee: Kunal Khamar Fix Version/s: 2.2.0 > Cloning SessionState does not clone query execution listeners > - > > Key: SPARK-20048 > URL: https://issues.apache.org/jira/browse/SPARK-20048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kunal Khamar >Assignee: Kunal Khamar > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()
[ https://issues.apache.org/jira/browse/SPARK-19999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947761#comment-15947761 ] sam elamin commented on SPARK-19999: Can someone assign this to me? Happy to take it over. > Test failures in Spark Core due to java.nio.Bits.unaligned() > > > Key: SPARK-19999 > URL: https://issues.apache.org/jira/browse/SPARK-19999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 > Environment: Ubuntu 14.04 ppc64le > $ java -version > openjdk version "1.8.0_111" > OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14) > OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) >Reporter: Sonia Garudi > Labels: ppc64le > Attachments: Core.patch > > > There are multiple test failures seen in Spark Core project with the > following error message : > {code:borderStyle=solid} > java.lang.IllegalArgumentException: requirement failed: No support for > unaligned Unsafe. Set spark.memory.offHeap.enabled to false. > {code} > These errors occur due to java.nio.Bits.unaligned(), which does not return > true for the ppc64le arch.
[jira] [Commented] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()
[ https://issues.apache.org/jira/browse/SPARK-19999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947755#comment-15947755 ] Apache Spark commented on SPARK-19999: -- User 'samelamin' has created a pull request for this issue: https://github.com/apache/spark/pull/17472 > Test failures in Spark Core due to java.nio.Bits.unaligned() > > > Key: SPARK-19999 > URL: https://issues.apache.org/jira/browse/SPARK-19999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 > Environment: Ubuntu 14.04 ppc64le > $ java -version > openjdk version "1.8.0_111" > OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14) > OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) >Reporter: Sonia Garudi > Labels: ppc64le > Attachments: Core.patch > > > There are multiple test failures seen in Spark Core project with the > following error message : > {code:borderStyle=solid} > java.lang.IllegalArgumentException: requirement failed: No support for > unaligned Unsafe. Set spark.memory.offHeap.enabled to false. > {code} > These errors occur due to java.nio.Bits.unaligned(), which does not return > true for the ppc64le arch.
[jira] [Assigned] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()
[ https://issues.apache.org/jira/browse/SPARK-19999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19999: Assignee: Apache Spark > Test failures in Spark Core due to java.nio.Bits.unaligned() > > > Key: SPARK-19999 > URL: https://issues.apache.org/jira/browse/SPARK-19999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 > Environment: Ubuntu 14.04 ppc64le > $ java -version > openjdk version "1.8.0_111" > OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14) > OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) >Reporter: Sonia Garudi >Assignee: Apache Spark > Labels: ppc64le > Attachments: Core.patch > > > There are multiple test failures seen in Spark Core project with the > following error message : > {code:borderStyle=solid} > java.lang.IllegalArgumentException: requirement failed: No support for > unaligned Unsafe. Set spark.memory.offHeap.enabled to false. > {code} > These errors occur due to java.nio.Bits.unaligned(), which does not return > true for the ppc64le arch.
[jira] [Assigned] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()
[ https://issues.apache.org/jira/browse/SPARK-19999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19999: Assignee: (was: Apache Spark) > Test failures in Spark Core due to java.nio.Bits.unaligned() > > > Key: SPARK-19999 > URL: https://issues.apache.org/jira/browse/SPARK-19999 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 > Environment: Ubuntu 14.04 ppc64le > $ java -version > openjdk version "1.8.0_111" > OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14) > OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) >Reporter: Sonia Garudi > Labels: ppc64le > Attachments: Core.patch > > > There are multiple test failures seen in Spark Core project with the > following error message : > {code:borderStyle=solid} > java.lang.IllegalArgumentException: requirement failed: No support for > unaligned Unsafe. Set spark.memory.offHeap.enabled to false. > {code} > These errors occur due to java.nio.Bits.unaligned(), which does not return > true for the ppc64le arch.
[jira] [Comment Edited] (SPARK-16938) Cannot resolve column name after a join
[ https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947732#comment-15947732 ] sam elamin edited comment on SPARK-16938 at 3/29/17 7:20 PM: - [~cloud_fan] could you please check my comment on the github pr? I am happy picking up this ticket, can someone assign it to me please. was (Author: samelamin): [~cloud_fan] I am happy picking up this ticket, can someone assign it to me please > Cannot resolve column name after a join > --- > > Key: SPARK-16938 > URL: https://issues.apache.org/jira/browse/SPARK-16938 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > Found a change of behavior on spark-2.0.0, which breaks a query in our code > base. > The following works on previous spark versions, 1.6.1 up to 2.0.0-preview : > {code} > val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa") > val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb") > dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", > "dfb.id")) > {code} > but fails with spark-2.0.0 with the exception : > {code} > Cannot resolve column name "dfa.id" among (id, a, id, b); > org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" > among (id, a, id, b); > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840) > ... > {code}
[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join
[ https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947732#comment-15947732 ] sam elamin commented on SPARK-16938: [~cloud_fan] I am happy picking up this ticket, can someone assign it to me please > Cannot resolve column name after a join > --- > > Key: SPARK-16938 > URL: https://issues.apache.org/jira/browse/SPARK-16938 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > Found a change of behavior on spark-2.0.0, which breaks a query in our code > base. > The following works on previous spark versions, 1.6.1 up to 2.0.0-preview : > {code} > val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa") > val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb") > dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", > "dfb.id")) > {code} > but fails with spark-2.0.0 with the exception : > {code} > Cannot resolve column name "dfa.id" among (id, a, id, b); > org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" > among (id, a, id, b); > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840) > ... > {code}
[jira] [Commented] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
[ https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947726#comment-15947726 ] Bo Meng commented on SPARK-20145: - From the current code, I can see builtinFunctions is using an exact match for looking up ("range" as a key is all lowercase). > "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't > > > Key: SPARK-20145 > URL: https://issues.apache.org/jira/browse/SPARK-20145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski > > Executed at clean tip of the master branch, with all default settings: > scala> spark.sql("SELECT * FROM range(1)") > res1: org.apache.spark.sql.DataFrame = [id: bigint] > scala> spark.sql("SELECT * FROM RANGE(1)") > org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a > table-valued function; line 1 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > ... > I believe it should be case insensitive?
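Bo Meng's observation can be sketched in a few lines: with a registry keyed by lowercase names, an exact-match lookup misses `RANGE`, while normalizing the key before the lookup makes resolution case-insensitive (hypothetical registry, not Spark's actual code):

```python
# Toy registry keyed by lowercase name, mirroring builtinFunctions' "range" key.
builtin_functions = {"range": lambda n: list(range(n))}

def resolve(name):
    # Exact-match lookup: "RANGE" finds nothing, analogous to the
    # AnalysisException in the issue.
    return builtin_functions.get(name)

def resolve_ci(name):
    # Case-insensitive lookup: lowercase the key before matching.
    return builtin_functions.get(name.lower())

assert resolve("RANGE") is None
assert resolve_ci("RANGE")(3) == [0, 1, 2]
```

The fix direction this suggests is simply normalizing the function name (or using a case-insensitive map) at the resolution site.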
[jira] [Resolved] (SPARK-19955) Update run-tests to support conda
[ https://issues.apache.org/jira/browse/SPARK-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-19955. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17355 [https://github.com/apache/spark/pull/17355] > Update run-tests to support conda > - > > Key: SPARK-19955 > URL: https://issues.apache.org/jira/browse/SPARK-19955 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark >Affects Versions: 2.1.1, 2.2.0 >Reporter: holdenk >Assignee: holdenk > Fix For: 2.2.0 > > > The current test scripts only look at system python. On the Jenkins workers > we also have Conda installed, we should support looking for Python versions > in Conda and testing with those. > This could unblock some of the 2.6 deprecation work and more easily enable > testing of pip packaging.
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947678#comment-15947678 ] Apache Spark commented on SPARK-3577: - User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/17471 > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > Attachments: spill_size.jpg > > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter).
[jira] [Assigned] (SPARK-19955) Update run-tests to support conda
[ https://issues.apache.org/jira/browse/SPARK-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-19955: --- Assignee: holdenk > Update run-tests to support conda > - > > Key: SPARK-19955 > URL: https://issues.apache.org/jira/browse/SPARK-19955 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, PySpark >Affects Versions: 2.1.1, 2.2.0 >Reporter: holdenk >Assignee: holdenk > > The current test scripts only look at system python. On the Jenkins workers > we also have Conda installed, we should support looking for Python versions > in Conda and testing with those. > This could unblock some of the 2.6 deprecation work and more easily enable > testing of pip packaging.
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947669#comment-15947669 ] Sital Kedia commented on SPARK-3577: I am making a change to report correct spill data size on disk. > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > Attachments: spill_size.jpg > > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter).
[jira] [Assigned] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema
[ https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20146: Assignee: (was: Apache Spark) > Column comment information is missing for Thrift Server's TableSchema > - > > Key: SPARK-20146 > URL: https://issues.apache.org/jira/browse/SPARK-20146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bo Meng >Priority: Minor > > I found this issue while doing some tests against Thrift Server. > The column comment information was missing when querying the TableSchema. > Currently, all the comments are ignored. > I will post a fix shortly.
[jira] [Commented] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema
[ https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947665#comment-15947665 ] Apache Spark commented on SPARK-20146: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/17470 > Column comment information is missing for Thrift Server's TableSchema > - > > Key: SPARK-20146 > URL: https://issues.apache.org/jira/browse/SPARK-20146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bo Meng >Priority: Minor > > I found this issue while doing some tests against Thrift Server. > The column comment information was missing when querying the TableSchema. > Currently, all the comments are ignored. > I will post a fix shortly.
[jira] [Assigned] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema
[ https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20146: Assignee: Apache Spark > Column comment information is missing for Thrift Server's TableSchema > - > > Key: SPARK-20146 > URL: https://issues.apache.org/jira/browse/SPARK-20146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bo Meng >Assignee: Apache Spark >Priority: Minor > > I found this issue while doing some tests against Thrift Server. > The column comment information was missing when querying the TableSchema. > Currently, all the comments are ignored. > I will post a fix shortly.
[jira] [Created] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema
Bo Meng created SPARK-20146: --- Summary: Column comment information is missing for Thrift Server's TableSchema Key: SPARK-20146 URL: https://issues.apache.org/jira/browse/SPARK-20146 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Bo Meng Priority: Minor I found this issue while doing some tests against Thrift Server. The column comment information was missing when querying the TableSchema. Currently, all the comments are ignored. I will post a fix shortly.
[jira] [Commented] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder
[ https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947627#comment-15947627 ] Josh Rosen commented on SPARK-18692: We can't get the full Jekyll doc build running until we have Jekyll installed on all workers, but the extra code to just test unidoc isn't that much:
{code}
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 04035b3..46d6b8a 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -344,6 +344,19 @@ def build_spark_sbt(hadoop_version):
     exec_sbt(profiles_and_goals)
 
 
+def build_spark_unidoc_sbt(hadoop_version):
+    set_title_and_block("Building Unidoc API Documentation", "BLOCK_DOCUMENTATION")
+    # Enable all of the profiles for the build:
+    build_profiles = get_hadoop_profiles(hadoop_version) + modules.root.build_profile_flags
+    sbt_goals = ["unidoc"]
+    profiles_and_goals = build_profiles + sbt_goals
+
+    print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
+          " ".join(profiles_and_goals))
+
+    exec_sbt(profiles_and_goals)
+
+
 def build_spark_assembly_sbt(hadoop_version):
     # Enable all of the profiles for the build:
     build_profiles = get_hadoop_profiles(hadoop_version) + modules.root.build_profile_flags
@@ -576,6 +589,8 @@ def main():
         # Since we did not build assembly/package before running dev/mima, we need to
         # do it here because the tests still rely on it; see SPARK-13294 for details.
         build_spark_assembly_sbt(hadoop_version)
+        # Make sure that Java and Scala API documentation can be generated
+        build_spark_unidoc_sbt(hadoop_version)
 
     # run the test suites
     run_scala_tests(build_tool, hadoop_version, test_modules, excluded_tags)
{code}
On my laptop this added about 1.5 minutes of extra run time. One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
It turns out that it's also possible to disable Java 8's strict doc validation, so we could consider that as well. The master builder and PR builder should both be running Java 8 right now. The dedicated doc builder jobs are still using Java 7 (for convoluted legacy reasons) but I'll push a conf change to fix that. Assuming that we want to use the stricter validation: [~hyukjin.kwon], could you help to fix the current Javadoc breaks and include the above diff to test the unidoc as part of our dev/run-tests process? I'll be happy to help review and merge this fix. > Test Java 8 unidoc build on Jenkins master builder > -- > > Key: SPARK-18692 > URL: https://issues.apache.org/jira/browse/SPARK-18692 > Project: Spark > Issue Type: Test > Components: Build, Documentation >Reporter: Joseph K. Bradley > Labels: jenkins > > [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break. It > would be great to add this build to the Spark master builder on Jenkins to > make it easier to identify PRs which break doc builds. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Patterson updated SPARK-20132: -- Comment: was deleted (was: I have a commit with the documentation: https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b I will make a more formal PR tonight.) > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > Labels: documentation, newbie > > Four Column string functions do not have documentation for PySpark: > rlike > like > startswith > endswith > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20132: Assignee: (was: Apache Spark) > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > Labels: documentation, newbie > > Four Column string functions do not have documentation for PySpark: > rlike > like > startswith > endswith > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947538#comment-15947538 ] Apache Spark commented on SPARK-20132: -- User 'map222' has created a pull request for this issue: https://github.com/apache/spark/pull/17469 > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > Labels: documentation, newbie > > Four Column string functions do not have documentation for PySpark: > rlike > like > startswith > endswith > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20132: Assignee: Apache Spark > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Assignee: Apache Spark >Priority: Minor > Labels: documentation, newbie > > Four Column string functions do not have documentation for PySpark: > rlike > like > startswith > endswith > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
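The `_bin_op` mechanism mentioned in the issue can be sketched in plain Python. This is a simplified illustration of how a generated method picks up a passed-in docstring, not PySpark's actual implementation; the `Column` class and `_value` attribute below are invented for the example.

```python
def _bin_op(name, doc="binary operator"):
    """Create a method that forwards to the underlying operator `name`
    and attaches `doc` as its docstring (simplified from PySpark)."""
    def _(self, other):
        return getattr(self._value, name)(other)
    _.__doc__ = doc
    _.__name__ = name
    return _

class Column:
    def __init__(self, value):
        self._value = value
    # Without the second argument the generated method only gets the
    # generic default doc; the fix is to pass an explicit docstring.
    startswith = _bin_op("startswith",
                         "Return True if the column value starts with `other`.")

c = Column("spark")
print(c.startswith("sp"))          # True
print(Column.startswith.__doc__)
```

The documentation fix amounts to replacing calls like `_bin_op("startswith")` with calls that pass a docstring containing usage examples.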
[jira] [Resolved] (SPARK-20059) HbaseCredentialProvider uses wrong classloader
[ https://issues.apache.org/jira/browse/SPARK-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20059. Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.2.0 2.1.1 > HbaseCredentialProvider uses wrong classloader > -- > > Key: SPARK-20059 > URL: https://issues.apache.org/jira/browse/SPARK-20059 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0, 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.1.1, 2.2.0 > > > {{HBaseCredentialProvider}} uses the system classloader instead of the child > classloader, which makes HBase jars specified with {{--jars}} fail to > work, so we should use the right class loader. > Besides, in yarn cluster mode jars specified with {{--jars}} are not added into > the client's class path, which makes it fail to load HBase jars and issue > tokens in our scenario. Also, some customized credential providers cannot be > registered into the client. > So here I will fix these two issues. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join
[ https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947485#comment-15947485 ] Dongjoon Hyun commented on SPARK-16938: --- Sure, go ahead. I'm not working on this. > Cannot resolve column name after a join > --- > > Key: SPARK-16938 > URL: https://issues.apache.org/jira/browse/SPARK-16938 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > Found a change of behavior on spark-2.0.0, which breaks a query in our code > base. > The following works on previous spark versions, 1.6.1 up to 2.0.0-preview : > {code} > val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa") > val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb") > dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", > "dfb.id")) > {code} > but fails with spark-2.0.0 with the exception : > {code} > Cannot resolve column name "dfa.id" among (id, a, id, b); > org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" > among (id, a, id, b); > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join
[ https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947395#comment-15947395 ] sam elamin commented on SPARK-16938: [~dongjoon] I can pick this up if you dont mind, are you still not working on it? > Cannot resolve column name after a join > --- > > Key: SPARK-16938 > URL: https://issues.apache.org/jira/browse/SPARK-16938 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Mathieu D >Priority: Minor > > Found a change of behavior on spark-2.0.0, which breaks a query in our code > base. > The following works on previous spark versions, 1.6.1 up to 2.0.0-preview : > {code} > val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa") > val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb") > dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", > "dfb.id")) > {code} > but fails with spark-2.0.0 with the exception : > {code} > Cannot resolve column name "dfa.id" among (id, a, id, b); > org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" > among (id, a, id, b); > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817) > at > org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814) > at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
Juliusz Sompolski created SPARK-20145: - Summary: "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't Key: SPARK-20145 URL: https://issues.apache.org/jira/browse/SPARK-20145 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Juliusz Sompolski Executed at the clean tip of the master branch, with all default settings:
{code}
scala> spark.sql("SELECT * FROM range(1)")
res1: org.apache.spark.sql.DataFrame = [id: bigint]

scala> spark.sql("SELECT * FROM RANGE(1)")
org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a table-valued function; line 1 pos 14
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126)
  at org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
  ...
{code}
I believe it should be case-insensitive? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
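One way such a fix could work — normalizing the function name before lookup — can be sketched in plain Python. This is a hypothetical illustration of the idea, not Spark's actual `ResolveTableValuedFunctions` rule; the registry and `resolve_tvf` helper are invented for the example.

```python
# Hypothetical registry of table-valued functions, keyed by lower-cased name.
_tvf_registry = {"range": lambda n: list(range(n))}

def resolve_tvf(name):
    """Resolve a table-valued function case-insensitively."""
    fn = _tvf_registry.get(name.lower())
    if fn is None:
        raise ValueError(f"could not resolve `{name}` to a table-valued function")
    return fn

print(resolve_tvf("range")(3))  # [0, 1, 2]
print(resolve_tvf("RANGE")(3))  # [0, 1, 2]
```

The key point is that the lookup key, not the user's input, is normalized, so error messages can still echo the name exactly as the user wrote it.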
[jira] [Assigned] (SPARK-20143) DataType.fromJson should throw an exception with better message
[ https://issues.apache.org/jira/browse/SPARK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20143: Assignee: (was: Apache Spark) > DataType.fromJson should throw an exception with better message > --- > > Key: SPARK-20143 > URL: https://issues.apache.org/jira/browse/SPARK-20143 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, > {code} > scala> import org.apache.spark.sql.types.DataType > import org.apache.spark.sql.types.DataType > scala> DataType.fromJson( abcd) > java.util.NoSuchElementException: key not found: abcd > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 48 elided > scala> DataType.fromJson( """{"abcd":"a"}""") > scala.MatchError: JObject(List((abcd,JString(a (of class > org.json4s.JsonAST$JObject) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 
48 elided > scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""") > scala.MatchError: JObject(List((a,JInt(123 (of class > org.json4s.JsonAST$JObject) > at > org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169) > at > org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) > at > org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) > at scala.collection.immutable.List.map(List.scala:273) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 48 elided > {code} > {{DataType.fromJson}} throws non-readable error messages for the json input. > We could improve this rather than throwing {{scala.MatchError}} or > {{java.util.NoSuchElementException}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20143) DataType.fromJson should throw an exception with better message
[ https://issues.apache.org/jira/browse/SPARK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20143: Assignee: Apache Spark > DataType.fromJson should throw an exception with better message > --- > > Key: SPARK-20143 > URL: https://issues.apache.org/jira/browse/SPARK-20143 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > Currently, > {code} > scala> import org.apache.spark.sql.types.DataType > import org.apache.spark.sql.types.DataType > scala> DataType.fromJson( abcd) > java.util.NoSuchElementException: key not found: abcd > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 48 elided > scala> DataType.fromJson( """{"abcd":"a"}""") > scala.MatchError: JObject(List((abcd,JString(a (of class > org.json4s.JsonAST$JObject) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 
48 elided > scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""") > scala.MatchError: JObject(List((a,JInt(123 (of class > org.json4s.JsonAST$JObject) > at > org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169) > at > org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) > at > org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) > at scala.collection.immutable.List.map(List.scala:273) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 48 elided > {code} > {{DataType.fromJson}} throws non-readable error messages for the json input. > We could improve this rather than throwing {{scala.MatchError}} or > {{java.util.NoSuchElementException}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20143) DataType.fromJson should throw an exception with better message
[ https://issues.apache.org/jira/browse/SPARK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947237#comment-15947237 ] Apache Spark commented on SPARK-20143: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17468 > DataType.fromJson should throw an exception with better message > --- > > Key: SPARK-20143 > URL: https://issues.apache.org/jira/browse/SPARK-20143 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, > {code} > scala> import org.apache.spark.sql.types.DataType > import org.apache.spark.sql.types.DataType > scala> DataType.fromJson( abcd) > java.util.NoSuchElementException: key not found: abcd > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 48 elided > scala> DataType.fromJson( """{"abcd":"a"}""") > scala.MatchError: JObject(List((abcd,JString(a (of class > org.json4s.JsonAST$JObject) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 
48 elided > scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""") > scala.MatchError: JObject(List((a,JInt(123 (of class > org.json4s.JsonAST$JObject) > at > org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169) > at > org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) > at > org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) > at scala.collection.immutable.List.map(List.scala:273) > at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150) > at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) > ... 48 elided > {code} > {{DataType.fromJson}} throws non-readable error messages for the json input. > We could improve this rather than throwing {{scala.MatchError}} or > {{java.util.NoSuchElementException}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20144) spark.read.parquet no longer maintains the ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-20144: --- Description: Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was reproduced with. This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reordered them. This breaks our workflows because they assume the ordering of the data. Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1. was: Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was reproduced with. This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reordered them. This breaks our workout because they assume the ordering of the data. Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1. > spark.read.parquet no long maintains the ordering the the data > -- > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20144) spark.read.parquet no longer maintains the ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-20144: --- Description: Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was reproduced with. This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reordered them. This breaks our workout because they assume the ordering of the data. Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1. was:Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was reproduced with. This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reordered them. This breaks our workout because they assume the ordering of the data. Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1. > spark.read.parquet no long maintains the ordering the the data > -- > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reordered them. This breaks our workout because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with > 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20144) spark.read.parquet no longer maintains the ordering of the data
Li Jin created SPARK-20144: -- Summary: spark.read.parquet no longer maintains the ordering of the data Key: SPARK-20144 URL: https://issues.apache.org/jira/browse/SPARK-20144 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Li Jin Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was produced from. This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data. Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so we are not sure if this is an issue with 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
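A common workaround for order-sensitive pipelines is to persist an explicit ordering column rather than relying on file or partition order, and to sort by it after reading. The idea can be sketched in plain Python, independent of Spark (the reordering below simulates what a reader that combines partitions might do):

```python
# Sketch: rows written with an explicit index survive any reordering on read.
rows = ["a", "b", "c", "d"]
written = list(enumerate(rows))      # (index, value) pairs persisted with the data

# Simulate a reader that combines and reorders partitions.
reordered = [written[2], written[0], written[3], written[1]]

# Restoring the original order only requires sorting by the stored index.
restored = [value for _, value in sorted(reordered)]
print(restored)  # ['a', 'b', 'c', 'd']
```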
[jira] [Created] (SPARK-20143) DataType.fromJson should throw an exception with better message
Hyukjin Kwon created SPARK-20143: Summary: DataType.fromJson should throw an exception with better message Key: SPARK-20143 URL: https://issues.apache.org/jira/browse/SPARK-20143 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Hyukjin Kwon Priority: Minor Currently, {code} scala> import org.apache.spark.sql.types.DataType import org.apache.spark.sql.types.DataType scala> DataType.fromJson( abcd) java.util.NoSuchElementException: key not found: abcd at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118) at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132) at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) ... 48 elided scala> DataType.fromJson( """{"abcd":"a"}""") scala.MatchError: JObject(List((abcd,JString(a (of class org.json4s.JsonAST$JObject) at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130) at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) ... 48 elided scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""") scala.MatchError: JObject(List((a,JInt(123 (of class org.json4s.JsonAST$JObject) at org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169) at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150) at scala.collection.immutable.List.map(List.scala:273) at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150) at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104) ... 48 elided {code} {{DataType.fromJson}} throws non-readable error messages for the json input. 
We could improve this rather than throwing {{scala.MatchError}} or {{java.util.NoSuchElementException}}.
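The kind of improvement the ticket asks for can be sketched outside Spark. The following Python sketch is hypothetical (it is not Spark's Scala implementation, and the message wording and type names are illustrative): validate the decoded JSON and raise an error that echoes the offending input, instead of surfacing an internal MatchError/NoSuchElementException equivalent.

```python
import json

# Hypothetical sketch of "fail with a readable message"; the set of type
# names and the exact message format are illustrative, not Spark's API.
ATOMIC_TYPES = {"string", "integer", "long", "double", "boolean", "timestamp"}

def parse_data_type(node):
    """Parse a JSON-decoded type descriptor, raising readable errors."""
    if isinstance(node, str):
        if node in ATOMIC_TYPES:
            return node
        raise ValueError(
            "Failed to convert the JSON string %r to a data type." % node)
    if isinstance(node, dict) and node.get("type") == "struct":
        for field in node.get("fields", []):
            if not (isinstance(field, dict) and {"name", "type"} <= field.keys()):
                raise ValueError(
                    "Failed to convert the JSON string '%s' to a field."
                    % json.dumps(field))
            parse_data_type(field["type"])  # recurse into the field's type
        return node
    raise ValueError(
        "Failed to convert the JSON string '%s' to a data type." % json.dumps(node))
```

With this shape, the three failing inputs from the report each produce a ValueError naming the exact JSON fragment that could not be parsed, rather than an internal pattern-match failure.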
[jira] [Created] (SPARK-20142) Move RewriteDistinctAggregates later into query execution
Juliusz Sompolski created SPARK-20142: - Summary: Move RewriteDistinctAggregates later into query execution Key: SPARK-20142 URL: https://issues.apache.org/jira/browse/SPARK-20142 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 2.1.0 Reporter: Juliusz Sompolski Priority: Minor The rewrite of distinct aggregates complicates their analysis by later optimizer rules; move it to a later phase of query execution.
[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947114#comment-15947114 ] Nick Pentreath commented on SPARK-14174: The actual fix in the PR is pretty small - essentially just adding an {{rdd.sample}} call (similar to the old {{mllib}} gradient descent impl). So if we can see some good speed improvements on a relatively large class of input datasets, this seems like an easy win. From the performance tests above it seems like there's a significant win even for low-dimensional vectors. For higher dimensions the improvement may be as large or perhaps larger. [~podongfeng] it may be best to add a few different cases to the performance tests to illustrate the behavior for different cases (and if not for certain cases, we should document that): # small dimension, dense # high dimension, dense # small dimension, sparse # high dimension, sparse [~rnowling] do you have time to check out the PR here? It seems similar in spirit to what you had done and just uses the built-in RDD sampling (which I think [~derrickburns] mentioned in SPARK-2308). > Accelerate KMeans via Mini-Batch EM > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > I have implemented mini-batch kmeans in MLlib, and the acceleration is really > significant.
> The MiniBatch KMeans is named XMeans in the following lines. > {code} > val path = "/tmp/mnist8m.scale" > val data = MLUtils.loadLibSVMFile(sc, path) > val vecs = data.map(_.features).persist() > val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", seed=123l) > km.computeCost(vecs) > res0: Double = 3.317029898599564E8 > val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) > xm.computeCost(vecs) > res1: Double = 3.3169865959604424E8 > val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) > xm2.computeCost(vecs) > res2: Double = 3.317195831216454E8 > {code} > All three training runs above reached the max number of iterations, 10. > We can see that the WSSSEs are almost the same, while their speed performance > differs significantly: > {code} > KMeans 2876 sec > MiniBatch KMeans (fraction=0.1) 263 sec > MiniBatch KMeans (fraction=0.01) 90 sec > {code} > With an appropriate fraction, the bigger the dataset, the higher the speedup. > The data used above has 8,100,000 samples and 784 features. It can be > downloaded here > (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2) > Comparison of K-Means and MiniBatchKMeans on sklearn: > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
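For intuition, here is a tiny pure-Python mini-batch k-means sketch on 1-D data. It is illustrative only: it is not the MLlib implementation, and the decaying per-center step size is one common update rule for this algorithm, not necessarily the one in the PR. The speedup discussed above comes from the same idea: each iteration fits on a random fraction of the data instead of the full dataset.

```python
import random

def mini_batch_kmeans(points, k, fraction=0.5, iterations=50, seed=123):
    """Mini-batch k-means on 1-D data (illustrative sketch).

    Each iteration samples roughly `fraction` of the points (cf. the
    miniBatchFraction parameter above) and nudges the nearest center
    toward each sampled point with a decaying step size."""
    rng = random.Random(seed)
    pts = sorted(points)
    n = len(pts)
    # Deterministic, spread-out initialization (just for this sketch).
    centers = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    counts = [0] * k
    for _ in range(iterations):
        batch = [p for p in pts if rng.random() < fraction]
        for p in batch:
            j = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            counts[j] += 1
            eta = 1.0 / counts[j]                      # decaying step size
            centers[j] = (1.0 - eta) * centers[j] + eta * p
    return sorted(centers)
```

Because only a `fraction` of the points are touched per iteration, the cost per iteration drops proportionally, which is exactly the trade-off the WSSSE/time numbers above demonstrate.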
[jira] [Commented] (SPARK-20141) jdbc query gives ORA-00903
[ https://issues.apache.org/jira/browse/SPARK-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947106#comment-15947106 ] Sean Owen commented on SPARK-20141: --- That sounds like an Oracle error. There's no detail that suggests there is a Spark error here. > jdbc query gives ORA-00903 > -- > > Key: SPARK-20141 > URL: https://issues.apache.org/jira/browse/SPARK-20141 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.2 > Environment: Windows7 >Reporter: sergio > Labels: windows > Attachments: exception.png > > > Error while querying to external oracle database. > It works this way and then I can work with jdbcDF: > val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir", > "user" -> "my_login", > "password" -> "my_password", > "dbtable" -> "siebel.table1")).load() > while when trying to send some query, it fails > val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir", > "user" -> "my_login", > "password" -> "my_password", > "dbtable" -> "select * from siebel.table1 where call_id= > '1-1TMC4D4U'")).load() > This query works fine in SQLDeveloper, or when i registerTempTable, but when > I put direct query instead of schema.table, it gives this error: > java.sql.SQLSyntaxErrorException: ORA-00903: > It looks like spark sends wrong query. > I tried everything in "JDBC To Other Databases": > http://spark.apache.org/docs/latest/sql-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947099#comment-15947099 ] Apache Spark commented on SPARK-20140: -- User 'yssharma' has created a pull request for this issue: https://github.com/apache/spark/pull/17467 > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20140: Assignee: Apache Spark > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Apache Spark > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20140: Assignee: (was: Apache Spark) > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20141) jdbc query gives ORA-00903
[ https://issues.apache.org/jira/browse/SPARK-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sergio updated SPARK-20141: --- Attachment: exception.png > jdbc query gives ORA-00903 > -- > > Key: SPARK-20141 > URL: https://issues.apache.org/jira/browse/SPARK-20141 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.2 > Environment: Windows7 >Reporter: sergio > Labels: windows > Attachments: exception.png > > > Error while querying to external oracle database. > It works this way and then I can work with jdbcDF: > val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir", > "user" -> "my_login", > "password" -> "my_password", > "dbtable" -> "siebel.table1")).load() > while when trying to send some query, it fails > val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir", > "user" -> "my_login", > "password" -> "my_password", > "dbtable" -> "select * from siebel.table1 where call_id= > '1-1TMC4D4U'")).load() > This query works fine in SQLDeveloper, or when i registerTempTable, but when > I put direct query instead of schema.table, it gives this error: > java.sql.SQLSyntaxErrorException: ORA-00903: > It looks like spark sends wrong query. > I tried everything in "JDBC To Other Databases": > http://spark.apache.org/docs/latest/sql-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20141) jdbc query gives ORA-00903
sergio created SPARK-20141: -- Summary: jdbc query gives ORA-00903 Key: SPARK-20141 URL: https://issues.apache.org/jira/browse/SPARK-20141 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.2 Environment: Windows7 Reporter: sergio Error when querying an external Oracle database. It works this way, and then I can work with jdbcDF: val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir", "user" -> "my_login", "password" -> "my_password", "dbtable" -> "siebel.table1")).load() but when trying to send a query, it fails: val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir", "user" -> "my_login", "password" -> "my_password", "dbtable" -> "select * from siebel.table1 where call_id= '1-1TMC4D4U'")).load() This query works fine in SQL Developer, or when I register a temp table, but when I put a direct query instead of schema.table, it gives this error: java.sql.SQLSyntaxErrorException: ORA-00903: It looks like Spark sends a wrong query. I tried everything in "JDBC To Other Databases": http://spark.apache.org/docs/latest/sql-programming-guide.html
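For context on why a bare SELECT string fails here: the {{dbtable}} option is embedded into a {{FROM}} clause, so it must be something valid in that position -- a table name, or a parenthesized subquery with an alias. A simplified Python sketch of the SQL the JDBC source ends up generating (illustrative; the real generated SQL also includes column pruning and pushed-down filters):

```python
def jdbc_select(dbtable, columns="*"):
    """Roughly what Spark's JDBC source builds from the dbtable option."""
    return "SELECT {} FROM {}".format(columns, dbtable)

# A bare SELECT string produces invalid SQL -- hence Oracle's ORA-00903:
#   SELECT * FROM select * from siebel.table1 where call_id = '1-1TMC4D4U'
# A parenthesized, aliased subquery is valid in a FROM clause:
fixed = jdbc_select("(select * from siebel.table1 where call_id = '1-1TMC4D4U') t")
```

So the likely workaround for the report above is to pass the query as a subquery, e.g. {{.option("dbtable", "(select * from siebel.table1 where call_id = '1-1TMC4D4U') t")}}, as described in the "JDBC To Other Databases" guide the reporter links.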
[jira] [Updated] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished
[ https://issues.apache.org/jira/browse/SPARK-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etti Gur updated SPARK-20139: - Description: Spark UI reports partial success for completed stage while log shows all tasks are finished - i.e.: We have a stage that is presented under completed stages on spark UI, but the successful tasks are shown like so: (146372/524964) not as you'd expect (524964/524964) Looking at the application master log shows all tasks in that stage are successful: 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) (524963/524964) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) (20234/20262) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 (TID 537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) (20235/20262) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 (TID 540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) (20236/20262) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 (TID 544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) (20237/20262) 17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 (TID 544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) (20238/20262) 17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 (TID 524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) (524964/524964) Also in the log we get an error: 17/03/29 08:24:16 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler. 
This suggests the stage is indeed completed with all its tasks, but the UI shows as if not all tasks finished. was: Spark UI reports partial success for completed stage while log shows all tasks are finished - i.e.: We have a stage that is presented under completed stages on spark UI, but the successful tasks are shown like so: (146372/524964) not as you'd expect (524964/524964) Looking at the application master log shows all tasks in that stage are successful: 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) (524963/524964) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) (20234/20262) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 (TID 537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) (20235/20262) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 (TID 540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) (20236/20262) 17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 (TID 544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) (20237/20262) 17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 (TID 544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) (20238/20262) 17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 (TID 524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) *(524964/524964)* This looks like the stage is indeed completed with all its tasks but UI shows like not all tasks really finished.
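The {{LiveListenerBus}} error quoted in the description likely explains the mismatch: the UI counters are driven by listener events, and the bus drops events when its bounded queue fills up rather than blocking the scheduler. A minimal Python sketch of that drop-instead-of-block behavior (illustrative only; the real bus is Scala, and the default capacity of 10000 mirrors the {{spark.scheduler.listenerbus.eventqueue.size}} setting):

```python
from queue import Full, Queue

class ListenerBus:
    """Bounded event bus sketch: posting never blocks the poster.

    When listeners fall behind, events are dropped and a counter is
    bumped -- so UI state built from these events can undercount
    finished tasks even though the scheduler log saw every one."""

    def __init__(self, capacity=10000):
        self._queue = Queue(maxsize=capacity)
        self.dropped = 0

    def post(self, event):
        try:
            self._queue.put_nowait(event)
        except Full:
            self.dropped += 1  # logged as "Dropping SparkListenerEvent ..."

    def pending(self):
        return self._queue.qsize()
```

Every dropped task-end event is one task that the log counts but the UI never does, which matches a completed stage stuck at (146372/524964) in the UI.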
> Spark UI reports partial success for completed stage while log shows all > tasks are finished > --- > > Key: SPARK-20139 > URL: https://issues.apache.org/jira/browse/SPARK-20139 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Etti Gur > Attachments: screenshot-1.png > > > Spark UI reports partial success for completed stage while log shows all > tasks are finished - i.e.: > We have a stage that is presented under completed stages on spark UI, > but the successful tasks are shown like so: (146372/524964) not as you'd > expect (524964/524964) > Looking at the application master log shows all tasks in that stage are > successful: > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 > (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) > (524963/524964) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 > (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) > (20234/20262) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 > (TID 537429) in
[jira] [Resolved] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on
[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19556. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17295 [https://github.com/apache/spark/pull/17295] > Broadcast data is not encrypted when I/O encryption is on > - > > Key: SPARK-19556 > URL: https://issues.apache.org/jira/browse/SPARK-19556 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.2.0 > > > {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to > write and read data: > {code} > if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException(s"Failed to store $pieceId of $broadcastId > in local BlockManager") > } > {code} > {code} > bm.getLocalBytes(pieceId) match { > case Some(block) => > blocks(pid) = block > releaseLock(pieceId) > case None => > bm.getRemoteBytes(pieceId) match { > case Some(b) => > if (checksumEnabled) { > val sum = calcChecksum(b.chunks(0)) > if (sum != checksums(pid)) { > throw new SparkException(s"corrupt remote block $pieceId of > $broadcastId:" + > s" $sum != ${checksums(pid)}") > } > } > // We found the block from remote executors/driver's > BlockManager, so put the block > // in this executor's BlockManager. > if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException( > s"Failed to store $pieceId of $broadcastId in local > BlockManager") > } > blocks(pid) = b > case None => > throw new SparkException(s"Failed to get $pieceId of > $broadcastId") > } > } > {code} > The thing these block manager methods have in common is that they bypass the > encryption code; so broadcast data is stored unencrypted in the block > manager, causing unencrypted data to be written to disk if those blocks need > to be evicted from memory. 
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to > fix the block manager so that: > - data stored in memory is not encrypted > - data written to disk is encrypted > This would simplify the code paths that use BlockManager / SerializerManager > APIs (e.g. see SPARK-19520), but requires some tricky changes inside the > BlockManager to still be able to use file channels to avoid reading whole > blocks back into memory so they can be decrypted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934218#comment-15934218 ] Tomas Pranckevicius edited comment on SPARK-12261 at 3/29/17 12:28 PM: --- Thank you Shea for the details. These solutions will not necessarily apply to my situation, but this is clear - these solutions does not solve my problem. There is new project called IBM systemML that might solve these issues, because most probably the current version of MLlib does not support automatic optimization based on data and cluster characteristics to ensure efficiency and scalability. So lets see what is happening on Apache SystemML. More: http://systemml.apache.org was (Author: tomas pranckevicius): Thank you Shea for the details. These solutions will not necessarily apply to my situation, but this is clear - these solutions does not solve my problem. > pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File 
"D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.net.SocketException: Connection reset by peer: > socket write error > Then i ran the same code for a small text file, this time .take() worked fine. > How can i solve this problem? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on
[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19556: --- Assignee: Marcelo Vanzin > Broadcast data is not encrypted when I/O encryption is on > - > > Key: SPARK-19556 > URL: https://issues.apache.org/jira/browse/SPARK-19556 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to > write and read data: > {code} > if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException(s"Failed to store $pieceId of $broadcastId > in local BlockManager") > } > {code} > {code} > bm.getLocalBytes(pieceId) match { > case Some(block) => > blocks(pid) = block > releaseLock(pieceId) > case None => > bm.getRemoteBytes(pieceId) match { > case Some(b) => > if (checksumEnabled) { > val sum = calcChecksum(b.chunks(0)) > if (sum != checksums(pid)) { > throw new SparkException(s"corrupt remote block $pieceId of > $broadcastId:" + > s" $sum != ${checksums(pid)}") > } > } > // We found the block from remote executors/driver's > BlockManager, so put the block > // in this executor's BlockManager. > if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, > tellMaster = true)) { > throw new SparkException( > s"Failed to store $pieceId of $broadcastId in local > BlockManager") > } > blocks(pid) = b > case None => > throw new SparkException(s"Failed to get $pieceId of > $broadcastId") > } > } > {code} > The thing these block manager methods have in common is that they bypass the > encryption code; so broadcast data is stored unencrypted in the block > manager, causing unencrypted data to be written to disk if those blocks need > to be evicted from memory. 
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to > fix the block manager so that: > - data stored in memory is not encrypted > - data written to disk is encrypted > This would simplify the code paths that use BlockManager / SerializerManager > APIs (e.g. see SPARK-19520), but requires some tricky changes inside the > BlockManager to still be able to use file channels to avoid reading whole > blocks back into memory so they can be decrypted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947015#comment-15947015 ] Yash Sharma commented on SPARK-20140: - Proposing : https://github.com/apache/spark/pull/17467 Please review. > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
Yash Sharma created SPARK-20140: --- Summary: Remove hardcoded kinesis retry wait and max retries Key: SPARK-20140 URL: https://issues.apache.org/jira/browse/SPARK-20140 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.1.0 Reporter: Yash Sharma The pull request proposes to remove the hardcoded values for Amazon Kinesis - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. This change is critical for Kinesis checkpoint recovery when the Kinesis-backed RDD is huge. The following happens in a typical Kinesis recovery: - Kinesis throttles a large number of requests while recovering - retries in case of throttling are not able to recover due to the small wait period - Kinesis throttles per second, so the wait period should be configurable for recovery The patch picks the Spark Kinesis configs from: - spark.streaming.kinesis.retry.wait.time - spark.streaming.kinesis.retry.max.attempts
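The shape of the fix -- retry parameters read from configuration rather than hardcoded constants -- can be sketched as follows. This is a hypothetical Python sketch; the function name, the exponential backoff, and the injectable sleep are illustrative, not the actual Kinesis receiver code.

```python
import time

def retry_with_backoff(op, wait_ms=100, max_retries=3, sleep=time.sleep):
    """Run `op`, retrying failures with exponential backoff.

    `wait_ms` and `max_retries` would come from configuration (e.g. the
    proposed spark.streaming.kinesis.retry.wait.time /
    spark.streaming.kinesis.retry.max.attempts) instead of the hardcoded
    MIN_RETRY_WAIT_TIME_MS / MAX_RETRIES constants the ticket describes."""
    attempt = 0
    while True:
        try:
            return op()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # out of retries: surface the throttling error
            # Exponential backoff: wait, 2*wait, 4*wait, ...
            sleep(wait_ms * (2 ** (attempt - 1)) / 1000.0)
```

With a configurable wait, a recovery that is throttled per second can back off past the throttling window instead of burning its few retries in milliseconds.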
[jira] [Updated] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished
[ https://issues.apache.org/jira/browse/SPARK-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etti Gur updated SPARK-20139: - Attachment: screenshot-1.png > Spark UI reports partial success for completed stage while log shows all > tasks are finished > --- > > Key: SPARK-20139 > URL: https://issues.apache.org/jira/browse/SPARK-20139 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Etti Gur > Attachments: screenshot-1.png > > > Spark UI reports partial success for completed stage while log shows all > tasks are finished - i.e.: > We have a stage that is presented under completed stages on spark UI, > but the successful tasks are shown like so: (146372/524964) not as you'd > expect (524964/524964) > Looking at the application master log shows all tasks in that stage are > successful: > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 > (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) > (524963/524964) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 > (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) > (20234/20262) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 > (TID 537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) > (20235/20262) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 > (TID 540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) > (20236/20262) > 17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 > (TID 544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) > (20237/20262) > 17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 > (TID 544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) > (20238/20262) > 17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 > (TID 524146) in 889950 ms on ip-10-1-15-72.ec2.internal 
(executor 74) > *(524964/524964)* > This looks like the stage is indeed completed with all its tasks but UI shows > like not all tasks really finished. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished
Etti Gur created SPARK-20139:
--------------------------------

             Summary: Spark UI reports partial success for completed stage while log shows all tasks are finished
                 Key: SPARK-20139
                 URL: https://issues.apache.org/jira/browse/SPARK-20139
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 2.1.0
            Reporter: Etti Gur

Spark UI reports partial success for a completed stage while the log shows all its tasks finished, i.e.:
We have a stage that is listed under completed stages on the Spark UI, but its successful task count is shown as (146372/524964), not (524964/524964) as you'd expect.
Looking at the application master log shows all tasks in that stage are successful:
17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) (524963/524964)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) (20234/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 (TID 537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) (20235/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 (TID 540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) (20236/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 (TID 544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) (20237/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 (TID 544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) (20238/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 (TID 524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) *(524964/524964)*
This looks like the stage is indeed completed with all its tasks, but the UI shows as if not all tasks finished.
[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang
[ https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946882#comment-15946882 ]

Emlyn Corrin commented on SPARK-18971:
--------------------------------------
Will this fix go into Spark 2.1.1?

> Netty issue may cause the shuffle client hang
> ---------------------------------------------
>
>                 Key: SPARK-18971
>                 URL: https://issues.apache.org/jira/browse/SPARK-18971
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>            Priority: Minor
>             Fix For: 2.2.0
>
> Check https://github.com/netty/netty/issues/6153 for details.
> You should be able to see a stack trace similar to the following in the
> executor thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
>     at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
>     at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
>     at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
>     at io.netty.util.Recycler.get(Recycler.java:144)
>     at io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
>     at io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
>     at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
>     at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>     at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>     at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>     at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
>     at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
>     at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
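The hang above surfaces only as Recycler frames in a thread dump. As a hedged illustration (this is not Spark or Netty code, and the class and method names are our own invention), a small Java helper can scan the running JVM's threads for frames inside `io.netty.util.Recycler` — the only detail taken from the report is that frame's class-name prefix:

```java
import java.util.Map;

// Hypothetical diagnostic helper: scans the current JVM's thread dump for
// any frame inside io.netty.util.Recycler, the symptom described above.
public class RecyclerHangDetector {

    // Returns true if any live thread currently has a stack frame whose
    // declaring class starts with "io.netty.util.Recycler"
    // (e.g. Recycler$Stack.scavengeSome).
    public static boolean anyThreadInRecycler() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : e.getValue()) {
                if (frame.getClassName().startsWith("io.netty.util.Recycler")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // In a process that is not running Netty this prints "false".
        System.out.println(anyThreadInRecycler());
    }
}
```

In a real deployment the same check could be run from a periodic watchdog thread on an executor, but whether that is appropriate depends on the cluster; the sketch only demonstrates the thread-dump pattern to look for.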
[jira] [Commented] (SPARK-20138) Add imports to snippets in Spark SQL, DataFrames and Datasets Guide doc
[ https://issues.apache.org/jira/browse/SPARK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946818#comment-15946818 ]

Sean Owen commented on SPARK-20138:
-----------------------------------
I think the Spark imports were purposely excluded for brevity. However, there may be some cases where the imports aren't obvious because they aren't from Spark. I think that could be valuable to add in some cases.

> Add imports to snippets in Spark SQL, DataFrames and Datasets Guide doc
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20138
>                 URL: https://issues.apache.org/jira/browse/SPARK-20138
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, SQL
>    Affects Versions: 2.2.0
>            Reporter: Jacek Laskowski
>            Priority: Trivial
>
> Given [the question on StackOverflow|http://stackoverflow.com/q/43089100/1305344] it seems it'd be helpful to add imports to the snippets to make _some_ people's lives easier.
> {quote}
> When I try to load data using the second method in the link, I get the following error.
> scala> val connectionProperties = new Properties()
> :44: error: not found: type Properties
>        val connectionProperties = new Properties()
> {quote}
[jira] [Created] (SPARK-20138) Add imports to snippets in Spark SQL, DataFrames and Datasets Guide doc
Jacek Laskowski created SPARK-20138:
------------------------------------

             Summary: Add imports to snippets in Spark SQL, DataFrames and Datasets Guide doc
                 Key: SPARK-20138
                 URL: https://issues.apache.org/jira/browse/SPARK-20138
             Project: Spark
          Issue Type: Improvement
          Components: Documentation, SQL
    Affects Versions: 2.2.0
            Reporter: Jacek Laskowski
            Priority: Trivial

Given [the question on StackOverflow|http://stackoverflow.com/q/43089100/1305344] it seems it'd be helpful to add imports to the snippets to make _some_ people's lives easier.
{quote}
When I try to load data using the second method in the link, I get the following error.
scala> val connectionProperties = new Properties()
:44: error: not found: type Properties
       val connectionProperties = new Properties()
{quote}
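The "not found: type Properties" error quoted in this ticket is the missing `java.util.Properties` import. A minimal, self-contained sketch of what the guide's JDBC-options snippet assumes (written in Java rather than the REPL's Scala; the connection values are placeholders, not real credentials):

```java
// The single import the REPL session above was missing.
import java.util.Properties;

public class ConnectionPropertiesExample {
    public static void main(String[] args) {
        // With the import in place, this line — the one that failed in the
        // ticket — compiles fine.
        Properties connectionProperties = new Properties();
        connectionProperties.put("user", "username");      // placeholder
        connectionProperties.put("password", "password");  // placeholder

        // The Properties object would then be passed to e.g.
        // spark.read().jdbc(url, table, connectionProperties).
        System.out.println(connectionProperties.getProperty("user"));
    }
}
```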
[jira] [Commented] (SPARK-20135) spark thriftserver2: no job running but containers not release on yarn
[ https://issues.apache.org/jira/browse/SPARK-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946748#comment-15946748 ]

Sean Owen commented on SPARK-20135:
-----------------------------------
There isn't enough detail here. It may be normal operation depending on your timeouts and settings. It isn't even clear you have enabled dynamic allocation. The mailing list is the right place to start with questions.

> spark thriftserver2: no job running but containers not release on yarn
> ----------------------------------------------------------------------
>
>                 Key: SPARK-20135
>                 URL: https://issues.apache.org/jira/browse/SPARK-20135
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>        Environment: spark 2.0.1 with hadoop 2.6.0
>            Reporter: bruce xu
>        Attachments: 0329-1.png, 0329-2.png, 0329-3.png
>
> I enabled the executor dynamic allocation feature; however, it doesn't work sometimes.
> I set the initial executor count to 50; after the job finished, the cores and memory were not released.
> On the Spark web UI, the active job/running task/stage count is 0, but the executors page shows 1276 cores and 7288 active tasks.
> On the YARN web UI, the Thrift Server job still holds 639 running containers without releasing them.
> This may be a bug.
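To make Sean Owen's point concrete, here is a hedged sketch of the settings typically needed before executors are released back to YARN: dynamic allocation itself, the external shuffle service it depends on, and an idle timeout. The values are illustrative only (the initial count mirrors the reporter's 50), not a recommendation for this cluster:

```shell
# Illustrative launch of the Thrift Server with dynamic allocation enabled.
# spark.shuffle.service.enabled requires the external shuffle service to be
# running on the YARN NodeManagers; without it, dynamic allocation won't work.
./sbin/start-thriftserver.sh \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=50 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s
```

If executors still hold active tasks according to the UI (as in the screenshots attached to this issue), the idle timeout never fires — which is why the comment asks for more detail before treating this as a bug.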