[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333221#comment-15333221 ] Sean Zhong commented on SPARK-14048: [~simeons] Can you try the following script in your environment and see if it reports an error? {code} val rdd = sc.makeRDD( """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 20}""" :: Nil) sqlContext.read.json(rdd).registerTempTable("test") sqlContext.sql("select first(st) as st from test group by age").show() {code} > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schemas such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}}, always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > 
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
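For what it's worth, a non-aggregate projection of the same field works with backtick quoting, which narrows the failure to the aggregation path. A small sketch against the repro data above (assuming the same {{sqlContext}} and the registered {{test}} table):

{code}
// Non-aggregate access to the special-character field succeeds with backticks:
sqlContext.sql("select st.`x.y` from test").show()
sqlContext.sql("select age, st.`x.y` + 1 from test").show()

// The reported failure only appears once an aggregate returns the whole struct:
sqlContext.sql("select first(st) as st from test group by age").show()
{code}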
[jira] [Commented] (SPARK-15982) DataFrameReader.orc() should support varargs like json, csv, and parquet
[ https://issues.apache.org/jira/browse/SPARK-15982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333218#comment-15333218 ] Sandeep Singh commented on SPARK-15982: --- [~tdas] I can take this up if you have not started on it already. > DataFrameReader.orc() should support varargs like json, csv, and parquet > > > Key: SPARK-15982 > URL: https://issues.apache.org/jira/browse/SPARK-15982 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tathagata Das >Assignee: Tathagata Das >
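For context, {{json}}, {{csv}}, and {{parquet}} already accept multiple paths, while {{orc}} takes only one. A hedged sketch of what the aligned signature would look like, inferred from the existing varargs readers rather than from an actual patch:

{code}
// Existing varargs readers on DataFrameReader (Spark 2.0):
//   def json(paths: String*): DataFrame
//   def parquet(paths: String*): DataFrame

// The analogous ORC signature would be:
def orc(paths: String*): DataFrame = format("orc").load(paths: _*)
{code}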
[jira] [Created] (SPARK-15984) WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 8 is invalid" when starting application on YARN
Jacek Laskowski created SPARK-15984: --- Summary: WARN message "o.a.h.y.s.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 8 is invalid" when starting application on YARN Key: SPARK-15984 URL: https://issues.apache.org/jira/browse/SPARK-15984 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.0.0 Reporter: Jacek Laskowski Priority: Minor When executing {{spark-shell}} on Spark on YARN 2.7.2 on Mac OS as follows: {code} YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn -c spark.shuffle.service.enabled=true --deploy-mode client -c spark.scheduler.mode=FAIR {code} it ends up with the following WARN in the logs: {code} 2016-06-16 08:33:05,308 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 8 2016-06-16 08:33:07,305 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 8 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
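The WARN indicates a max-attempts value of 0 reaching the ResourceManager. As a hedged illustration only (not a confirmed fix for this ticket), explicitly pinning the per-application attempt count inside YARN's valid range avoids the fallback to the global maximum:

{code}
// Sketch: spark.yarn.maxAppAttempts must satisfy
// 1 <= value <= yarn.resourcemanager.am.max-attempts (the global max, 2 here).
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.maxAppAttempts", "2")
{code}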
[jira] [Resolved] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-15786. Resolution: Duplicate Fix Version/s: 2.0.0 > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher >Assignee: Sean Zhong > Fix For: 2.0.0 > > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to get an > approximation of the RDD semantics for outer joins with Dataset to have a > nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function which expects byte[] with or without a couple int qualifiers. > I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15786: --- Assignee: Sean Zhong > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher >Assignee: Sean Zhong > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to get an > approximation of the RDD semantics for outer joins with Dataset to have a > nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function which expects byte[] with or without a couple int qualifiers. > I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333205#comment-15333205 ] Sean Zhong commented on SPARK-15786: Hi [~rmarscher] The reason is that the Kryo encoder is being used in the wrong way: you cannot cast {{Dataset\[A\]}} to {{Dataset\[B\]}} using a Kryo encoder. In Spark 2.0, we now have a stricter rule to detect this kind of cast failure after PR https://github.com/apache/spark/pull/13632. Now, it reports an error like this: {code} scala> ds.as[(Option[(Int, Int)], Option[(Int, Int)])] org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`_1` AS BINARY)' due to data type mismatch: cannot cast StructType(StructField(_1,IntegerType,false), StructField(_2,IntegerType,false)) to BinaryType; {code} If you remove all the Kryo-related lines in your test code, then no exception should be thrown and your code should work fine. > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to get an > approximation of the RDD semantics for outer joins with Dataset to have a > nicer API for Scala. 
However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function which expects byte[] with or without a couple int qualifiers. > I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html
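A minimal Scala sketch of the distinction described in the comment above, with assumed example data; the commented-out line is the misuse:

{code}
import spark.implicits._
import org.apache.spark.sql.Encoders

val ds = Seq((1, 10), (2, 20)).toDS()
// joinWith yields struct-typed columns, not serialized binary blobs:
val joined = ds.joinWith(ds, ds("_1") === ds("_1"))

// Misuse: a Kryo encoder reads its column as BINARY, so casting struct-typed
// columns through it fails analysis in 2.0 ("cannot cast StructType(...) to BinaryType"):
// joined.as(Encoders.kryo[((Int, Int), (Int, Int))])

// Intended use: the implicit product encoder matches the struct columns directly:
val ok = joined.as[((Int, Int), (Int, Int))]
{code}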
[jira] [Resolved] (SPARK-10618) Refactoring and adding test for Mesos coarse-grained Scheduler
[ https://issues.apache.org/jira/browse/SPARK-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10618. --- Resolution: Won't Fix > Refactoring and adding test for Mesos coarse-grained Scheduler > -- > > Key: SPARK-10618 > URL: https://issues.apache.org/jira/browse/SPARK-10618 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Akash Mishra >Priority: Trivial > > The various conditions for checking whether a Mesos offer is valid for > scheduling are cluttered in the resourceOffer method and have no unit tests. > This is a refactoring JIRA to extract the logic into a method and test that > method.
[jira] [Comment Edited] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator
[ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333198#comment-15333198 ] SuYan edited comment on SPARK-15815 at 6/16/16 6:24 AM: Eh... yes, there is still uncertainty about getting another executor. Could we wait some time, like 5 or 10 minutes, before deciding to abort the task set? Actually, for me the primary goal is to make sure the job finishes successfully rather than failing and giving up the sunk cost, so I would prefer to reset the condition that makes the job hang: re-enable dynamic allocation, kill the blacklisted executor and request a new one, or wait for an executor to be allocated even if that takes some time due to resource shortage. was (Author: suyan): eh...yes, still have the uncertainty to got another executors, how can we wait some time to decide to abort the tasksets...like 5 min, or 10 min. actually, For me the primary task is to make sure the job can finished successful instead of failed and give up the sunk cost, so I prefer to reset the condition to make job hang, like make dynamic active again, or kill the blacklist Executor and request new, or wait executor to be allocated even if need to wait some time due to resource shortage. > Hang while enable blacklistExecutor and DynamicExecutorAllocator > - > > Key: SPARK-15815 > URL: https://issues.apache.org/jira/browse/SPARK-15815 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: SuYan >Priority: Minor > > Enable executor blacklisting with a timeout larger than 120s, and dynamic > allocation with minExecutors = 0. > 1. Assume only 1 task is left running, on executor A, and all other executors > have timed out. > 2. The task fails, so it will not be scheduled on executor A again because of > the blacklist timeout. > 3. The ExecutorAllocationManager always requests targetNumExecutors = 1; since > we already have executor A, oldTargetNumExecutors == targetNumExecutors = 1, so > it will never add more executors, even after executor A times out. It becomes > an endless request for delta = 0 executors.
[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator
[ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333198#comment-15333198 ] SuYan commented on SPARK-15815: --- Eh... yes, there is still uncertainty about getting another executor. Could we wait some time, like 5 or 10 minutes, before deciding to abort the task set? Actually, for me the primary goal is to make sure the job finishes successfully rather than failing and giving up the sunk cost, so I would prefer to reset the condition that makes the job hang: re-enable dynamic allocation, kill the blacklisted executor and request a new one, or wait for an executor to be allocated even if that takes some time due to resource shortage. > Hang while enable blacklistExecutor and DynamicExecutorAllocator > - > > Key: SPARK-15815 > URL: https://issues.apache.org/jira/browse/SPARK-15815 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: SuYan >Priority: Minor > > Enable executor blacklisting with a timeout larger than 120s, and dynamic > allocation with minExecutors = 0. > 1. Assume only 1 task is left running, on executor A, and all other executors > have timed out. > 2. The task fails, so it will not be scheduled on executor A again because of > the blacklist timeout. > 3. The ExecutorAllocationManager always requests targetNumExecutors = 1; since > we already have executor A, oldTargetNumExecutors == targetNumExecutors = 1, so > it will never add more executors, even after executor A times out. It becomes > an endless request for delta = 0 executors.
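For reproduction purposes, the combination described above corresponds to roughly this configuration (a sketch; {{spark.scheduler.executorTaskBlacklistTime}} is the 1.6-era blacklist setting, in milliseconds):

{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.scheduler.executorTaskBlacklistTime", "180000") // > 120s
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
{code}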
[jira] [Commented] (SPARK-12113) Add timing metrics to blocking phases for spark sql
[ https://issues.apache.org/jira/browse/SPARK-12113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333160#comment-15333160 ] Takeshi Yamamuro commented on SPARK-12113: -- [~rxin] okay, I'll rework based on the #10116 patch. > Add timing metrics to blocking phases for spark sql > --- > > Key: SPARK-12113 > URL: https://issues.apache.org/jira/browse/SPARK-12113 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li > > It's currently not easy to look at the SQL page and get any sense of how long > different parts of the plan take. This is in general difficult to do with > row-at-a-time pipelining. We can, however, include timing information for the > blocking phases. Including these will be useful to get a sense of what is > going on.
[jira] [Updated] (SPARK-15983) Remove FileFormat.prepareRead()
[ https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15983: --- Summary: Remove FileFormat.prepareRead() (was: Remove FileFormat.prepareRead) > Remove FileFormat.prepareRead() > --- > > Key: SPARK-15983 > URL: https://issues.apache.org/jira/browse/SPARK-15983 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Interface method {{FileFormat.prepareRead()}} was added in [PR > #12088|https://github.com/apache/spark/pull/12088] to handle a special case > in the LibSVM data source. > However, the semantics of this interface method isn't intuitive: it returns a > modified version of the data source options map. Considering that the LibSVM > case can be easily handled using schema metadata inside {{inferSchema}}, we > can remove this interface method to keep the {{FileFormat}} interface clean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15983) Remove FileFormat.prepareRead
[ https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15983: Assignee: Cheng Lian (was: Apache Spark) > Remove FileFormat.prepareRead > - > > Key: SPARK-15983 > URL: https://issues.apache.org/jira/browse/SPARK-15983 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Interface method {{FileFormat.prepareRead()}} was added in [PR > #12088|https://github.com/apache/spark/pull/12088] to handle a special case > in the LibSVM data source. > However, the semantics of this interface method isn't intuitive: it returns a > modified version of the data source options map. Considering that the LibSVM > case can be easily handled using schema metadata inside {{inferSchema}}, we > can remove this interface method to keep the {{FileFormat}} interface clean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15983) Remove FileFormat.prepareRead
[ https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15983: Assignee: Apache Spark (was: Cheng Lian) > Remove FileFormat.prepareRead > - > > Key: SPARK-15983 > URL: https://issues.apache.org/jira/browse/SPARK-15983 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > Interface method {{FileFormat.prepareRead()}} was added in [PR > #12088|https://github.com/apache/spark/pull/12088] to handle a special case > in the LibSVM data source. > However, the semantics of this interface method isn't intuitive: it returns a > modified version of the data source options map. Considering that the LibSVM > case can be easily handled using schema metadata inside {{inferSchema}}, we > can remove this interface method to keep the {{FileFormat}} interface clean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15983) Remove FileFormat.prepareRead
[ https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333141#comment-15333141 ] Apache Spark commented on SPARK-15983: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/13698 > Remove FileFormat.prepareRead > - > > Key: SPARK-15983 > URL: https://issues.apache.org/jira/browse/SPARK-15983 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Interface method {{FileFormat.prepareRead()}} was added in [PR > #12088|https://github.com/apache/spark/pull/12088] to handle a special case > in the LibSVM data source. > However, the semantics of this interface method isn't intuitive: it returns a > modified version of the data source options map. Considering that the LibSVM > case can be easily handled using schema metadata inside {{inferSchema}}, we > can remove this interface method to keep the {{FileFormat}} interface clean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15983) Remove FileFormat.prepareRead
[ https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15983: --- Description: Interface method {{FileFormat.prepareRead()}} was added in [PR #12088|https://github.com/apache/spark/pull/12088] to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside {{inferSchema}}, we can remove this interface method to keep the {{FileFormat}} interface clean. was: Interface method {{FileFormat.prepareRead()}} was added in [PR #12088|https://github.com/apache/spark/pull/12088] to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside inferSchema, we can remove this interface method to keep the FileFormat interface clean. > Remove FileFormat.prepareRead > - > > Key: SPARK-15983 > URL: https://issues.apache.org/jira/browse/SPARK-15983 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Interface method {{FileFormat.prepareRead()}} was added in [PR > #12088|https://github.com/apache/spark/pull/12088] to handle a special case > in the LibSVM data source. > However, the semantics of this interface method isn't intuitive: it returns a > modified version of the data source options map. Considering that the LibSVM > case can be easily handled using schema metadata inside {{inferSchema}}, we > can remove this interface method to keep the {{FileFormat}} interface clean. 
[jira] [Created] (SPARK-15983) Remove FileFormat.prepareRead
Cheng Lian created SPARK-15983: -- Summary: Remove FileFormat.prepareRead Key: SPARK-15983 URL: https://issues.apache.org/jira/browse/SPARK-15983 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Interface method {{FileFormat.prepareRead()}} was added in [PR #12088|https://github.com/apache/spark/pull/12088] to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside inferSchema, we can remove this interface method to keep the FileFormat interface clean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
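A hedged sketch of the alternative the description points to: rather than returning a modified options map from {{prepareRead()}}, {{inferSchema}} can stash what it learns (e.g. the LibSVM feature count) in column metadata on the returned schema. Names and values here are illustrative, not taken from the actual patch:

{code}
import org.apache.spark.sql.types._

// Inside inferSchema: record the discovered feature count as schema metadata.
val meta = new MetadataBuilder().putLong("numFeatures", 692L).build()
val schema = StructType(Seq(
  StructField("label", DoubleType, nullable = false),
  // ArrayType(DoubleType) stands in for the actual vector type here:
  StructField("features", ArrayType(DoubleType), nullable = false, metadata = meta)))

// Later, the read path recovers the value from the schema instead of the options map:
val numFeatures = schema("features").metadata.getLong("numFeatures")
{code}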
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333125#comment-15333125 ] Narine Kokhlikyan edited comment on SPARK-12922 at 6/16/16 5:25 AM: FYI, [~olarayej], [~aloknsingh], [~vijayrb] :) was (Author: narine): FYI, [~olarayej], [~aloknsingh], [~vijayrb]! > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui >Assignee: Narine Kokhlikyan > Fix For: 2.0.0 > > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333125#comment-15333125 ] Narine Kokhlikyan commented on SPARK-12922: --- FYI, [~olarayej], [~aloknsingh], [~vijayrb]! > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui >Assignee: Narine Kokhlikyan > Fix For: 2.0.0 > > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
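For comparison, the {{flatMapGroups}} analogue mentioned in the description looks like this in Scala (example data assumed):

{code}
import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// Like gapply's R function, flatMapGroups receives the grouping key and the
// grouped rows, and may emit any number of output records:
val out = ds.groupByKey(_._1).flatMapGroups { (key, rows) =>
  Iterator((key, rows.map(_._2).sum))
}
{code}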
[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
[ https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333119#comment-15333119 ] Aamir Abbas commented on SPARK-15919: - I have tried the solution you suggested, i.e., the window() function. Here's my code. {code} Duration batchInterval = new Duration(30); // 5 minutes javaStream.window(batchInterval, batchInterval).dstream().saveAsTextFiles(getBaseOutputPath(), ""); {code} The actual output of this snippet is that it gets the base output path once, creates folders in that path, and saves each record from the RDDs as a separate file. The expected output was to get a new base output path every time the window() function is applied, and to save all the records from the RDDs in a single file. Please let me know if I am applying the window() function incorrectly, and how to do it correctly. > DStream "saveAsTextFile" doesn't update the prefix after each checkpoint > > > Key: SPARK-15919 > URL: https://issues.apache.org/jira/browse/SPARK-15919 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.1 > Environment: Amazon EMR >Reporter: Aamir Abbas > > I have a Spark streaming job that reads a data stream and saves it as a text > file after a predefined time interval. In the call > stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), ""); > the function getOutputPath() generates a new path every time it is called, > depending on the current system time. > However, the output path prefix remains the same for all the batches, which > effectively means that the function is not called again for the next batch of > the stream, although the files are being saved after each checkpoint interval.
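One common workaround for the underlying limitation (offered as a sketch, not as the resolution of this ticket): {{saveAsTextFiles}} evaluates its prefix argument once when the streaming graph is built, so a path that should change per batch has to be computed inside {{foreachRDD}}, which runs once for every batch. {{getBaseOutputPath(long)}} below is an assumed helper that derives a path from the batch timestamp:

{code}
stream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // Compute the output path per batch, using the batch time:
    rdd.repartition(1).saveAsTextFile(getBaseOutputPath(time.milliseconds))
  }
}
{code}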
[jira] [Assigned] (SPARK-15981) Fix bug in python DataStreamReader
[ https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15981: Assignee: Tathagata Das (was: Apache Spark) > Fix bug in python DataStreamReader > -- > > Key: SPARK-15981 > URL: https://issues.apache.org/jira/browse/SPARK-15981 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > Bug in Python DataStreamReader API made it unusable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15981) Fix bug in python DataStreamReader
[ https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333113#comment-15333113 ] Apache Spark commented on SPARK-15981: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/13703 > Fix bug in python DataStreamReader > -- > > Key: SPARK-15981 > URL: https://issues.apache.org/jira/browse/SPARK-15981 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > Bug in Python DataStreamReader API made it unusable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15981) Fix bug in python DataStreamReader
[ https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15981: Assignee: Apache Spark (was: Tathagata Das) > Fix bug in python DataStreamReader > -- > > Key: SPARK-15981 > URL: https://issues.apache.org/jira/browse/SPARK-15981 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Blocker > > Bug in Python DataStreamReader API made it unusable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MIN-FU YANG updated SPARK-15906: Description: Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf Mahout & WEKA both have Complementary Naive Bayes implementations. https://mahout.apache.org/users/classification/bayesian.html http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html Besides, this paper is referenced by other papers & books 600+ times, so I think its results are solid. https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567&safe=high&um=1&ie=UTF-8&lr&cites=1197073324019480518 was: Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf Mahout & WEKA both have Complementary Naive Bayes implementations. https://mahout.apache.org/users/classification/bayesian.html http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html Besides, this paper is referenced by other papers & books 600+ times, so I think its results are solid. > Complementary Naive Bayes Algorithm Implementation > -- > > Key: SPARK-15906 > URL: https://issues.apache.org/jira/browse/SPARK-15906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: MIN-FU YANG >Priority: Minor > > Improve the Naive Bayes algorithm on skewed data according to > "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 > http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf > Mahout & WEKA both have Complementary Naive Bayes implementations. 
> https://mahout.apache.org/users/classification/bayesian.html > http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html > Besides, this paper is referenced by other papers & books 600+ times, I think > its results are solid. > https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567&safe=high&um=1&ie=UTF-8&lr&cites=1197073324019480518 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MIN-FU YANG updated SPARK-15906: Description: Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf Mahout & WEKA both have Complementary Naive Bayes implementations. https://mahout.apache.org/users/classification/bayesian.html http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html Besides, this paper is referenced by other papers & books 600+ times, so I think its results are solid. was: Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf > Complementary Naive Bayes Algorithm Implementation > -- > > Key: SPARK-15906 > URL: https://issues.apache.org/jira/browse/SPARK-15906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: MIN-FU YANG >Priority: Minor > > Improve the Naive Bayes algorithm on skewed data according to > "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 > http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf > Mahout & WEKA both have Complementary Naive Bayes implementations. > https://mahout.apache.org/users/classification/bayesian.html > http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html > Besides, this paper is referenced by other papers & books 600+ times, I think > its results are solid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333098#comment-15333098 ] MIN-FU YANG commented on SPARK-15906: - OK, the description is updated. > Complementary Naive Bayes Algorithm Implementation > -- > > Key: SPARK-15906 > URL: https://issues.apache.org/jira/browse/SPARK-15906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: MIN-FU YANG >Priority: Minor > > Improve the Naive Bayes algorithm on skewed data according to > "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 > http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf > Mahout & WEKA both have Complementary Naive Bayes implementations. > https://mahout.apache.org/users/classification/bayesian.html > http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html > Besides, this paper is referenced by other papers & books 600+ times, I think > its results are solid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
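For readers unfamiliar with the proposal, the core idea of Complement Naive Bayes from the Rennie et al. paper can be sketched in plain Python. This is a simplified illustration, not the proposed MLlib implementation; it assumes uniform class priors and Laplace smoothing:

```python
import math
from collections import Counter

def train_cnb(docs, labels, alpha=1.0):
    """Estimate each class's word weights from the documents that are
    NOT in that class (the 'complement'), per Rennie et al. (2003).
    This is what makes the estimator robust to skewed class sizes."""
    vocab = sorted({w for d in docs for w in d})
    weights = {}
    for c in sorted(set(labels)):
        counts = Counter()
        for d, y in zip(docs, labels):
            if y != c:
                counts.update(d)
        total = sum(counts.values()) + alpha * len(vocab)
        weights[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return weights

def predict_cnb(weights, doc):
    # A document fits class c best when it matches c's *complement* worst,
    # so we pick the class with the smallest summed complement weight.
    scores = {c: sum(w.get(t, 0.0) for t in doc) for c, w in weights.items()}
    return min(scores, key=scores.get)

# Tiny example corpus (illustrative data only).
docs = [["spark", "scala"], ["spark", "rdd"], ["ball", "goal"], ["goal", "team"]]
labels = ["tech", "tech", "sport", "sport"]
w = train_cnb(docs, labels)
```

With this corpus, `predict_cnb(w, ["spark", "rdd"])` returns "tech": the words are rare in tech's complement (the sport documents), so tech's complement score is lowest.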
[jira] [Created] (SPARK-15982) DataFrameReader.orc() should support varargs like json, csv, and parquet
Tathagata Das created SPARK-15982: - Summary: DataFrameReader.orc() should support varargs like json, csv, and parquet Key: SPARK-15982 URL: https://issues.apache.org/jira/browse/SPARK-15982 Project: Spark Issue Type: Bug Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12922. --- Resolution: Fixed Assignee: Narine Kokhlikyan Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/12836 > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui >Assignee: Narine Kokhlikyan > Fix For: 2.0.0 > > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
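The gapply() contract described above (grouping key plus a local data frame in, local data frame out, results concatenated) can be sketched in plain Python, independent of SparkR's actual implementation:

```python
from itertools import groupby

def gapply(rows, key, func):
    """Group rows by the key function, apply func(grouping_key, group)
    to each group, and concatenate the rows each call returns."""
    rows = sorted(rows, key=key)  # itertools.groupby needs sorted input
    out = []
    for k, grp in groupby(rows, key=key):
        out.extend(func(k, list(grp)))
    return out

# Per-group aggregation: sum the second column within each key.
rows = [("a", 1), ("b", 2), ("a", 3)]
result = gapply(rows, key=lambda r: r[0],
                func=lambda k, grp: [(k, sum(v for _, v in grp))])
```

As in the SparkR design, the applied function may return any number of rows per group, which is why the results are concatenated rather than zipped back.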
[jira] [Created] (SPARK-15981) Fix bug in python DataStreamReader
Tathagata Das created SPARK-15981: - Summary: Fix bug in python DataStreamReader Key: SPARK-15981 URL: https://issues.apache.org/jira/browse/SPARK-15981 Project: Spark Issue Type: Sub-task Components: SQL, Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Bug in Python DataStreamReader API made it unusable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.
[ https://issues.apache.org/jira/browse/SPARK-15980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15980: Assignee: (was: Apache Spark) > Add PushPredicateThroughObjectConsumer rule to Optimizer. > - > > Key: SPARK-15980 > URL: https://issues.apache.org/jira/browse/SPARK-15980 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates > through {{ObjectConsumer}}. > And as an example, I implemented push-down typed filter through > {{SerializeFromObject}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.
[ https://issues.apache.org/jira/browse/SPARK-15980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15980: Assignee: Apache Spark > Add PushPredicateThroughObjectConsumer rule to Optimizer. > - > > Key: SPARK-15980 > URL: https://issues.apache.org/jira/browse/SPARK-15980 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin >Assignee: Apache Spark > > I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates > through {{ObjectConsumer}}. > And as an example, I implemented push-down typed filter through > {{SerializeFromObject}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.
[ https://issues.apache.org/jira/browse/SPARK-15980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333089#comment-15333089 ] Apache Spark commented on SPARK-15980: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/13702 > Add PushPredicateThroughObjectConsumer rule to Optimizer. > - > > Key: SPARK-15980 > URL: https://issues.apache.org/jira/browse/SPARK-15980 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates > through {{ObjectConsumer}}. > And as an example, I implemented push-down typed filter through > {{SerializeFromObject}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.
Takuya Ueshin created SPARK-15980: - Summary: Add PushPredicateThroughObjectConsumer rule to Optimizer. Key: SPARK-15980 URL: https://issues.apache.org/jira/browse/SPARK-15980 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates through {{ObjectConsumer}}. And as an example, I implemented push-down typed filter through {{SerializeFromObject}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333076#comment-15333076 ] Simeon Simeonov commented on SPARK-14048: - Yes, I get the exact same failure with 1.6.1. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > 
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
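The failure mode above is consistent with a struct type being rendered to a string and then re-parsed without quoting special field names. A simplified illustration in plain Python (this is not Spark's actual DataTypeParser, just a sketch of the round-trip problem):

```python
import re

def to_type_string(fields, quote=False):
    """Render a struct schema, e.g. struct<`x.y`:bigint>. Without
    quoting, a field name containing '.' is emitted ambiguously."""
    parts = []
    for name, typ in fields:
        if quote and not re.fullmatch(r"\w+", name):
            name = "`%s`" % name  # backtick-quote special field names
        parts.append("%s:%s" % (name, typ))
    return "struct<%s>" % ",".join(parts)

def parse_type_string(s):
    """Re-parse the rendered schema; unquoted special names are rejected,
    mirroring the DataTypeException in the stack trace above."""
    fields = []
    for part in s[len("struct<"):-1].split(","):
        name, _, typ = part.partition(":")
        if not (name.startswith("`") or re.fullmatch(r"\w+", name)):
            raise ValueError("Unsupported dataType: %s" % s)
        fields.append((name.strip("`"), typ))
    return fields
```

Rendering with quote=True round-trips cleanly; rendering without quoting produces a string the parser rejects, which is the escaping gap the report describes.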
[jira] [Comment Edited] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333076#comment-15333076 ] Simeon Simeonov edited comment on SPARK-14048 at 6/16/16 4:46 AM: -- Yes, I get the exact same failure with 1.6.1 running on Databricks. was (Author: simeons): Yes, I get the exact same failure with 1.6.1. > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. Please note that backtick itself is not supported in a field name. 
> at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > 
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15817) Spark client picking hive 1.2.1 by default which failed to alter a table name
[ https://issues.apache.org/jira/browse/SPARK-15817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333069#comment-15333069 ] Nataraj Gorantla commented on SPARK-15817: -- Can someone please provide an update? Thanks, Nataraj > Spark client picking hive 1.2.1 by default which failed to alter a table name > - > > Key: SPARK-15817 > URL: https://issues.apache.org/jira/browse/SPARK-15817 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.1 >Reporter: Nataraj Gorantla > > Some of our Scala scripts are failing with the error below. > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid > method name: 'alter_table_with_cascade' > msg: org.apache.spark.sql.execution.QueryExecutionException: FAILED: > Spark, when invoked, tries to initialize Hive 1.2.1 by default. We have Hive > 0.14 installed. Some background investigation on our side explains this. > Analysis > The "alter_table_with_cascade" error occurs because of a metastore version mismatch > in Spark. > To correct this error, set the proper metastore version in the Spark config. > I tried to add a couple of parameters to the spark-default-conf file. > spark.sql.hive.metastore.version 0.14.0 > #spark.sql.hive.metastore.jars maven > spark.sql.hive.metastore.jars =/usr/hdp/current/hive-client/lib > Still I see issues. Can you please let me know if you have any alternative to > fix this issue. > Thanks, > Nataraj G -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15824) Run 'with ... insert ... select' failed when use spark thriftserver
[ https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15824. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13678 [https://github.com/apache/spark/pull/13678] > Run 'with ... insert ... select' failed when use spark thriftserver > --- > > Key: SPARK-15824 > URL: https://issues.apache.org/jira/browse/SPARK-15824 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Weizhong >Priority: Minor > Fix For: 2.0.0 > > > {code:sql} > create table src(k int, v int); > create table src_parquet(k int, v int); > with v as (select 1, 2) insert into table src_parquet from src; > {code} > Will throw exception: spark.sql.execution.id is already set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15824) Run 'with ... insert ... select' failed when use spark thriftserver
[ https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15824: Assignee: Herman van Hovell > Run 'with ... insert ... select' failed when use spark thriftserver > --- > > Key: SPARK-15824 > URL: https://issues.apache.org/jira/browse/SPARK-15824 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Weizhong >Assignee: Herman van Hovell >Priority: Minor > Fix For: 2.0.0 > > > {code:sql} > create table src(k int, v int); > create table src_parquet(k int, v int); > with v as (select 1, 2) insert into table src_parquet from src; > {code} > Will throw exception: spark.sql.execution.id is already set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12492: - Issue Type: Improvement (was: Bug) > SQL page of Spark-sql is always blank > -- > > Key: SPARK-12492 > URL: https://issues.apache.org/jira/browse/SPARK-12492 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Reporter: meiyoula >Assignee: KaiXinXIaoLei > Fix For: 2.0.0 > > Attachments: screenshot-1.png > > > When I run a sql query in spark-sql, the Execution page of SQL tab is always > blank. But the JDBCServer is not blank. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-12492. -- Resolution: Fixed Assignee: KaiXinXIaoLei Fix Version/s: 2.0.0 > SQL page of Spark-sql is always blank > -- > > Key: SPARK-12492 > URL: https://issues.apache.org/jira/browse/SPARK-12492 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Reporter: meiyoula >Assignee: KaiXinXIaoLei > Fix For: 2.0.0 > > Attachments: screenshot-1.png > > > When I run a sql query in spark-sql, the Execution page of SQL tab is always > blank. But the JDBCServer is not blank. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333030#comment-15333030 ] Apache Spark commented on SPARK-15639: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/13701 > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > > When we use the vectorized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we do not seem to actually set up the filters to be > pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
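The row-group-level filtering the issue refers to relies on per-row-group column statistics: a row group whose [min, max] range cannot satisfy the predicate is skipped without being read. A schematic illustration in plain Python (not Parquet's actual API):

```python
def groups_to_read(stats, value):
    """Given (min, max) stats per row group for the filtered column,
    keep only the groups whose range could contain `value`; the rest
    are pruned without any I/O."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= value <= hi]

# Three row groups; an equality filter on 12 touches only the second.
stats = [(0, 9), (10, 19), (20, 29)]
print(groups_to_read(stats, 12))  # [1]
```

The bug report says the filters are retrieved but never wired into this pruning step, so every row group is read regardless of its statistics.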
[jira] [Assigned] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive
[ https://issues.apache.org/jira/browse/SPARK-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15977: Assignee: Herman van Hovell (was: Apache Spark) > TRUNCATE TABLE does not work with Datasource tables outside of Hive > --- > > Key: SPARK-15977 > URL: https://issues.apache.org/jira/browse/SPARK-15977 > Project: Spark > Issue Type: Bug >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > The {{TRUNCATE TABLE}} command does not work with datasource tables without > Hive support. For example the following doesn't work: > {noformat} > DROP TABLE IF EXISTS test > CREATE TABLE test(a INT, b STRING) USING JSON > INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c') > SELECT * FROM test > TRUNCATE TABLE test > SELECT * FROM test > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive
[ https://issues.apache.org/jira/browse/SPARK-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333022#comment-15333022 ] Apache Spark commented on SPARK-15977: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/13697 > TRUNCATE TABLE does not work with Datasource tables outside of Hive > --- > > Key: SPARK-15977 > URL: https://issues.apache.org/jira/browse/SPARK-15977 > Project: Spark > Issue Type: Bug >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > The {{TRUNCATE TABLE}} command does not work with datasource tables without > Hive support. For example the following doesn't work: > {noformat} > DROP TABLE IF EXISTS test > CREATE TABLE test(a INT, b STRING) USING JSON > INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c') > SELECT * FROM test > TRUNCATE TABLE test > SELECT * FROM test > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive
[ https://issues.apache.org/jira/browse/SPARK-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15977: Assignee: Apache Spark (was: Herman van Hovell) > TRUNCATE TABLE does not work with Datasource tables outside of Hive > --- > > Key: SPARK-15977 > URL: https://issues.apache.org/jira/browse/SPARK-15977 > Project: Spark > Issue Type: Bug >Reporter: Herman van Hovell >Assignee: Apache Spark > > The {{TRUNCATE TABLE}} command does not work with datasource tables without > Hive support. For example the following doesn't work: > {noformat} > DROP TABLE IF EXISTS test > CREATE TABLE test(a INT, b STRING) USING JSON > INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c') > SELECT * FROM test > TRUNCATE TABLE test > SELECT * FROM test > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15979) Rename various Parquet support classes
[ https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332993#comment-15332993 ] Apache Spark commented on SPARK-15979: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13700 > Rename various Parquet support classes > -- > > Key: SPARK-15979 > URL: https://issues.apache.org/jira/browse/SPARK-15979 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This patch renames various Parquet support classes from CatalystAbc to > ParquetAbc. This new naming makes more sense for two reasons: > 1. These are not optimizer related (i.e. Catalyst) classes. > 2. We are in the Spark code base, and as a result it'd be more clear to call > out these are Parquet support classes, rather than some Spark classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15851. - Resolution: Fixed Assignee: Reynold Xin Fix Version/s: 2.0.0 > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov >Assignee: Reynold Xin > Fix For: 2.0.0 > > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13498) JDBCRDD should update some input metrics
[ https://issues.apache.org/jira/browse/SPARK-13498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13498. - Resolution: Fixed Assignee: Wayne Song Fix Version/s: 2.0.0 > JDBCRDD should update some input metrics > > > Key: SPARK-13498 > URL: https://issues.apache.org/jira/browse/SPARK-13498 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wayne Song >Assignee: Wayne Song >Priority: Minor > Fix For: 2.0.0 > > > The JDBCRDD does not update any input metrics, which makes it difficult to > see its progress in the web UI. It should be simple to at least update > recordsRead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15979) Rename various Parquet support classes
[ https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15979: Assignee: Apache Spark (was: Reynold Xin) > Rename various Parquet support classes > -- > > Key: SPARK-15979 > URL: https://issues.apache.org/jira/browse/SPARK-15979 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Fix For: 2.0.0 > > > This patch renames various Parquet support classes from CatalystAbc to > ParquetAbc. This new naming makes more sense for two reasons: > 1. These are not optimizer related (i.e. Catalyst) classes. > 2. We are in the Spark code base, and as a result it'd be more clear to call > out these are Parquet support classes, rather than some Spark classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15979) Rename various Parquet support classes
[ https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15979: Assignee: Reynold Xin (was: Apache Spark) > Rename various Parquet support classes > -- > > Key: SPARK-15979 > URL: https://issues.apache.org/jira/browse/SPARK-15979 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This patch renames various Parquet support classes from CatalystAbc to > ParquetAbc. This new naming makes more sense for two reasons: > 1. These are not optimizer related (i.e. Catalyst) classes. > 2. We are in the Spark code base, and as a result it'd be more clear to call > out these are Parquet support classes, rather than some Spark classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15979) Rename various Parquet support classes
Reynold Xin created SPARK-15979: --- Summary: Rename various Parquet support classes Key: SPARK-15979 URL: https://issues.apache.org/jira/browse/SPARK-15979 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons: 1. These are not optimizer-related (i.e. Catalyst) classes. 2. We are in the Spark code base, and as a result it'd be clearer to call out that these are Parquet support classes, rather than some Spark classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15958) Make initial buffer size for the Sorter configurable
[ https://issues.apache.org/jira/browse/SPARK-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15958: Assignee: Apache Spark > Make initial buffer size for the Sorter configurable > > > Key: SPARK-15958 > URL: https://issues.apache.org/jira/browse/SPARK-15958 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Apache Spark > > Currently the initial buffer size in the sorter is hard coded inside the code > (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88) > and is too small for large workload. As a result, the sorter spends > significant time expanding the buffer size and copying the data. It would be > useful to have it configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15979) Rename various Parquet support classes
[ https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15979. - Resolution: Fixed > Rename various Parquet support classes > -- > > Key: SPARK-15979 > URL: https://issues.apache.org/jira/browse/SPARK-15979 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This patch renames various Parquet support classes from CatalystAbc to > ParquetAbc. This new naming makes more sense for two reasons: > 1. These are not optimizer related (i.e. Catalyst) classes. > 2. We are in the Spark code base, and as a result it'd be more clear to call > out these are Parquet support classes, rather than some Spark classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15979) Rename various Parquet support classes
[ https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332968#comment-15332968 ] Apache Spark commented on SPARK-15979: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13696 > Rename various Parquet support classes > -- > > Key: SPARK-15979 > URL: https://issues.apache.org/jira/browse/SPARK-15979 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > This patch renames various Parquet support classes from CatalystAbc to > ParquetAbc. This new naming makes more sense for two reasons: > 1. These are not optimizer related (i.e. Catalyst) classes. > 2. We are in the Spark code base, and as a result it'd be more clear to call > out these are Parquet support classes, rather than some Spark classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15958) Make initial buffer size for the Sorter configurable
[ https://issues.apache.org/jira/browse/SPARK-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15958: Assignee: (was: Apache Spark) > Make initial buffer size for the Sorter configurable > > > Key: SPARK-15958 > URL: https://issues.apache.org/jira/browse/SPARK-15958 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently the initial buffer size in the sorter is hard coded inside the code > (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88) > and is too small for large workload. As a result, the sorter spends > significant time expanding the buffer size and copying the data. It would be > useful to have it configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15958) Make initial buffer size for the Sorter configurable
[ https://issues.apache.org/jira/browse/SPARK-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332966#comment-15332966 ] Apache Spark commented on SPARK-15958: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/13699 > Make initial buffer size for the Sorter configurable > > > Key: SPARK-15958 > URL: https://issues.apache.org/jira/browse/SPARK-15958 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently the initial buffer size in the sorter is hard coded inside the code > (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88) > and is too small for large workload. As a result, the sorter spends > significant time expanding the buffer size and copying the data. It would be > useful to have it configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
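As a sketch of what SPARK-15958 proposes, the hard-coded constant could be replaced by a lookup against configuration with a safe default. The property name `spark.sql.sorter.initialBufferSize` and the `SorterConfig` helper below are hypothetical stand-ins for illustration, not Spark's actual configuration API:

```scala
// Sketch: reading an initial sorter buffer size from configuration instead of
// hard-coding it. The property name below is hypothetical; Spark's real
// mechanism would go through SparkConf/SQLConf.
object SorterConfig {
  val DefaultInitialBufferSize: Int = 4096

  // Parse the configured size, falling back to the default when the key is
  // absent, non-numeric, or non-positive.
  def initialBufferSize(conf: Map[String, String]): Int =
    conf.get("spark.sql.sorter.initialBufferSize")
      .flatMap(s => scala.util.Try(s.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultInitialBufferSize)
}
```

A large workload could then raise the initial size (say, to 1048576) up front and avoid repeated buffer growth and copying.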
[jira] [Comment Edited] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332960#comment-15332960 ] Joseph Fourny edited comment on SPARK-15690 at 6/16/16 3:00 AM: I am trying to develop single-node clusters on large servers (30+ CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-sized datasets, but with a huge number of concurrent jobs (shared, multi-tenant environments). Efficiency and sub-second response times are the primary requirements. The shuffle between stages is the current bottleneck. Writing anything to disk is just a waste of time if all computations are done in the same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but a lot of CPU time is still spent in compression and serialization. I would rather not hack my way out of this one. Is it wishful thinking to have this officially supported? was (Author: josephfourny): +1 on this. I am trying to develop single-node clusters on large servers (30+ CPU cores) with 2-3 TB or RAM. Our use cases involve small to medium size datasets, but with a huge amount of concurrent jobs (shared, multi-tenant environments). Efficiency and sub-second response times are the primary requirements. This shuffle between stages is the current bottleneck. Writing anything to disk is just a waste of time if all computations are done in the same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but still a lot of CPU time is spent in compression and serialization. I would rather not hack my way out of this one. Is it wishful thinking to have this officially supported? 
> Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then writes the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tends to be low. However, > an increasing number of Spark users are using the system to process data on a > single node. When running on a single node against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fall back to disk > if the data size does not fit in memory. Given the number of partitions is > usually small (say less than 256), it'd require only a single pass to do the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332960#comment-15332960 ] Joseph Fourny commented on SPARK-15690: --- +1 on this. I am trying to develop single-node clusters on large servers (30+ CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-sized datasets, but with a huge number of concurrent jobs (shared, multi-tenant environments). Efficiency and sub-second response times are the primary requirements. The shuffle between stages is the current bottleneck. Writing anything to disk is just a waste of time if all computations are done in the same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but a lot of CPU time is still spent in compression and serialization. I would rather not hack my way out of this one. Is it wishful thinking to have this officially supported? > Fast single-node (single-process) in-memory shuffle > --- > > Key: SPARK-15690 > URL: https://issues.apache.org/jira/browse/SPARK-15690 > Project: Spark > Issue Type: New Feature > Components: Shuffle, SQL >Reporter: Reynold Xin > > Spark's current shuffle implementation sorts all intermediate data by their > partition id, and then writes the data to disk. This is not a big bottleneck > because the network throughput on commodity clusters tends to be low. However, > an increasing number of Spark users are using the system to process data on a > single node. When running on a single node against intermediate data that > fits in memory, the existing shuffle code path can become a big bottleneck. > The goal of this ticket is to change Spark so it can use in-memory radix sort > to do data shuffling on a single node, and still gracefully fall back to disk > if the data size does not fit in memory. 
Given the number of partitions is > usually small (say less than 256), it'd require only a single pass to do the > radix sort with pretty decent CPU efficiency. > Note that there have been many in-memory shuffle attempts in the past. This > ticket has a smaller scope (single-process), and aims to actually > productionize this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
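The single-pass idea in the ticket description can be illustrated with a plain counting sort keyed by partition id: with a small partition count, records can be bucketed entirely in memory instead of being sorted and spilled to disk. This is a self-contained sketch of the technique, not Spark's shuffle implementation:

```scala
// Sketch of the ticket's idea: group (partitionId, record) pairs into
// per-partition buckets via counting sort. Two passes over the data
// (count, then place), O(n + numPartitions), no disk involved.
object InMemoryShuffleSketch {
  def shuffle[T: scala.reflect.ClassTag](
      records: Array[(Int, T)], numPartitions: Int): Array[Array[T]] = {
    // Pass 1: count records per partition to size each bucket exactly.
    val counts = new Array[Int](numPartitions)
    records.foreach { case (pid, _) => counts(pid) += 1 }
    // Pass 2: place each record into its partition's bucket.
    val buckets = Array.tabulate(numPartitions)(p => new Array[T](counts(p)))
    val next = new Array[Int](numPartitions)
    records.foreach { case (pid, rec) =>
      buckets(pid)(next(pid)) = rec
      next(pid) += 1
    }
    buckets
  }
}
```

Within-partition order is preserved, and the per-partition buckets could be handed directly to downstream tasks in the same JVM, which is the scenario the commenters describe.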
[jira] [Assigned] (SPARK-15978) Some improvement of "Show Tables"
[ https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15978: Assignee: Apache Spark > Some improvement of "Show Tables" > - > > Key: SPARK-15978 > URL: https://issues.apache.org/jira/browse/SPARK-15978 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng >Assignee: Apache Spark >Priority: Minor > > I've found some minor issues in "show tables" command: > 1. In the SessionCatalog.scala, listTables(db: String) method will call > listTables(formatDatabaseName(db), "*") to list all the tables for certain > db, but in the method listTables(db: String, pattern: String), this db name > is formatted once more. So I think we should remove formatDatabaseName() in > the caller. > 2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, > just like listDatabases(). > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15978) Some improvement of "Show Tables"
[ https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15978: Assignee: (was: Apache Spark) > Some improvement of "Show Tables" > - > > Key: SPARK-15978 > URL: https://issues.apache.org/jira/browse/SPARK-15978 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng >Priority: Minor > > I've found some minor issues in "show tables" command: > 1. In the SessionCatalog.scala, listTables(db: String) method will call > listTables(formatDatabaseName(db), "*") to list all the tables for certain > db, but in the method listTables(db: String, pattern: String), this db name > is formatted once more. So I think we should remove formatDatabaseName() in > the caller. > 2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, > just like listDatabases(). > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15978) Some improvement of "Show Tables"
[ https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15978: Assignee: Apache Spark > Some improvement of "Show Tables" > - > > Key: SPARK-15978 > URL: https://issues.apache.org/jira/browse/SPARK-15978 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng >Assignee: Apache Spark >Priority: Minor > > I've found some minor issues in "show tables" command: > 1. In the SessionCatalog.scala, listTables(db: String) method will call > listTables(formatDatabaseName(db), "*") to list all the tables for certain > db, but in the method listTables(db: String, pattern: String), this db name > is formatted once more. So I think we should remove formatDatabaseName() in > the caller. > 2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, > just like listDatabases(). > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15978) Some improvement of "Show Tables"
[ https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332802#comment-15332802 ] Apache Spark commented on SPARK-15978: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/13695 > Some improvement of "Show Tables" > - > > Key: SPARK-15978 > URL: https://issues.apache.org/jira/browse/SPARK-15978 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng >Priority: Minor > > I've found some minor issues in "show tables" command: > 1. In the SessionCatalog.scala, listTables(db: String) method will call > listTables(formatDatabaseName(db), "*") to list all the tables for certain > db, but in the method listTables(db: String, pattern: String), this db name > is formatted once more. So I think we should remove formatDatabaseName() in > the caller. > 2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, > just like listDatabases(). > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332797#comment-15332797 ] Nezih Yigitbasi commented on SPARK-15782: - reopened, will submit a PR including Marcelo's fix on top of mine. > --packages doesn't work with the spark-shell > > > Key: SPARK-15782 > URL: https://issues.apache.org/jira/browse/SPARK-15782 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Nezih Yigitbasi >Assignee: Nezih Yigitbasi >Priority: Blocker > Fix For: 2.0.0 > > > When {{--packages}} is specified with {{spark-shell}} the classes from those > packages cannot be found, which I think is due to some of the changes in > {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the > {{spark.jars}} system property in client mode, which is used by the repl main > class to set the classpath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nezih Yigitbasi reopened SPARK-15782: - > --packages doesn't work with the spark-shell > > > Key: SPARK-15782 > URL: https://issues.apache.org/jira/browse/SPARK-15782 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Nezih Yigitbasi >Assignee: Nezih Yigitbasi >Priority: Blocker > Fix For: 2.0.0 > > > When {{--packages}} is specified with {{spark-shell}} the classes from those > packages cannot be found, which I think is due to some of the changes in > {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the > {{spark.jars}} system property in client mode, which is used by the repl main > class to set the classpath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15782: Assignee: Apache Spark (was: Nezih Yigitbasi) > --packages doesn't work with the spark-shell > > > Key: SPARK-15782 > URL: https://issues.apache.org/jira/browse/SPARK-15782 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Nezih Yigitbasi >Assignee: Apache Spark >Priority: Blocker > Fix For: 2.0.0 > > > When {{--packages}} is specified with {{spark-shell}} the classes from those > packages cannot be found, which I think is due to some of the changes in > {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the > {{spark.jars}} system property in client mode, which is used by the repl main > class to set the classpath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15782: Assignee: Nezih Yigitbasi (was: Apache Spark) > --packages doesn't work with the spark-shell > > > Key: SPARK-15782 > URL: https://issues.apache.org/jira/browse/SPARK-15782 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Nezih Yigitbasi >Assignee: Nezih Yigitbasi >Priority: Blocker > Fix For: 2.0.0 > > > When {{--packages}} is specified with {{spark-shell}} the classes from those > packages cannot be found, which I think is due to some of the changes in > {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the > {{spark.jars}} system property in client mode, which is used by the repl main > class to set the classpath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
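To illustrate why the line removed in SPARK-12343 mattered: the repl derives its extra classpath from the {{spark.jars}} system property, so if submit-side code no longer sets that property in client mode, the jars resolved by {{--packages}} never reach the repl's class loader. The helper below is an illustrative stand-in, not Spark's actual repl code:

```scala
// Sketch: how a repl main class might turn the comma-separated "spark.jars"
// property into a list of classpath entries. If the property is never set,
// the result is empty and --packages classes become unresolvable.
def replClasspathJars(props: Map[String, String]): Seq[String] =
  props.get("spark.jars")
    .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq)
    .getOrElse(Seq.empty)
```

With the property unset, `replClasspathJars` returns nothing, matching the reported symptom that classes from `--packages` cannot be found in the shell.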
[jira] [Created] (SPARK-15978) Some improvement of "Show Tables"
Bo Meng created SPARK-15978: --- Summary: Some improvement of "Show Tables" Key: SPARK-15978 URL: https://issues.apache.org/jira/browse/SPARK-15978 Project: Spark Issue Type: Bug Components: SQL Reporter: Bo Meng Priority: Minor I've found some minor issues in the "show tables" command: 1. In SessionCatalog.scala, the listTables(db: String) method will call listTables(formatDatabaseName(db), "*") to list all the tables for a certain db, but in the method listTables(db: String, pattern: String), this db name is formatted once more. So I think we should remove formatDatabaseName() in the caller. 2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, just like listDatabases(). I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
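Both points in the report can be sketched with simplified stand-ins for SessionCatalog/InMemoryCatalog. The key observation behind point 1 is that the name normalization is idempotent, so formatting twice is redundant (though harmless); point 2 is just returning a sorted listing. The helper below is illustrative, not Spark's catalog code:

```scala
// Simplified stand-in for the catalog behavior discussed in SPARK-15978.
object CatalogSketch {
  // Normalization as in SessionCatalog: idempotent, so a second call
  // (the redundancy flagged in point 1) changes nothing.
  def formatDatabaseName(db: String): String = db.trim.toLowerCase

  private val tables = Map("db1" -> Seq("zeta", "alpha", "mid"))

  // Point 2: return the tables sorted, mirroring listDatabases().
  def listTables(db: String): Seq[String] =
    tables.getOrElse(formatDatabaseName(db), Seq.empty[String]).sorted
}
```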
[jira] [Commented] (SPARK-13498) JDBCRDD should update some input metrics
[ https://issues.apache.org/jira/browse/SPARK-13498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332776#comment-15332776 ] Apache Spark commented on SPARK-13498: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13694 > JDBCRDD should update some input metrics > > > Key: SPARK-13498 > URL: https://issues.apache.org/jira/browse/SPARK-13498 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wayne Song >Priority: Minor > > The JDBCRDD does not update any input metrics, which makes it difficult to > see its progress in the web UI. It should be simple to at least update > recordsRead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
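The fix the ticket asks for amounts to wrapping the RDD's row iterator so each fetched record bumps a metrics counter. The `MetricsCounter` class below is a stand-in for Spark's InputMetrics (where JDBCRDD would call something like an incRecordsRead method); the sketch shows the wrapping technique only:

```scala
// Sketch: count records as they are consumed from an iterator, the way
// JDBCRDD could update recordsRead. MetricsCounter stands in for InputMetrics.
final class MetricsCounter { var recordsRead: Long = 0L }

def countingIterator[T](rows: Iterator[T], metrics: MetricsCounter): Iterator[T] =
  new Iterator[T] {
    def hasNext: Boolean = rows.hasNext
    def next(): T = {
      val row = rows.next()
      metrics.recordsRead += 1  // update the metric per record consumed
      row
    }
  }
```

Because the counter is updated lazily as rows are pulled, the web UI would see progress during the scan rather than only at the end.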
[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters
[ https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332775#comment-15332775 ] Sean Zhong commented on SPARK-14048: [~simeons] Are you able to reproduce this case any longer? I cannot reproduce this on 1.6 by using the following script on databricks cloud community edition. {code} val rdd = sc.makeRDD( """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 20}""" :: Nil) sqlContext.read.json(rdd).registerTempTable("test") sqlContext.sql("select first(st) as st from test group by age").show() {code} > Aggregation operations on structs fail when the structs have fields with > special characters > --- > > Key: SPARK-14048 > URL: https://issues.apache.org/jira/browse/SPARK-14048 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Databricks w/ 1.6.0 >Reporter: Simeon Simeonov > Labels: sql > Attachments: bug_structs_with_backticks.html > > > Consider a schema where a struct has field names with special characters, > e.g., > {code} > |-- st: struct (nullable = true) > ||-- x.y: long (nullable = true) > {code} > Schema such as these are frequently generated by the JSON schema generator, > which seems to never want to map JSON data to {{MapType}} always preferring > to use {{StructType}}. > In SparkSQL, referring to these fields requires backticks, e.g., > {{st.`x.y`}}. There is no problem manipulating these structs unless one is > using an aggregation function. It seems that, under the covers, the code is > not escaping fields with special characters correctly. > For example, > {code} > select first(st) as st from tbl group by something > {code} > generates > {code} > org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: > struct. If you have a struct and a field name of it has any > special characters, please use backticks (`) to quote that field name, e.g. > `x+y`. 
Please note that backtick itself is not supported in a field name. > at > org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112) > at > org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116) > at > org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395) > at > com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394) > at > com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82) > at > com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) > at > com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306) > at > com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at > com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467) > at scala.util.Try$.apply(Try.scala:161) > at > 
com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464) > at > com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365) > at > com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
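The backtick-quoting convention the error message describes can be sketched in isolation. This is a toy illustration in Python, not Spark's actual {{DataTypeParser}}: when rendering a struct type string, field names containing characters outside `[A-Za-z0-9_]` are wrapped in backticks, and backticks themselves are rejected.

```python
import re

def quote_field(name):
    # Backtick itself is not supported in a field name (per the error message).
    if "`" in name:
        raise ValueError("backtick not supported in field name: " + name)
    # Plain identifiers need no quoting; names with special characters
    # (e.g. "x.y") must be backtick-quoted.
    if re.fullmatch(r"[A-Za-z0-9_]+", name):
        return name
    return f"`{name}`"

def struct_type_string(fields):
    # fields: list of (name, dataType) pairs -> a struct<...> type string
    return "struct<" + ",".join(f"{quote_field(n)}:{t}" for n, t in fields) + ">"
```

Under this sketch, the schema in the report would render as `struct<`x.y`:bigint>`, which is what the parser expects to get back.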
[jira] [Created] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive
Herman van Hovell created SPARK-15977: - Summary: TRUNCATE TABLE does not work with Datasource tables outside of Hive Key: SPARK-15977 URL: https://issues.apache.org/jira/browse/SPARK-15977 Project: Spark Issue Type: Bug Reporter: Herman van Hovell Assignee: Herman van Hovell The {{TRUNCATE TABLE}} command does not work with datasource tables without Hive support. For example the following doesn't work: {noformat} DROP TABLE IF EXISTS test CREATE TABLE test(a INT, b STRING) USING JSON INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c') SELECT * FROM test TRUNCATE TABLE test SELECT * FROM test {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.
[ https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7848. Resolution: Fixed Assignee: Nirman Narang Fix Version/s: 2.0.0 > Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" > information. > > > Key: SPARK-7848 > URL: https://issues.apache.org/jira/browse/SPARK-7848 > Project: Spark > Issue Type: Documentation > Components: Streaming >Reporter: jay vyas >Assignee: Nirman Narang > Fix For: 2.0.0 > > > A recent email on the mailing list detailed a bunch of great "knobs" to > remember for Spark Streaming. > Let's integrate this into the docs where appropriate. > I'll paste the raw text in a comment field below -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8813) Combine files when there are many small files in a table
[ https://issues.apache.org/jira/browse/SPARK-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8813. Resolution: Fixed Assignee: Michael Armbrust Fix Version/s: 2.0.0 > Combine files when there are many small files in a table > - > > Key: SPARK-8813 > URL: https://issues.apache.org/jira/browse/SPARK-8813 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Yadong Qi >Assignee: Michael Armbrust > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15974) Create a socket on YARN AM start-up
[ https://issues.apache.org/jira/browse/SPARK-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332749#comment-15332749 ] Mingyu Kim commented on SPARK-15974: I agree this is not ideal. A lot of the time, though, setting up a server with a Socket won't be an unreasonable thing to do. The alternative would be to have the Spark program pass some information to the Spark AM during start-up. (Having the Spark program set the port in YARN is not possible, as discussed on the thread linked above.) This can probably be done through the use of static variables in the Spark program class. None of these sound particularly great to me, but here are some options I can think of: - The Spark program class optionally has a Map initialize() method, which returns some named objects back to the Spark AM. "rpc-port" could be one of the supported key names, and we can imagine adding more keys later. The Spark program class will need to store some information (in the case of the RPC port, a Server object or Socket) as a static var for the main method to use. - Pass something like a SettableFuture to the main method so that the Spark AM can wait for some initialization to be done. This means that the command-line args need to be augmented with this one extra thing, which is confusing, or that the SettableFuture needs to be passed to the Spark program class through some other method and then stored as a static var for the main method to use. Another option would be to change the way spark-submitted applications are written so that the class implements an interface with an explicit initialize method, as opposed to a class with a main method. This would let us avoid playing with static variables, but it would be a pretty big compatibility break for Spark. 
> Create a socket on YARN AM start-up > --- > > Key: SPARK-15974 > URL: https://issues.apache.org/jira/browse/SPARK-15974 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Mingyu Kim > > YARN provides a way for the ApplicationMaster to register an RPC port so that a > client outside the YARN cluster can reach the application for any RPCs, but > Spark’s YARN AMs simply register a dummy port number of 0. For Spark > programs that start up a server, this makes it hard for the submitter to > discover the server port securely. Spark's ApplicationMaster should > optionally create a ServerSocket and pass it to the Spark user program. This > socket initialization should be disabled by default. > Some discussion on the dev@spark thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/Utilizing-YARN-AM-RPC-port-field-td17892.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
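A common pattern for "create a socket and report its port" is to bind to port 0 and let the OS assign a free ephemeral port. The actual proposal concerns Spark's JVM-side ApplicationMaster; this is just a minimal Python sketch of the mechanism, with illustrative names:

```python
import socket

def open_server_socket():
    # Binding to port 0 asks the OS for a free ephemeral port; the chosen
    # port could then be registered (e.g. in YARN's AM RPC-port field)
    # instead of the dummy value 0 that the AM currently reports.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    s.listen(1)
    return s, s.getsockname()[1]
```

The submitter could then discover the port from the resource manager rather than out-of-band.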
[jira] [Resolved] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"
[ https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12114. - Resolution: Fixed Fix Version/s: 2.0.0 > ColumnPruning rule fails in case of "Project <- Filter <- Join" > --- > > Key: SPARK-12114 > URL: https://issues.apache.org/jira/browse/SPARK-12114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Min Qiu > Fix For: 2.0.0 > > > For the query > {code} > SELECT c_name, c_custkey, o_orderkey, o_orderdate, >o_totalprice, sum(l_quantity) > FROM customer join orders join lineitem > on c_custkey = o_custkey AND o_orderkey = l_orderkey > left outer join (SELECT l_orderkey tmp_orderkey > FROM lineitem > GROUP BY l_orderkey > HAVING sum(l_quantity) > 300) tmp > on o_orderkey = tmp_orderkey > WHERE tmp_orderkey IS NOT NULL > GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice > ORDER BY o_totalprice DESC, o_orderdate > {code} > The optimizedPlan is > {code} > Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC] > > Aggregate > \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], > \[c_name#38,c_custkey#37,o_orderkey#45, > o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36] > {color: green}Project > \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58] >Filter IS NOT NULL tmp_orderkey#35 > Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color} > Join Inner, Some((c_custkey#37 = o_custkey#46)) > MetastoreRelation default, customer, None > Join Inner, Some((o_orderkey#45 = l_orderkey#54)) >MetastoreRelation default, orders, None >MetastoreRelation default, lineitem, None > Project \[tmp_orderkey#35] > Filter havingCondition#86 >Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS > havingCondition#86,l_orderkey#70 AS tmp_orderkey#35] > Project \[l_orderkey#70,l_quantity#74] > MetastoreRelation default, lineitem, None > {code} > Due to the pattern highlighted in green that the 
ColumnPruning rule fails to > deal with, all columns of the lineitem and orders tables are scanned. The > unneeded columns are also involved in the data shuffling. The performance is > extremely bad if either of the two tables is big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332721#comment-15332721 ] Apache Spark commented on SPARK-12032: -- User 'flyson' has created a pull request for this issue: https://github.com/apache/spark/pull/10258 > Filter can't be pushed down to correct Join because of bad order of Join > > > Key: SPARK-12032 > URL: https://issues.apache.org/jira/browse/SPARK-12032 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 2.0.0 > > > For this query: > {code} > select d.d_year, count(*) cnt >FROM store_sales, date_dim d, customer c >WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = > d.d_date_sk >group by d.d_year > {code} > Current optimized plan is > {code} > == Optimized Logical Plan == > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && > (c_first_shipto_date_sk#106 = d_date_sk#141))) >Project [d_date_sk#141,d_year#147,ss_customer_sk#283] > Join Inner, None > Project [ss_customer_sk#283] > Relation[] ParquetRelation[store_sales] > Project [d_date_sk#141,d_year#147] > Relation[] ParquetRelation[date_dim] >Project [c_customer_sk#101,c_first_shipto_date_sk#106] > Relation[] ParquetRelation[customer] > {code} > It will join store_sales and date_dim together without any condition, the > condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because > the bad order of joins. > The optimizer should re-order the joins, join date_dim after customer, then > it can pushed down the condition correctly. 
> The plan should be > {code} > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141)) >Project [c_first_shipto_date_sk#106] > Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101)) > Project [ss_customer_sk#283] > Relation[store_sales] > Project [c_first_shipto_date_sk#106,c_customer_sk#101] > Relation[customer] >Project [d_year#147,d_date_sk#141] > Relation[date_dim] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
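The reordering the reporter describes — join a table only once a join condition connects it to what has already been joined, so conditions can be applied (and filters pushed down) immediately — can be sketched as a greedy walk over the join graph. A toy model, not Spark's optimizer:

```python
def reorder_joins(start, join_pairs):
    # join_pairs: set of frozensets {t1, t2}, one per equi-join condition.
    joined, order = {start}, [start]
    remaining = {t for pair in join_pairs for t in pair} - joined
    while remaining:
        # Pick a table whose join condition connects it to the tables already
        # joined; joining it now lets that condition apply at this step
        # instead of producing a condition-less (cross-join-like) plan.
        nxt = next(t for t in sorted(remaining)
                   if any(t in pair and pair - {t} <= joined
                          for pair in join_pairs))
        order.append(nxt)
        joined.add(nxt)
        remaining.discard(nxt)
    return order
```

With the tables from the report, starting at store_sales the walk picks customer before date_dim, matching the corrected plan above (the sketch assumes the graph is connected; a real optimizer must also handle cross joins).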
[jira] [Assigned] (SPARK-12329) spark-sql prints out SET commands to stdout instead of stderr
[ https://issues.apache.org/jira/browse/SPARK-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12329: Assignee: Apache Spark > spark-sql prints out SET commands to stdout instead of stderr > - > > Key: SPARK-12329 > URL: https://issues.apache.org/jira/browse/SPARK-12329 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ashwin Shankar >Assignee: Apache Spark >Priority: Minor > > When I run "$spark-sql -f ", I see that a few "SET key value" messages > get printed on stdout instead of stderr. These messages should go to stderr. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12329) spark-sql prints out SET commands to stdout instead of stderr
[ https://issues.apache.org/jira/browse/SPARK-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12329: Assignee: (was: Apache Spark) > spark-sql prints out SET commands to stdout instead of stderr > - > > Key: SPARK-12329 > URL: https://issues.apache.org/jira/browse/SPARK-12329 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Ashwin Shankar >Priority: Minor > > When I run "$spark-sql -f ", I see that a few "SET key value" messages > get printed on stdout instead of stderr. These messages should go to stderr. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
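The convention at stake is the standard Unix one: query results go to stdout, diagnostics to stderr, so that `spark-sql -f query.sql > out.txt` leaves only results in out.txt. A minimal illustration of the separation (not spark-sql's actual code):

```python
import sys

def emit(line, diagnostic=False):
    # "SET key value" style messages are diagnostics and belong on stderr;
    # only query results should reach stdout, so that shell redirection of
    # stdout captures results alone.
    print(line, file=sys.stderr if diagnostic else sys.stdout)
```

With this discipline, redirecting stdout never mixes status chatter into the result file.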
[jira] [Resolved] (SPARK-9689) Cache doesn't refresh for HadoopFsRelation based table
[ https://issues.apache.org/jira/browse/SPARK-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9689. Resolution: Fixed Assignee: (was: Cheng Hao) Fix Version/s: 2.0.0 I think this one has been fixed in 2.0 already. > Cache doesn't refresh for HadoopFsRelation based table > -- > > Key: SPARK-9689 > URL: https://issues.apache.org/jira/browse/SPARK-9689 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheng Hao > Fix For: 2.0.0 > > > {code:title=example|borderStyle=solid} > // create a HadoopFsRelation based table > sql(s""" > |CREATE TEMPORARY TABLE jsonTable (a int, b string) > |USING org.apache.spark.sql.json.DefaultSource > |OPTIONS ( > | path '${path.toString}' > |)""".stripMargin) > > // give the value from table jt > sql( > s""" > |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt > """.stripMargin) > // cache the HadoopFsRelation Table > sqlContext.cacheTable("jsonTable") > > // update the HadoopFsRelation Table > sql( > s""" > |INSERT OVERWRITE TABLE jsonTable SELECT a * 2, b FROM jt > """.stripMargin) > // Even this will fail > sql("SELECT a, b FROM jsonTable").collect() > // This will fail, as the cache doesn't refresh > checkAnswer( > sql("SELECT a, b FROM jsonTable"), > sql("SELECT a * 2, b FROM jt").collect()) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
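The bug boils down to a cache that is not invalidated when the underlying files are overwritten. A toy model of the required behavior — invalidate the cached entry on every rewrite — which is not Spark's CacheManager, just an illustration:

```python
class TableCache:
    def __init__(self):
        self.storage = {}   # stands in for the files on disk
        self.cache = {}     # stands in for cached scan results

    def insert_overwrite(self, table, rows):
        self.storage[table] = list(rows)
        # The step missing in the reported bug: drop the stale cache entry
        # whenever the underlying data is rewritten.
        self.cache.pop(table, None)

    def select(self, table):
        if table not in self.cache:
            self.cache[table] = list(self.storage[table])
        return self.cache[table]
```

Without the `pop`, the second `select` in the report's script would keep returning the pre-overwrite rows.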
[jira] [Commented] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"
[ https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332704#comment-15332704 ] Derek Sabry commented on SPARK-4944: This email account is inactive. Please contact another person at the company or pe...@fb.com. > Table Not Found exception in "Create Table Like registered RDD table" > - > > Key: SPARK-4944 > URL: https://issues.apache.org/jira/browse/SPARK-4944 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > {code} > rdd_table.saveAsParquetFile("/user/spark/my_data.parquet") > hiveContext.registerRDDAsTable(rdd_table, "rdd_table") > hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION > '/user/spark/my_data.parquet'") > {code} > {noformat} > org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not > found rdd_table > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"
[ https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4944. Resolution: Auto Closed > Table Not Found exception in "Create Table Like registered RDD table" > - > > Key: SPARK-4944 > URL: https://issues.apache.org/jira/browse/SPARK-4944 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > {code} > rdd_table.saveAsParquetFile("/user/spark/my_data.parquet") > hiveContext.registerRDDAsTable(rdd_table, "rdd_table") > hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION > '/user/spark/my_data.parquet'") > {code} > {noformat} > org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not > found rdd_table > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) > at > org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15976) Make Stage Numbering deterministic
[ https://issues.apache.org/jira/browse/SPARK-15976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332686#comment-15332686 ] Imran Rashid commented on SPARK-15976: -- cc [~kayousterhout] [~markhamstra] > Make Stage Numbering deterministic > - > > Key: SPARK-15976 > URL: https://issues.apache.org/jira/browse/SPARK-15976 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid > > Stage numbering in Spark is non-deterministic. It never was deterministic, > but it *appeared* to be so in most cases. After SPARK-15927, it is far more > random. Reliable stage numbering would be helpful for internal unit tests, > and also for any client code which uses {{SparkListener}} to monitor a job > and gauge progress. > FWIW, I had never even realized that the order was non-deterministic before, > and have written plenty of code which assumes some stage numbering. I expect > users may be bitten by this too. We might even want to try to restore the > "usual" ordering from before SPARK-15927. > Finally, it would be nice to restore some of the tests turned off here if > possible: https://github.com/apache/spark/pull/13688 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R
[ https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15177: -- Issue Type: Documentation (was: Improvement) > SparkR 2.0 QA: New R APIs and API docs for mllib.R > -- > > Key: SPARK-15177 > URL: https://issues.apache.org/jira/browse/SPARK-15177 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Yanbo Liang >Priority: Blocker > > Audit new public R APIs in mllib.R. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15129) Clarify conventions for calling Spark and MLlib from R
[ https://issues.apache.org/jira/browse/SPARK-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15129: -- Assignee: Gayathri Murali > Clarify conventions for calling Spark and MLlib from R > -- > > Key: SPARK-15129 > URL: https://issues.apache.org/jira/browse/SPARK-15129 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, SparkR >Reporter: Joseph K. Bradley >Assignee: Gayathri Murali >Priority: Blocker > > Since some R API modifications happened in 2.0, we need to make the new > standards clear in the user guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15976) Make Stage Numbering deterministic
Imran Rashid created SPARK-15976: Summary: Make Stage Numbering deterministic Key: SPARK-15976 URL: https://issues.apache.org/jira/browse/SPARK-15976 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.0.0 Reporter: Imran Rashid Stage numbering in Spark is non-deterministic. It never was deterministic, but it *appeared* to be so in most cases. After SPARK-15927, it is far more random. Reliable stage numbering would be helpful for internal unit tests, and also for any client code which uses {{SparkListener}} to monitor a job and gauge progress. FWIW, I had never even realized that the order was non-deterministic before, and have written plenty of code which assumes some stage numbering. I expect users may be bitten by this too. We might even want to try to restore the "usual" ordering from before SPARK-15927. Finally, it would be nice to restore some of the tests turned off here if possible: https://github.com/apache/spark/pull/13688 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332674#comment-15332674 ] Apache Spark commented on SPARK-15782: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/13693 > --packages doesn't work with the spark-shell > > > Key: SPARK-15782 > URL: https://issues.apache.org/jira/browse/SPARK-15782 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Nezih Yigitbasi >Assignee: Nezih Yigitbasi >Priority: Blocker > Fix For: 2.0.0 > > > When {{--packages}} is specified with {{spark-shell}} the classes from those > packages cannot be found, which I think is due to some of the changes in > {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the > {{spark.jars}} system property in client mode, which is used by the repl main > class to set the classpath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15974) Create a socket on YARN AM start-up
[ https://issues.apache.org/jira/browse/SPARK-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332637#comment-15332637 ] Marcelo Vanzin commented on SPARK-15974: bq. Spark's ApplicationMaster should optionally create a ServerSocket and pass it to the Spark user program. That makes a ton of assumptions about how the user code starts to listen for connections. If Spark is to support something like this, there should be some other way of telling Spark (or YARN directly) what the port is. > Create a socket on YARN AM start-up > --- > > Key: SPARK-15974 > URL: https://issues.apache.org/jira/browse/SPARK-15974 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Mingyu Kim > > YARN provides a way for the ApplicationMaster to register an RPC port so that a > client outside the YARN cluster can reach the application for any RPCs, but > Spark’s YARN AMs simply register a dummy port number of 0. For Spark > programs that start up a server, this makes it hard for the submitter to > discover the server port securely. Spark's ApplicationMaster should > optionally create a ServerSocket and pass it to the Spark user program. This > socket initialization should be disabled by default. > Some discussion on the dev@spark thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/Utilizing-YARN-AM-RPC-port-field-td17892.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15934) Return binary mode in ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15934. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13667 [https://github.com/apache/spark/pull/13667] > Return binary mode in ThriftServer > -- > > Key: SPARK-15934 > URL: https://issues.apache.org/jira/browse/SPARK-15934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Fix For: 2.0.0 > > > In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). > This was a greatly irresponsible step, given that binary mode was the default > in 1.6.1 and was turned off in 2.0.0. > To describe the magnitude of harm that not fixing this bug would do in my > organization: > * Tableau works only through the Thrift Server and only with the binary format. > Tableau would not work with spark-2.0.0 at all! > * I have a bunch of analysts in my organization with configured SQL > clients (DataGrip and Squirrel). I would need to go one by one to change the > connection string for them (DataGrip). Squirrel simply does not work with HTTP - > some jar hell in my case. > * And that's not to mention everything else that connects to our data > infrastructure through the ThriftServer as a gateway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15776) Type coercion incorrect
[ https://issues.apache.org/jira/browse/SPARK-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15776. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13651 [https://github.com/apache/spark/pull/13651] > Type coercion incorrect > --- > > Key: SPARK-15776 > URL: https://issues.apache.org/jira/browse/SPARK-15776 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Spark based on commit > 26c1089c37149061f838129bb53330ded68ff4c9 >Reporter: Weizhong >Priority: Minor > Fix For: 2.0.0 > > > {code:sql} > CREATE TABLE cdr ( > debet_dt int , > srv_typ_cdstring , > b_brnd_cd smallint , > call_dur int > ) > ROW FORMAT delimited fields terminated by ',' > STORED AS TEXTFILE; > {code} > {code:sql} > SELECT debet_dt, >SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END) > FROM cdr > GROUP BY debet_dt > ORDER BY debet_dt; > {code} > {noformat} > == Analyzed Logical Plan == > debet_dt: int, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) ELSE 0 > END): bigint > Project [debet_dt#16, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) > ELSE 0 END)#27L] > +- Sort [debet_dt#16 ASC], true >+- Aggregate [debet_dt#16], [debet_dt#16, sum(cast(CASE WHEN srv_typ_cd#18 > LIKE 0% THEN (cast(call_dur#21 as double) / cast(60 as double)) ELSE cast(0 > as double) END as bigint)) AS sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur > / 60) ELSE 0 END)#27L] > +- MetastoreRelation default, cdr > {noformat} > {code:sql} > SELECT debet_dt, >SUM(CASE WHEN b_brnd_cd IN(1) THEN call_dur / 60 ELSE 0 END) > FROM cdr > GROUP BY debet_dt > ORDER BY debet_dt; > {code} > {noformat} > == Analyzed Logical Plan == > debet_dt: int, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS INT))) > THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) > END): double > Project [debet_dt#76, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS > INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 
AS DOUBLE)) ELSE CAST(0 AS > DOUBLE) END)#87] > +- Sort [debet_dt#76 ASC], true >+- Aggregate [debet_dt#76], [debet_dt#76, sum(CASE WHEN cast(b_brnd_cd#80 > as int) IN (cast(1 as int)) THEN (cast(call_dur#81 as double) / cast(60 as > double)) ELSE cast(0 as double) END) AS sum(CASE WHEN (CAST(b_brnd_cd AS INT) > IN (CAST(1 AS INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) > ELSE CAST(0 AS DOUBLE) END)#87] > +- MetastoreRelation default, cdr > {noformat} > The only difference is WHEN condition, but will result different output > column type(one is bigint, one is double) > We need to apply "Division" before "FunctionArgumentConversion", like below: > {code:java} > val typeCoercionRules = > PropagateTypes :: > InConversion :: > WidenSetOperationTypes :: > PromoteStrings :: > DecimalPrecision :: > BooleanEquality :: > StringToIntegralCasts :: > Division :: > FunctionArgumentConversion :: > CaseWhenCoercion :: > IfCoercion :: > PropagateTypes :: > ImplicitTypeCasts :: > DateTimeOperations :: > Nil > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
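The rule-ordering problem can be sketched with a simplified type lattice: if the Division rule (int / int → double) runs *after* the CASE branches are unified, the THEN branch is still int when unification happens, and SUM over the integral result widens to bigint instead of double. This is a toy illustration of that interaction, not Catalyst's actual coercion rules:

```python
LATTICE = ["int", "bigint", "double"]

def widest(types):
    # Simplified CaseWhenCoercion: unify branch types to the widest one.
    return max(types, key=LATTICE.index)

def case_type(division_before_unification):
    # CASE WHEN ... THEN call_dur / 60 ELSE 0 END
    # If Division already ran, the THEN branch is double; otherwise int/int
    # is still typed int at unification time.
    then_type = "double" if division_before_unification else "int"
    return widest([then_type, "int"])

def sum_type(t):
    # SUM over integral types widens to bigint; over double it stays double.
    return "bigint" if t in ("int", "bigint") else "double"
```

Running Division before the branch unification yields the expected double aggregate; running it after yields the bigint seen in the first query above.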
[jira] [Updated] (SPARK-15776) Type coercion incorrect
[ https://issues.apache.org/jira/browse/SPARK-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15776: Assignee: Sean Zhong > Type coercion incorrect > --- > > Key: SPARK-15776 > URL: https://issues.apache.org/jira/browse/SPARK-15776 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Spark based on commit > 26c1089c37149061f838129bb53330ded68ff4c9 >Reporter: Weizhong >Assignee: Sean Zhong >Priority: Minor > Fix For: 2.0.0 > > > {code:sql} > CREATE TABLE cdr ( > debet_dt int , > srv_typ_cdstring , > b_brnd_cd smallint , > call_dur int > ) > ROW FORMAT delimited fields terminated by ',' > STORED AS TEXTFILE; > {code} > {code:sql} > SELECT debet_dt, >SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END) > FROM cdr > GROUP BY debet_dt > ORDER BY debet_dt; > {code} > {noformat} > == Analyzed Logical Plan == > debet_dt: int, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) ELSE 0 > END): bigint > Project [debet_dt#16, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) > ELSE 0 END)#27L] > +- Sort [debet_dt#16 ASC], true >+- Aggregate [debet_dt#16], [debet_dt#16, sum(cast(CASE WHEN srv_typ_cd#18 > LIKE 0% THEN (cast(call_dur#21 as double) / cast(60 as double)) ELSE cast(0 > as double) END as bigint)) AS sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur > / 60) ELSE 0 END)#27L] > +- MetastoreRelation default, cdr > {noformat} > {code:sql} > SELECT debet_dt, >SUM(CASE WHEN b_brnd_cd IN(1) THEN call_dur / 60 ELSE 0 END) > FROM cdr > GROUP BY debet_dt > ORDER BY debet_dt; > {code} > {noformat} > == Analyzed Logical Plan == > debet_dt: int, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS INT))) > THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) > END): double > Project [debet_dt#76, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS > INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS > DOUBLE) END)#87] > +- Sort [debet_dt#76 ASC], true 
>+- Aggregate [debet_dt#76], [debet_dt#76, sum(CASE WHEN cast(b_brnd_cd#80 > as int) IN (cast(1 as int)) THEN (cast(call_dur#81 as double) / cast(60 as > double)) ELSE cast(0 as double) END) AS sum(CASE WHEN (CAST(b_brnd_cd AS INT) > IN (CAST(1 AS INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) > ELSE CAST(0 AS DOUBLE) END)#87] > +- MetastoreRelation default, cdr > {noformat} > The only difference between the two queries is the WHEN condition, yet they produce different output > column types (one is bigint, the other double). > We need to apply "Division" before "FunctionArgumentConversion", like below: > {code:java} > val typeCoercionRules = > PropagateTypes :: > InConversion :: > WidenSetOperationTypes :: > PromoteStrings :: > DecimalPrecision :: > BooleanEquality :: > StringToIntegralCasts :: > Division :: > FunctionArgumentConversion :: > CaseWhenCoercion :: > IfCoercion :: > PropagateTypes :: > ImplicitTypeCasts :: > DateTimeOperations :: > Nil > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
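Why the rule order matters can be sketched with a toy resolver. This is a deliberately simplified Python model, not Spark's actual Catalyst analyzer; all names below are illustrative. The point is only that if the Division rule has not yet rewritten `int / int` to a double division when the CASE branches are coerced, the CASE (and hence the SUM) resolves to an integral type instead of double:

```python
# Toy model of type-coercion rule ordering -- NOT Spark's real analyzer.
# It only illustrates why Division must run before the CASE coercion rules.

def division_rule(expr):
    """Spark's Division rule: integral / integral is widened to double."""
    op, left, right = expr
    if op == "/" and left == "int" and right == "int":
        return (op, "double", "double")
    return expr

def branch_type(branch):
    """Resolve one CASE branch to a type under the current rule state."""
    if isinstance(branch, tuple):          # an arithmetic expression
        _, left, _ = branch
        # int / int that Division has not yet rewritten stays integral
        return "double" if left == "double" else "bigint"
    return branch                          # already a plain type

def case_type(branches, division_first):
    """Result type of CASE WHEN ... THEN call_dur/60 ELSE 0 END."""
    if division_first:
        branches = [division_rule(b) if isinstance(b, tuple) else b
                    for b in branches]
    types = [branch_type(b) for b in branches]
    # CaseWhenCoercion: the widest branch type wins
    return "double" if "double" in types else "bigint"

# THEN call_dur / 60 (int / int), ELSE 0 (integral)
branches = [("/", "int", "int"), "bigint"]

print(case_type(branches, division_first=True))   # double  (correct)
print(case_type(branches, division_first=False))  # bigint  (the bug)
```

With Division applied first, both example queries would coerce the CASE to double and the two plans would agree on the aggregate's output type.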
[jira] [Updated] (SPARK-15319) Fix SparkR doc layout for corr and other DataFrame stats functions
[ https://issues.apache.org/jira/browse/SPARK-15319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15319: -- Issue Type: Documentation (was: Bug) > Fix SparkR doc layout for corr and other DataFrame stats functions > -- > > Key: SPARK-15319 > URL: https://issues.apache.org/jira/browse/SPARK-15319 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 1.6.1, 2.0.0 >Reporter: Felix Cheung >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15319) Fix SparkR doc layout for corr and other DataFrame stats functions
[ https://issues.apache.org/jira/browse/SPARK-15319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15319: -- Affects Version/s: 2.0.0 > Fix SparkR doc layout for corr and other DataFrame stats functions > -- > > Key: SPARK-15319 > URL: https://issues.apache.org/jira/browse/SPARK-15319 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 1.6.1, 2.0.0 >Reporter: Felix Cheung >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15974) Create a socket on YARN AM start-up
Mingyu Kim created SPARK-15974: -- Summary: Create a socket on YARN AM start-up Key: SPARK-15974 URL: https://issues.apache.org/jira/browse/SPARK-15974 Project: Spark Issue Type: New Feature Components: YARN Reporter: Mingyu Kim YARN provides a way for the ApplicationMaster to register an RPC port so that a client outside the YARN cluster can reach the application for any RPCs, but Spark's YARN AMs simply register a dummy port number of 0. For Spark programs that start up a server, this makes it hard for the submitter to discover the server port securely. Spark's ApplicationMaster should optionally create a ServerSocket and pass it to the Spark user program. This socket initialization should be disabled by default. Some discussion on the dev@spark thread: http://apache-spark-developers-list.1001551.n3.nabble.com/Utilizing-YARN-AM-RPC-port-field-td17892.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
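The core of the proposal, binding a server socket to an OS-assigned ephemeral port and then discovering the actual port so it can be registered with YARN instead of the dummy 0, can be sketched as follows. This is an illustrative Python snippet, not Spark's (Scala/Java) ApplicationMaster code, and the function name is invented:

```python
import socket

def open_am_socket(host="0.0.0.0"):
    """Bind to port 0 so the OS picks a free ephemeral port, then
    return the socket together with the port it actually bound to.
    The AM could register this port with YARN instead of a dummy 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))           # port 0 = let the OS choose
    sock.listen(1)
    port = sock.getsockname()[1]   # discover what the OS assigned
    return sock, port

sock, port = open_am_socket()
print(f"AM listening on port {port}")
sock.close()
```

The same pattern (`new ServerSocket(0)` plus `getLocalPort()`) applies on the JVM side, where the AM would pass the open socket on to the user program.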
[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332612#comment-15332612 ] Apache Spark commented on SPARK-15851: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13691 > Spark 2.0 does not compile in Windows 7 > --- > > Key: SPARK-15851 > URL: https://issues.apache.org/jira/browse/SPARK-15851 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 > Environment: Windows 7 >Reporter: Alexander Ulanov > > Spark does not compile in Windows 7. > "mvn compile" fails on spark-core due to trying to execute a bash script > spark-build-info. > Work around: > 1)Install win-bash and put in path > 2)Change line 350 of core/pom.xml > > > > > > Error trace: > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project > spark-core_2.11: An Ant BuildException has occured: Execute failed: > java.io.IOException: Cannot run program > "C:\dev\spark\core\..\build\spark-build-info" (in directory > "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 > application > [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in > C:\dev\spark\core\target\antrun\build-main.xml -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests
Josh Rosen created SPARK-15975: -- Summary: Improper Popen.wait() return code handling in dev/run-tests Key: SPARK-15975 URL: https://issues.apache.org/jira/browse/SPARK-15975 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 1.6.0 Reporter: Josh Rosen Assignee: Josh Rosen In dev/run-tests.py there's a line where we effectively do {code} retcode = some_popen_instance.wait() if retcode > 0: err # else do nothing {code} but this code is subtly wrong because Popen's return code will be negative if the child process was terminated by a signal: https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode We should change this to {{retcode != 0}} so that we properly error out and exit due to termination by signal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332620#comment-15332620 ] Apache Spark commented on SPARK-15975: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/13692 > Improper Popen.wait() return code handling in dev/run-tests > --- > > Key: SPARK-15975 > URL: https://issues.apache.org/jira/browse/SPARK-15975 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In dev/run-tests.py there's a line where we effectively do > {code} > retcode = some_popen_instance.wait() > if retcode > 0: > err > # else do nothing > {code} > but this code is subtly wrong because Popen's return code will be negative > if the child process was terminated by a signal: > https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode > We should change this to {{retcode != 0}} so that we properly error out and > exit due to termination by signal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
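The behavior the report describes is easy to reproduce on a POSIX system. In this standalone snippet (not the actual dev/run-tests.py code), a child interpreter kills itself with SIGTERM; `Popen.wait()` then returns a negative code, which the `retcode > 0` check silently misses:

```python
import signal
import subprocess
import sys

# Child kills itself with SIGTERM; Popen.wait() then returns a
# *negative* code (-SIGTERM), which `retcode > 0` silently misses.
proc = subprocess.Popen(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"]
)
retcode = proc.wait()
print(retcode)                    # -15 on Linux (= -signal.SIGTERM)

assert retcode == -signal.SIGTERM
assert not (retcode > 0)          # the buggy check would not error out
assert retcode != 0               # the fixed check catches it
```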
[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model
[ https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332588#comment-15332588 ] Gayathri Murali commented on SPARK-15930: - [~yuhaoyan] If you haven't already started working on this, I can send the PR. > Add Row count property to FPGrowth model > > > Key: SPARK-15930 > URL: https://issues.apache.org/jira/browse/SPARK-15930 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.1 >Reporter: John Aherne >Priority: Minor > Labels: fp-growth, mllib > > Add a row count property to MLlib's FPGrowth model. > When using the model from FPGrowth, a count of the total number of records is > often necessary. > It appears that the function already calculates that value when training the > model, so it would save time not having to do it again outside the model. > Sorry if this is the wrong place for this kind of stuff. I am new to Jira, > Spark, and making suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3847) Enum.hashCode is only consistent within the same JVM
[ https://issues.apache.org/jira/browse/SPARK-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332587#comment-15332587 ] Brett Stime commented on SPARK-3847: Another option: have the default behavior be 'safe' and not share hashCodes between JVMs. If passing hashCodes really does significantly improve performance (when used outside of arrays and enums), there could be a special configuration setting to enable inter-JVM hashCodes. E.g., something like spark.shuffle.i_solemnly_swear_my_keys_have_consistent_hashes, which can be set to true to enable the performant behavior. This would provide for discoverable documentation of the issue and make it relatively easy to compare/test results from either mode to the other. > Enum.hashCode is only consistent within the same JVM > > > Key: SPARK-3847 > URL: https://issues.apache.org/jira/browse/SPARK-3847 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Oracle JDK 7u51 64bit on Ubuntu 12.04 >Reporter: Nathan Bijnens > Labels: enum > > When using Java enums as keys in some operations, the results can be very > unexpected. The issue is that Java's Enum.hashCode returns a value derived from the > memory position, which is different on each JVM. > {code} > messages.filter(_.getHeader.getKind == Kind.EVENT).count > >> 503650 > val tmp = messages.filter(_.getHeader.getKind == Kind.EVENT) > tmp.map(_.getHeader.getKind).countByValue > >> Map(EVENT -> 1389) > {code} > Because it's actually a JVM issue, we should either reject enums as keys with an error > or implement a workaround. 
> A good writeup of the issue can be found here (and a workaround): > http://dev.bizo.com/2014/02/beware-enums-in-spark.html > Some more on hash codes and enums: > https://stackoverflow.com/questions/4885095/what-is-the-reason-behind-enum-hashcode > And some issues (most of them rejected) in the Oracle Java bug database: > - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8050217 > - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7190798 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
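Python has a closely analogous pitfall that makes for a compact, runnable illustration (this snippet is not Spark code): with hash randomization, `hash()` of a string, and of anything that hashes through one, differs from process to process, so hash-partitioned operations like `countByValue` misbehave in exactly the same way as with Java enum keys across JVMs. The fix is the same in spirit: make every process agree on the hash (here by pinning `PYTHONHASHSEED`, which PySpark itself does on its workers):

```python
import os
import subprocess
import sys

def hash_in_subprocess(seed):
    """Compute hash('EVENT') in a fresh interpreter with the given
    PYTHONHASHSEED, mimicking two different worker processes."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('EVENT'))"], env=env
    )
    return int(out)

# Two "workers" with different hash seeds disagree on the key's hash,
# so hash-partitioned grouping goes wrong -- the same failure mode as
# Java's Enum.hashCode differing across JVMs.
print(hash_in_subprocess(1) != hash_in_subprocess(2))

# The workaround: pin the seed on every process so hashes agree.
print(hash_in_subprocess(42) == hash_in_subprocess(42))  # True
```

For Java enums specifically, the usual workaround is to key by a stable value such as `Enum.name()` or `Enum.ordinal()` rather than the enum object itself.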
[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332578#comment-15332578 ] Reynold Xin commented on SPARK-15447: - We can close this one now can't we? > Performance test for ALS in Spark 2.0 > - > > Key: SPARK-15447 > URL: https://issues.apache.org/jira/browse/SPARK-15447 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Nick Pentreath >Priority: Critical > Labels: QA > > We made several changes to ALS in 2.0. It is necessary to run some tests to > avoid performance regression. We should test (synthetic) datasets from 1 > million ratings to 1 billion ratings. > cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance > tests? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
[ https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15901: --- Assignee: Xiao Li > Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET > -- > > Key: SPARK-15901 > URL: https://issues.apache.org/jira/browse/SPARK-15901 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > So far, we do not have test cases for verifying whether the external > parameters CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly > when users use non-default values. Adding test cases to avoid regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
[ https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-15901. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13622 [https://github.com/apache/spark/pull/13622] > Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET > -- > > Key: SPARK-15901 > URL: https://issues.apache.org/jira/browse/SPARK-15901 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > So far, we do not have test cases for verifying whether the external > parameters CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly > when users use non-default values. Adding test cases to avoid regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org