[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support
[ https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372261#comment-15372261 ] Dongjoon Hyun commented on SPARK-16452: --- Hi, [~rxin]. I have a question about the VIEW description (VIEW SQL). `SQLBuilder` lives in `sql/core`, while temporary views (like the information schema) live in the `SessionCatalog` of `sql/catalyst` as a private map. Since `sql/catalyst` cannot generate view SQL by itself, `InformationSchema` needs some API to access them. That does not seem desirable, but is it okay to add one? > basic INFORMATION_SCHEMA support > > > Key: SPARK-16452 > URL: https://issues.apache.org/jira/browse/SPARK-16452 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: INFORMATION_SCHEMAsupport.pdf > > > INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a > few tables as defined in SQL92 standard to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13514) Spark Shuffle Service 1.6.0 issue in Yarn
[ https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372252#comment-15372252 ] Kevin commented on SPARK-13514: --- I have run into this problem recently as well. Are there any other solutions to it, or is it fixed in the later releases 1.6.1 and 1.6.2? > Spark Shuffle Service 1.6.0 issue in Yarn > -- > > Key: SPARK-13514 > URL: https://issues.apache.org/jira/browse/SPARK-13514 > Project: Spark > Issue Type: Bug >Reporter: Satish Kolli > > Spark shuffle service 1.6.0 in Yarn fails with an unknown exception. When I > replace the spark shuffle jar with version 1.5.2 jar file, the following > succeeds with out any issues. > Hadoop Version: 2.5.1 (Kerberos Enabled) > Spark Version: 1.6.0 > Java Version: 1.7.0_79 > {code} > $SPARK_HOME/bin/spark-shell \ > --master yarn \ > --deploy-mode client \ > --conf spark.dynamicAllocation.enabled=true \ > --conf spark.dynamicAllocation.minExecutors=5 \ > --conf spark.yarn.executor.memoryOverhead=2048 \ > --conf spark.shuffle.service.enabled=true \ > --conf spark.scheduler.mode=FAIR \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --executor-memory 6G \ > --driver-memory 8G > {code} > {code} > scala> val df = sc.parallelize(1 to 50).toDF > df: org.apache.spark.sql.DataFrame = [_1: int] > scala> df.show(50) > {code} > {code} > 16/02/26 08:20:53 INFO spark.SparkContext: Starting job: show at :30 > 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Got job 0 (show at > :30) with 1 output partitions > 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 > (show at :30) > 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Parents of final stage: List() > 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Missing parents: List() > 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting ResultStage 0 > (MapPartitionsRDD[2] at show at :30), which has no missing parents > 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0 stored as > values in memory (estimated size 2.2 KB, free 2.2 KB) > 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0_piece0 stored > as bytes in memory (estimated size 1411.0 B, free 3.6 KB) > 16/02/26 08:20:53 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in > memory on 10.5.76.106:46683 (size: 1411.0 B, free: 5.5 GB) > 16/02/26 08:20:53 INFO spark.SparkContext: Created broadcast 0 from broadcast > at DAGScheduler.scala:1006 > 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 0 (MapPartitionsRDD[2] at show at :30) > 16/02/26 08:20:53 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks > 16/02/26 08:20:53 INFO scheduler.FairSchedulableBuilder: Added task set > TaskSet_0 tasks to pool default > 16/02/26 08:20:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 0.0 (TID 0, , partition 0,PROCESS_LOCAL, 2031 bytes) > 16/02/26 08:20:53 INFO cluster.YarnClientSchedulerBackend: Disabling executor > 2. > 16/02/26 08:20:54 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 0) > 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Trying to remove > executor 2 from BlockManagerMaster.
> 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Removing block > manager BlockManagerId(2, , 48113) > 16/02/26 08:20:54 INFO storage.BlockManagerMaster: Removed 2 successfully in > removeExecutor > 16/02/26 08:20:54 ERROR cluster.YarnScheduler: Lost executor 2 on > : Container marked as failed: > container_1456492687549_0001_01_03 on host: . > Exit status: 1. Diagnostics: Exception from container-launch: > ExitCodeException exitCode=1: > ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Container exited with a non-zero exit code 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --
[jira] [Commented] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372211#comment-15372211 ] Xin Ren commented on SPARK-16437: - I did find some minor improvements during my debugging, though, and will submit a PR tomorrow. > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xin Ren >Priority: Minor > > build SparkR with command > {code} > build/mvn -DskipTests -Psparkr package > {code} > start SparkR console > {code} > ./bin/sparkR > {code} > then get error > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > > > > > library(SparkR) > > > > df <- read.df("examples/src/main/resources/users.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > > > > > > head(df) > 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > name favorite_color favorite_numbers > 1 Alyssa3, 9, 15, 20 > 2Benred NULL > {code} > Reference > * seems need to add a lib from slf4j to point to older version > http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
[ https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372210#comment-15372210 ] Xin Ren commented on SPARK-16437: - I worked on this for a couple of days and found that it's not caused by Spark but by the parquet library "parquet-mr/parquet-hadoop". I've debugged it step by step and found this error comes from here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L820 and, after digging into "parquet-hadoop", it's most probably because this library is missing the SLF4J binder: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L231 But it's technically not a bug: Spark uses SLF4J 1.7.16, and since version 1.6 SLF4J defaults to a no-operation (NOP) logger implementation, so this should be OK. > SparkR read.df() from parquet got error: SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder" > > > Key: SPARK-16437 > URL: https://issues.apache.org/jira/browse/SPARK-16437 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Xin Ren >Priority: Minor > > build SparkR with command > {code} > build/mvn -DskipTests -Psparkr package > {code} > start SparkR console > {code} > ./bin/sparkR > {code} > then get error > {code} > Welcome to > __ >/ __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > SparkSession available as 'spark'. > > > > > > library(SparkR) > > > > df <- read.df("examples/src/main/resources/users.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > > > > > > head(df) > 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > name favorite_color favorite_numbers > 1 Alyssa3, 9, 15, 20 > 2Benred NULL > {code} > Reference > * seems need to add a lib from slf4j to point to older version > http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder > * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
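A minimal sketch of the workaround hinted at by the references above, assuming an sbt-based application build (the particular binding chosen here is illustrative, not something the ticket prescribes): adding any SLF4J binding that matches the SLF4J 1.7.16 API used by Spark makes org.slf4j.impl.StaticLoggerBinder resolvable and silences the warning.
{code}
// Hypothetical build.sbt fragment: provide an SLF4J binding (log4j here) so the
// "Failed to load class org.slf4j.impl.StaticLoggerBinder" warning goes away.
libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.16"
{code}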
[jira] [Resolved] (SPARK-16199) Add a method to list the referenced columns in data source Filter
[ https://issues.apache.org/jira/browse/SPARK-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16199. - Resolution: Fixed Assignee: Peter Lee (was: Reynold Xin) Fix Version/s: 2.1.0 > Add a method to list the referenced columns in data source Filter > - > > Key: SPARK-16199 > URL: https://issues.apache.org/jira/browse/SPARK-16199 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Peter Lee > Fix For: 2.1.0 > > > It would be useful to support listing the columns that are referenced by a > filter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
[ https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12639: - Assignee: Russell Alexander Spitzer > Improve Explain for DataSources with Handled Predicate Pushdowns > > > Key: SPARK-12639 > URL: https://issues.apache.org/jira/browse/SPARK-12639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Russell Alexander Spitzer >Assignee: Russell Alexander Spitzer >Priority: Minor > Fix For: 2.1.0 > > > SPARK-11661 improves handling of predicate pushdowns but has an unintended > consequence of making the explain string more confusing. > It basically makes it seem as if a source is always pushing down all of the > filters (even those it cannot handle) > This can have a confusing effect (I kept checking my code to see where I had > broken something ) > {code: title= "Query plan for source where nothing is handled by C* Source"} > Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1)) > +- Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] > PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] > {code} > Although the tell tale "Filter" step is present my first instinct would tell > me that the underlying source relation is using all of those filters. > {code: title = "Query plan for source where everything is handled by C* > Source"} > Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] > PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] > {code} > I think this would be much clearer if we changed the metadata key to > "HandledFilters" and only listed those handled fully by the underlying source. > Something like > {code: title="Proposed Explain for Pushdown were none of the predicates are > handled by the underlying source"} > Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1)) > +- Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] > HandledFilters: [] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
[ https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12639. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 11317 [https://github.com/apache/spark/pull/11317] > Improve Explain for DataSources with Handled Predicate Pushdowns > > > Key: SPARK-12639 > URL: https://issues.apache.org/jira/browse/SPARK-12639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Russell Alexander Spitzer >Priority: Minor > Fix For: 2.1.0 > > > SPARK-11661 improves handling of predicate pushdowns but has an unintended > consequence of making the explain string more confusing. > It basically makes it seem as if a source is always pushing down all of the > filters (even those it cannot handle) > This can have a confusing effect (I kept checking my code to see where I had > broken something ) > {code: title= "Query plan for source where nothing is handled by C* Source"} > Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1)) > +- Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] > PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] > {code} > Although the tell tale "Filter" step is present my first instinct would tell > me that the underlying source relation is using all of those filters. > {code: title = "Query plan for source where everything is handled by C* > Source"} > Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] > PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] > {code} > I think this would be much clearer if we changed the metadata key to > "HandledFilters" and only listed those handled fully by the underlying source. > Something like > {code: title="Proposed Explain for Pushdown were none of the predicates are > handled by the underlying source"} > Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1)) > +- Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] > HandledFilters: [] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16482) If a table's schema is inferred at runtime, describe table command does not show the schema
[ https://issues.apache.org/jira/browse/SPARK-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16482: Assignee: Apache Spark > If a table's schema is inferred at runtime, describe table command does not > show the schema > --- > > Key: SPARK-16482 > URL: https://issues.apache.org/jira/browse/SPARK-16482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > If we create a table pointing to a parquet/json datasets without specifying > the schema, describe table command does not show the schema at all. It only > shows {{# Schema of this table is inferred at runtime}}. In 1.6, describe > table does show the schema of such a table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16482) If a table's schema is inferred at runtime, describe table command does not show the schema
[ https://issues.apache.org/jira/browse/SPARK-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16482: Assignee: (was: Apache Spark) > If a table's schema is inferred at runtime, describe table command does not > show the schema > --- > > Key: SPARK-16482 > URL: https://issues.apache.org/jira/browse/SPARK-16482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Priority: Critical > > If we create a table pointing to a parquet/json datasets without specifying > the schema, describe table command does not show the schema at all. It only > shows {{# Schema of this table is inferred at runtime}}. In 1.6, describe > table does show the schema of such a table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16482) If a table's schema is inferred at runtime, describe table command does not show the schema
[ https://issues.apache.org/jira/browse/SPARK-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372145#comment-15372145 ] Apache Spark commented on SPARK-16482: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14148 > If a table's schema is inferred at runtime, describe table command does not > show the schema > --- > > Key: SPARK-16482 > URL: https://issues.apache.org/jira/browse/SPARK-16482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Priority: Critical > > If we create a table pointing to a parquet/json datasets without specifying > the schema, describe table command does not show the schema at all. It only > shows {{# Schema of this table is inferred at runtime}}. In 1.6, describe > table does show the schema of such a table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15440) Add CSRF Filter for REST APIs to Spark
[ https://issues.apache.org/jira/browse/SPARK-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-15440. - Resolution: Won't Fix > Add CSRF Filter for REST APIs to Spark > -- > > Key: SPARK-15440 > URL: https://issues.apache.org/jira/browse/SPARK-15440 > Project: Spark > Issue Type: New Feature > Components: Deploy, Spark Core >Reporter: Yanbo Liang > > CSRF prevention for REST APIs can be provided through a common servlet > filter. This filter would check for the existence of a custom HTTP header - > such as X-XSRF-Header. > Because CSRF attacks are entirely browser-based, this approach can ensure > that requests come either from applications served by the same origin as the > REST API, or from another origin only when there is explicit policy > configuration that allows the setting of a header on XmlHttpRequest. > We have done similar work for Hadoop > (https://issues.apache.org/jira/browse/HADOOP-12691) and other components. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
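Although the proposal was resolved as Won't Fix, the mechanism it describes is easy to picture. Below is a minimal sketch of such a filter in Scala against the standard javax.servlet API; the header name and rejection status are illustrative assumptions, not an existing Spark component.
{code}
import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Sketch of a CSRF-prevention filter: reject any request that does not carry
// the custom header, since a browser will not add it on a cross-origin
// XmlHttpRequest unless an explicit CORS policy allows it.
class CsrfHeaderFilter extends Filter {
  private val headerName = "X-XSRF-Header" // illustrative header name

  override def init(config: FilterConfig): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    (req, res) match {
      case (httpReq: HttpServletRequest, httpRes: HttpServletResponse)
          if httpReq.getHeader(headerName) == null =>
        httpRes.sendError(HttpServletResponse.SC_BAD_REQUEST, s"Missing required header $headerName")
      case _ =>
        chain.doFilter(req, res)
    }
  }

  override def destroy(): Unit = {}
}
{code}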
[jira] [Resolved] (SPARK-16488) Codegen variable namespace collision for pmod and partitionBy
[ https://issues.apache.org/jira/browse/SPARK-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16488. - Resolution: Fixed Assignee: Sameer Agarwal Fix Version/s: 2.0.0 > Codegen variable namespace collision for pmod and partitionBy > - > > Key: SPARK-16488 > URL: https://issues.apache.org/jira/browse/SPARK-16488 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > Reported by [~brkyvz]. Original description below: > The generated code used by `pmod` conflicts with DataFrameWriter.partitionBy > Quick repro: > {code} > import org.apache.spark.sql.functions._ > case class Test(a: Int, b: String) > val ds = Seq(Test(0, "a"), Test(1, "b"), Test(1, > "a")).toDS.createOrReplaceTempView("test") > sql(""" > select > * > from > test > distribute by > pmod(a, 2) > """) > .write > .partitionBy("b") > .mode("overwrite") > .parquet("/tmp/repro") > {code} > You may also use repartition with the function `pmod` instead of using `pmod` > inside `distribute by` in sql. > Example generated code (two variables defined as r): > {code} > /* 025 */ public UnsafeRow apply(InternalRow i) { > /* 026 */ int value1 = 42; > /* 027 */ > /* 028 */ boolean isNull2 = i.isNullAt(0); > /* 029 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 030 */ if (!isNull2) { > /* 031 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), > value2.getBaseOffset(), value2.numBytes(), value1); > /* 032 */ } > /* 033 */ > /* 034 */ > /* 035 */ int value4 = 42; > /* 036 */ > /* 037 */ boolean isNull5 = i.isNullAt(1); > /* 038 */ UTF8String value5 = isNull5 ? null : (i.getUTF8String(1)); > /* 039 */ if (!isNull5) { > /* 040 */ value4 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), > value5.getBaseOffset(), value5.numBytes(), value4); > /* 041 */ } > /* 042 */ > /* 043 */ int value3 = -1; > /* 044 */ > /* 045 */ int r = value4 % 10; > /* 046 */ if (r < 0) { > /* 047 */ value3 = (r + 10) % 10; > /* 048 */ } else { > /* 049 */ value3 = r; > /* 050 */ } > /* 051 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); > /* 052 */ > /* 053 */ int value = -1; > /* 054 */ > /* 055 */ int r = value1 % 200; > /* 056 */ if (r < 0) { > /* 057 */ value = (r + 200) % 200; > /* 058 */ } else { > /* 059 */ value = r; > /* 060 */ } > /* 061 */ rowWriter.write(0, value); > /* 062 */ return result; > /* 063 */ } > /* 064 */ } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16490) Python mllib example for chi-squared feature selector
[ https://issues.apache.org/jira/browse/SPARK-16490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-16490: -- Labels: starter (was: ) > Python mllib example for chi-squared feature selector > - > > Key: SPARK-16490 > URL: https://issues.apache.org/jira/browse/SPARK-16490 > Project: Spark > Issue Type: Task > Components: MLlib, PySpark >Reporter: Shuai Lin >Priority: Minor > Labels: starter > > There are Java & Scala examples for {{ChiSqSelector}} in mllib, but the > corresponding Python example is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372068#comment-15372068 ] Vladimir Ivanov commented on SPARK-16334: - Hi Herman, thank you for the reply! I was no longer able to reproduce this error with the spark.sql.parquet.enableVectorizedReader setting set to false in Spark 2.0. As for steps to reproduce: unfortunately I have only reproduced it on partitioned parquet files containing client-sensitive data (we are calling SparkSession programmatically from our app), so I can't provide the data here. Let's see what I can do. > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
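For reference, the workaround mentioned in the comment above can be applied directly from the Spark 2.0 shell; a minimal sketch, assuming the usual {{spark}} SparkSession is in scope and reusing the query from the report:
{code}
// Disable the vectorized Parquet reader as a workaround, then re-run the failing query.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from blabla where user_id = 415706251").show()
{code}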
[jira] [Assigned] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14812: Assignee: Apache Spark (was: Joseph K. Bradley) > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372043#comment-15372043 ] Apache Spark commented on SPARK-14812: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/14147 > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14812: Assignee: Joseph K. Bradley (was: Apache Spark) > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-14812: - Assignee: Joseph K. Bradley (was: DB Tsai) > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16491) Crc32 should use different variable names (not "checksum")
Reynold Xin created SPARK-16491: --- Summary: Crc32 should use different variable names (not "checksum") Key: SPARK-16491 URL: https://issues.apache.org/jira/browse/SPARK-16491 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16488) Codegen variable namespace collision for pmod and partitionBy
[ https://issues.apache.org/jira/browse/SPARK-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16488: Issue Type: Sub-task (was: Bug) Parent: SPARK-16489 > Codegen variable namespace collision for pmod and partitionBy > - > > Key: SPARK-16488 > URL: https://issues.apache.org/jira/browse/SPARK-16488 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > > Reported by [~brkyvz]. Original description below: > The generated code used by `pmod` conflicts with DataFrameWriter.partitionBy > Quick repro: > {code} > import org.apache.spark.sql.functions._ > case class Test(a: Int, b: String) > val ds = Seq(Test(0, "a"), Test(1, "b"), Test(1, > "a")).toDS.createOrReplaceTempView("test") > sql(""" > select > * > from > test > distribute by > pmod(a, 2) > """) > .write > .partitionBy("b") > .mode("overwrite") > .parquet("/tmp/repro") > {code} > You may also use repartition with the function `pmod` instead of using `pmod` > inside `distribute by` in sql. > Example generated code (two variables defined as r): > {code} > /* 025 */ public UnsafeRow apply(InternalRow i) { > /* 026 */ int value1 = 42; > /* 027 */ > /* 028 */ boolean isNull2 = i.isNullAt(0); > /* 029 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 030 */ if (!isNull2) { > /* 031 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), > value2.getBaseOffset(), value2.numBytes(), value1); > /* 032 */ } > /* 033 */ > /* 034 */ > /* 035 */ int value4 = 42; > /* 036 */ > /* 037 */ boolean isNull5 = i.isNullAt(1); > /* 038 */ UTF8String value5 = isNull5 ? null : (i.getUTF8String(1)); > /* 039 */ if (!isNull5) { > /* 040 */ value4 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), > value5.getBaseOffset(), value5.numBytes(), value4); > /* 041 */ } > /* 042 */ > /* 043 */ int value3 = -1; > /* 044 */ > /* 045 */ int r = value4 % 10; > /* 046 */ if (r < 0) { > /* 047 */ value3 = (r + 10) % 10; > /* 048 */ } else { > /* 049 */ value3 = r; > /* 050 */ } > /* 051 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); > /* 052 */ > /* 053 */ int value = -1; > /* 054 */ > /* 055 */ int r = value1 % 200; > /* 056 */ if (r < 0) { > /* 057 */ value = (r + 200) % 200; > /* 058 */ } else { > /* 059 */ value = r; > /* 060 */ } > /* 061 */ rowWriter.write(0, value); > /* 062 */ return result; > /* 063 */ } > /* 064 */ } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16490) Python mllib example for chi-squared feature selector
Shuai Lin created SPARK-16490: - Summary: Python mllib example for chi-squared feature selector Key: SPARK-16490 URL: https://issues.apache.org/jira/browse/SPARK-16490 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Shuai Lin Priority: Minor There are Java & Scala examples for {{ChiSqSelector}} in mllib, but the corresponding Python example is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16489) Test harness to prevent expression code generation from reusing variable names
[ https://issues.apache.org/jira/browse/SPARK-16489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16489: Description: In code generation, it is incorrect for expressions to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which pmod expression reuses variable name "r". This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. > Test harness to prevent expression code generation from reusing variable names > -- > > Key: SPARK-16489 > URL: https://issues.apache.org/jira/browse/SPARK-16489 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > In code generation, it is incorrect for expressions to reuse variable names > across different instances of itself. As an example, SPARK-16488 reports a > bug in which pmod expression reuses variable name "r". > This patch updates ExpressionEvalHelper test harness to always project two > instances of the same expression, which will help us catch variable reuse > problems in expression unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
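The idea is easiest to see with a standalone sketch (my own illustration, not the actual ExpressionEvalHelper change): generate an unsafe projection over two instances of the same expression; any hard-coded variable name in that expression's generated code (for example the "r" reused by pmod in SPARK-16488) is then declared twice in the same generated method, and Janino refuses to compile it.
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, Literal, Pmod}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.types.IntegerType

// Project two copies of the same expression: both land in one generated
// apply() method, so a non-fresh variable name in the expression's codegen collides.
val pmod = Pmod(BoundReference(0, IntegerType, nullable = false), Literal(10))
// While SPARK-16488 is unfixed, this generate call fails to compile the generated code.
val projection = GenerateUnsafeProjection.generate(Seq(pmod, pmod))
println(projection(InternalRow(7)))
{code}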
[jira] [Resolved] (SPARK-16433) Improve StreamingQuery.explain when no data arrives
[ https://issues.apache.org/jira/browse/SPARK-16433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-16433. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 14100 [https://github.com/apache/spark/pull/14100] > Improve StreamingQuery.explain when no data arrives > --- > > Key: SPARK-16433 > URL: https://issues.apache.org/jira/browse/SPARK-16433 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > StreamingQuery.explain shows "N/A" when no data arrives. It's pretty > confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16489) Test harness to prevent expression code generation from reusing variable names
[ https://issues.apache.org/jira/browse/SPARK-16489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16489: Assignee: Reynold Xin (was: Apache Spark) > Test harness to prevent expression code generation from reusing variable names > -- > > Key: SPARK-16489 > URL: https://issues.apache.org/jira/browse/SPARK-16489 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16489) Test harness to prevent expression code generation from reusing variable names
[ https://issues.apache.org/jira/browse/SPARK-16489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372014#comment-15372014 ] Apache Spark commented on SPARK-16489: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/14146 > Test harness to prevent expression code generation from reusing variable names > -- > > Key: SPARK-16489 > URL: https://issues.apache.org/jira/browse/SPARK-16489 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16489) Test harness to prevent expression code generation from reusing variable names
[ https://issues.apache.org/jira/browse/SPARK-16489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16489: Assignee: Apache Spark (was: Reynold Xin) > Test harness to prevent expression code generation from reusing variable names > -- > > Key: SPARK-16489 > URL: https://issues.apache.org/jira/browse/SPARK-16489 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16489) Test harness to prevent expression code generation from reusing variable names
Reynold Xin created SPARK-16489: --- Summary: Test harness to prevent expression code generation from reusing variable names Key: SPARK-16489 URL: https://issues.apache.org/jira/browse/SPARK-16489 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371841#comment-15371841 ] Joseph K. Bradley edited comment on SPARK-14812 at 7/12/16 1:03 AM: [~thunterdb] has reviewed the public API for things which should be private. Issues noted in [SPARK-16485] General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Treat Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * While writing and reviewing a PR, we should be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. spark.ml, pyspark.ml * Model Summary classes remain Experimental * MLWriter, MLReader, MLWritable, MLReadable remain Experimental * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi How does this sound [~mlnick]? was (Author: josephkb): [~thunterdb] has reviewed the public API for things which should be private. Issues noted in [SPARK-16485] General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Treat Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * While writing and reviewing a PR, we should be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. spark.ml, pyspark.ml * classification ** LogisticRegression*Summary classes remain Experimental * MLWriter, MLReader, MLWritable, MLReadable remain Experimental * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi How does this sound [~mlnick]? > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16487) Some batches might not get marked as fully processed in JobGenerator
[ https://issues.apache.org/jira/browse/SPARK-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16487: Assignee: Apache Spark > Some batches might not get marked as fully processed in JobGenerator > > > Key: SPARK-16487 > URL: https://issues.apache.org/jira/browse/SPARK-16487 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ahmed Mahran >Assignee: Apache Spark >Priority: Trivial > > In JobGenerator, the code reads like that some batches might not get marked > as fully processed. In the following flowchart, the batch should get marked > fully processed before endpoint C however it is not. Currently, this does not > actually cause an issue, as the condition {code}(time - zeroTime) is multiple > of checkpoint duration?{code} always evaluates to true as the checkpoint > duration is always set to be equal to the batch duration. > !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=700! > [Image > URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16487) Some batches might not get marked as fully processed in JobGenerator
[ https://issues.apache.org/jira/browse/SPARK-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372002#comment-15372002 ] Apache Spark commented on SPARK-16487: -- User 'ahmed-mahran' has created a pull request for this issue: https://github.com/apache/spark/pull/14145 > Some batches might not get marked as fully processed in JobGenerator > > > Key: SPARK-16487 > URL: https://issues.apache.org/jira/browse/SPARK-16487 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ahmed Mahran >Priority: Trivial > > In JobGenerator, the code reads like that some batches might not get marked > as fully processed. In the following flowchart, the batch should get marked > fully processed before endpoint C however it is not. Currently, this does not > actually cause an issue, as the condition {code}(time - zeroTime) is multiple > of checkpoint duration?{code} always evaluates to true as the checkpoint > duration is always set to be equal to the batch duration. > !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=700! > [Image > URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16487) Some batches might not get marked as fully processed in JobGenerator
[ https://issues.apache.org/jira/browse/SPARK-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16487: Assignee: (was: Apache Spark) > Some batches might not get marked as fully processed in JobGenerator > > > Key: SPARK-16487 > URL: https://issues.apache.org/jira/browse/SPARK-16487 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ahmed Mahran >Priority: Trivial > > In JobGenerator, the code reads like that some batches might not get marked > as fully processed. In the following flowchart, the batch should get marked > fully processed before endpoint C however it is not. Currently, this does not > actually cause an issue, as the condition {code}(time - zeroTime) is multiple > of checkpoint duration?{code} always evaluates to true as the checkpoint > duration is always set to be equal to the batch duration. > !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=700! > [Image > URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16488) Codegen variable namespace collision for pmod and partitionBy
[ https://issues.apache.org/jira/browse/SPARK-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16488: Assignee: Apache Spark > Codegen variable namespace collision for pmod and partitionBy > - > > Key: SPARK-16488 > URL: https://issues.apache.org/jira/browse/SPARK-16488 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark > > Reported by [~brkyvz]. Original description below: > The generated code used by `pmod` conflicts with DataFrameWriter.partitionBy > Quick repro: > {code} > import org.apache.spark.sql.functions._ > case class Test(a: Int, b: String) > val ds = Seq(Test(0, "a"), Test(1, "b"), Test(1, > "a")).toDS.createOrReplaceTempView("test") > sql(""" > select > * > from > test > distribute by > pmod(a, 2) > """) > .write > .partitionBy("b") > .mode("overwrite") > .parquet("/tmp/repro") > {code} > You may also use repartition with the function `pmod` instead of using `pmod` > inside `distribute by` in sql. > Example generated code (two variables defined as r): > {code} > /* 025 */ public UnsafeRow apply(InternalRow i) { > /* 026 */ int value1 = 42; > /* 027 */ > /* 028 */ boolean isNull2 = i.isNullAt(0); > /* 029 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 030 */ if (!isNull2) { > /* 031 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), > value2.getBaseOffset(), value2.numBytes(), value1); > /* 032 */ } > /* 033 */ > /* 034 */ > /* 035 */ int value4 = 42; > /* 036 */ > /* 037 */ boolean isNull5 = i.isNullAt(1); > /* 038 */ UTF8String value5 = isNull5 ? null : (i.getUTF8String(1)); > /* 039 */ if (!isNull5) { > /* 040 */ value4 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), > value5.getBaseOffset(), value5.numBytes(), value4); > /* 041 */ } > /* 042 */ > /* 043 */ int value3 = -1; > /* 044 */ > /* 045 */ int r = value4 % 10; > /* 046 */ if (r < 0) { > /* 047 */ value3 = (r + 10) % 10; > /* 048 */ } else { > /* 049 */ value3 = r; > /* 050 */ } > /* 051 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); > /* 052 */ > /* 053 */ int value = -1; > /* 054 */ > /* 055 */ int r = value1 % 200; > /* 056 */ if (r < 0) { > /* 057 */ value = (r + 200) % 200; > /* 058 */ } else { > /* 059 */ value = r; > /* 060 */ } > /* 061 */ rowWriter.write(0, value); > /* 062 */ return result; > /* 063 */ } > /* 064 */ } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16488) Codegen variable namespace collision for pmod and partitionBy
[ https://issues.apache.org/jira/browse/SPARK-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16488: Assignee: (was: Apache Spark) > Codegen variable namespace collision for pmod and partitionBy > - > > Key: SPARK-16488 > URL: https://issues.apache.org/jira/browse/SPARK-16488 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal > > Reported by [~brkyvz]. Original description below: > The generated code used by `pmod` conflicts with DataFrameWriter.partitionBy > Quick repro: > {code} > import org.apache.spark.sql.functions._ > case class Test(a: Int, b: String) > val ds = Seq(Test(0, "a"), Test(1, "b"), Test(1, > "a")).toDS.createOrReplaceTempView("test") > sql(""" > select > * > from > test > distribute by > pmod(a, 2) > """) > .write > .partitionBy("b") > .mode("overwrite") > .parquet("/tmp/repro") > {code} > You may also use repartition with the function `pmod` instead of using `pmod` > inside `distribute by` in sql. > Example generated code (two variables defined as r): > {code} > /* 025 */ public UnsafeRow apply(InternalRow i) { > /* 026 */ int value1 = 42; > /* 027 */ > /* 028 */ boolean isNull2 = i.isNullAt(0); > /* 029 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 030 */ if (!isNull2) { > /* 031 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), > value2.getBaseOffset(), value2.numBytes(), value1); > /* 032 */ } > /* 033 */ > /* 034 */ > /* 035 */ int value4 = 42; > /* 036 */ > /* 037 */ boolean isNull5 = i.isNullAt(1); > /* 038 */ UTF8String value5 = isNull5 ? null : (i.getUTF8String(1)); > /* 039 */ if (!isNull5) { > /* 040 */ value4 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), > value5.getBaseOffset(), value5.numBytes(), value4); > /* 041 */ } > /* 042 */ > /* 043 */ int value3 = -1; > /* 044 */ > /* 045 */ int r = value4 % 10; > /* 046 */ if (r < 0) { > /* 047 */ value3 = (r + 10) % 10; > /* 048 */ } else { > /* 049 */ value3 = r; > /* 050 */ } > /* 051 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); > /* 052 */ > /* 053 */ int value = -1; > /* 054 */ > /* 055 */ int r = value1 % 200; > /* 056 */ if (r < 0) { > /* 057 */ value = (r + 200) % 200; > /* 058 */ } else { > /* 059 */ value = r; > /* 060 */ } > /* 061 */ rowWriter.write(0, value); > /* 062 */ return result; > /* 063 */ } > /* 064 */ } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16488) Codegen variable namespace collision for pmod and partitionBy
[ https://issues.apache.org/jira/browse/SPARK-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372000#comment-15372000 ] Apache Spark commented on SPARK-16488: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/14144 > Codegen variable namespace collision for pmod and partitionBy > - > > Key: SPARK-16488 > URL: https://issues.apache.org/jira/browse/SPARK-16488 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal > > Reported by [~brkyvz]. Original description below: > The generated code used by `pmod` conflicts with DataFrameWriter.partitionBy > Quick repro: > {code} > import org.apache.spark.sql.functions._ > case class Test(a: Int, b: String) > val ds = Seq(Test(0, "a"), Test(1, "b"), Test(1, > "a")).toDS.createOrReplaceTempView("test") > sql(""" > select > * > from > test > distribute by > pmod(a, 2) > """) > .write > .partitionBy("b") > .mode("overwrite") > .parquet("/tmp/repro") > {code} > You may also use repartition with the function `pmod` instead of using `pmod` > inside `distribute by` in sql. > Example generated code (two variables defined as r): > {code} > /* 025 */ public UnsafeRow apply(InternalRow i) { > /* 026 */ int value1 = 42; > /* 027 */ > /* 028 */ boolean isNull2 = i.isNullAt(0); > /* 029 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 030 */ if (!isNull2) { > /* 031 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), > value2.getBaseOffset(), value2.numBytes(), value1); > /* 032 */ } > /* 033 */ > /* 034 */ > /* 035 */ int value4 = 42; > /* 036 */ > /* 037 */ boolean isNull5 = i.isNullAt(1); > /* 038 */ UTF8String value5 = isNull5 ? null : (i.getUTF8String(1)); > /* 039 */ if (!isNull5) { > /* 040 */ value4 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), > value5.getBaseOffset(), value5.numBytes(), value4); > /* 041 */ } > /* 042 */ > /* 043 */ int value3 = -1; > /* 044 */ > /* 045 */ int r = value4 % 10; > /* 046 */ if (r < 0) { > /* 047 */ value3 = (r + 10) % 10; > /* 048 */ } else { > /* 049 */ value3 = r; > /* 050 */ } > /* 051 */ value1 = > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); > /* 052 */ > /* 053 */ int value = -1; > /* 054 */ > /* 055 */ int r = value1 % 200; > /* 056 */ if (r < 0) { > /* 057 */ value = (r + 200) % 200; > /* 058 */ } else { > /* 059 */ value = r; > /* 060 */ } > /* 061 */ rowWriter.write(0, value); > /* 062 */ return result; > /* 063 */ } > /* 064 */ } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16488) Codegen variable namespace collision for pmod and partitionBy
Sameer Agarwal created SPARK-16488: -- Summary: Codegen variable namespace collision for pmod and partitionBy Key: SPARK-16488 URL: https://issues.apache.org/jira/browse/SPARK-16488 Project: Spark Issue Type: Bug Components: SQL Reporter: Sameer Agarwal Reported by [~brkyvz]. Original description below: The generated code used by `pmod` conflicts with DataFrameWriter.partitionBy Quick repro: {code} import org.apache.spark.sql.functions._ case class Test(a: Int, b: String) val ds = Seq(Test(0, "a"), Test(1, "b"), Test(1, "a")).toDS.createOrReplaceTempView("test") sql(""" select * from test distribute by pmod(a, 2) """) .write .partitionBy("b") .mode("overwrite") .parquet("/tmp/repro") {code} You may also use repartition with the function `pmod` instead of using `pmod` inside `distribute by` in sql. Example generated code (two variables defined as r): {code} /* 025 */ public UnsafeRow apply(InternalRow i) { /* 026 */ int value1 = 42; /* 027 */ /* 028 */ boolean isNull2 = i.isNullAt(0); /* 029 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); /* 030 */ if (!isNull2) { /* 031 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), value2.getBaseOffset(), value2.numBytes(), value1); /* 032 */ } /* 033 */ /* 034 */ /* 035 */ int value4 = 42; /* 036 */ /* 037 */ boolean isNull5 = i.isNullAt(1); /* 038 */ UTF8String value5 = isNull5 ? null : (i.getUTF8String(1)); /* 039 */ if (!isNull5) { /* 040 */ value4 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), value5.getBaseOffset(), value5.numBytes(), value4); /* 041 */ } /* 042 */ /* 043 */ int value3 = -1; /* 044 */ /* 045 */ int r = value4 % 10; /* 046 */ if (r < 0) { /* 047 */ value3 = (r + 10) % 10; /* 048 */ } else { /* 049 */ value3 = r; /* 050 */ } /* 051 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); /* 052 */ /* 053 */ int value = -1; /* 054 */ /* 055 */ int r = value1 % 200; /* 056 */ if (r < 0) { /* 057 */ value = (r + 200) % 200; /* 058 */ } else { /* 059 */ value = r; /* 060 */ } /* 061 */ rowWriter.write(0, value); /* 062 */ return result; /* 063 */ } /* 064 */ } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
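For completeness, here is a minimal sketch of the alternative repro path the description mentions (calling {{repartition}} with {{pmod}} via the DataFrame API instead of {{distribute by}} in SQL). It assumes the spark-shell implicits are in scope, reuses the {{Test}} case class from the repro above, and writes to an arbitrary placeholder path.
{code}
import org.apache.spark.sql.functions._

case class Test(a: Int, b: String)
val ds = Seq(Test(0, "a"), Test(1, "b"), Test(1, "a")).toDS

// Partition by pmod(a, 2) via the DataFrame API; the write path that triggers
// the partitionBy hash is otherwise the same as in the SQL repro above.
ds.repartition(pmod(col("a"), lit(2)))
  .write
  .partitionBy("b")
  .mode("overwrite")
  .parquet("/tmp/repro_repartition")
{code}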
[jira] [Updated] (SPARK-16487) Some batches might not get marked as fully processed in JobGenerator
[ https://issues.apache.org/jira/browse/SPARK-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Mahran updated SPARK-16487: - Description: In JobGenerator, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition {code}(time - zeroTime) is multiple of checkpoint duration?{code} always evaluates to true as the checkpoint duration is always set to be equal to the batch duration. !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=700! [Image URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] was: In JobGenerator, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition {code}(time - zeroTime) is multiple of checkpoint duration?{code} always evaluates to true as the checkpoint duration is always set to be equal to the batch duration. !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png! [Image URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] > Some batches might not get marked as fully processed in JobGenerator > > > Key: SPARK-16487 > URL: https://issues.apache.org/jira/browse/SPARK-16487 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ahmed Mahran >Priority: Trivial > > In JobGenerator, the code reads like that some batches might not get marked > as fully processed. In the following flowchart, the batch should get marked > fully processed before endpoint C however it is not. Currently, this does not > actually cause an issue, as the condition {code}(time - zeroTime) is multiple > of checkpoint duration?{code} always evaluates to true as the checkpoint > duration is always set to be equal to the batch duration. > !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=700! > [Image > URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16487) Some batches might not get marked as fully processed in JobGenerator
[ https://issues.apache.org/jira/browse/SPARK-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Mahran updated SPARK-16487: - Description: In JobGenerator, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition {code}(time - zeroTime) is multiple of checkpoint duration?{code} always evaluates to true as the checkpoint duration is always set to be equal to the batch duration. !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png! [Image URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] was: In JobGenerator, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition {code}(time - zeroTime) is multiple of checkpoint duration?{code} always evaluates to true as the checkpoint duration is always set to be equal to the batch duration. !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=800! [Image URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] > Some batches might not get marked as fully processed in JobGenerator > > > Key: SPARK-16487 > URL: https://issues.apache.org/jira/browse/SPARK-16487 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ahmed Mahran >Priority: Trivial > > In JobGenerator, the code reads like that some batches might not get marked > as fully processed. In the following flowchart, the batch should get marked > fully processed before endpoint C however it is not. Currently, this does not > actually cause an issue, as the condition {code}(time - zeroTime) is multiple > of checkpoint duration?{code} always evaluates to true as the checkpoint > duration is always set to be equal to the batch duration. > !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png! > [Image > URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16487) Some batches might not get marked as fully processed in JobGenerator
[ https://issues.apache.org/jira/browse/SPARK-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Mahran updated SPARK-16487: - Description: In JobGenerator, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition {code}(time - zeroTime) is multiple of checkpoint duration?{code} always evaluates to true as the checkpoint duration is always set to be equal to the batch duration. !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=800! [Image URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] was: In JobGenerator, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition {code}(time - zeroTime) is multiple of checkpoint duration?{code} always evaluates to true as the checkpoint duration is always set to be equal to the batch duration. !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png! > Some batches might not get marked as fully processed in JobGenerator > > > Key: SPARK-16487 > URL: https://issues.apache.org/jira/browse/SPARK-16487 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ahmed Mahran >Priority: Trivial > > In JobGenerator, the code reads like that some batches might not get marked > as fully processed. In the following flowchart, the batch should get marked > fully processed before endpoint C however it is not. Currently, this does not > actually cause an issue, as the condition {code}(time - zeroTime) is multiple > of checkpoint duration?{code} always evaluates to true as the checkpoint > duration is always set to be equal to the batch duration. > !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png|width=800! > [Image > URL|https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16435) Behavior changes if initialExecutor is less than minExecutor for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-16435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371975#comment-15371975 ] Saisai Shao commented on SPARK-16435: - OK, I will file a small patch to add a warning log about this invalid configuration. > Behavior changes if initialExecutor is less than minExecutor for dynamic > allocation > --- > > Key: SPARK-16435 > URL: https://issues.apache.org/jira/browse/SPARK-16435 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Priority: Minor > > After SPARK-13723, the behavior changed for the situation where > {{spark.dynamicAllocation.initialExecutors}} is less than > {{spark.dynamicAllocation.minExecutors}}. > initialExecutors < minExecutors is an invalid setting. > h4. Before SPARK-13723 > If initialExecutors < minExecutors, Spark throws an exception: > {code} > java.lang.IllegalArgumentException: requirement failed: initial executor > number xxx must between min executor number xxx and max executor number xxx > {code} > This clearly lets the user know that the current configuration is invalid. > h4. After SPARK-13723 > Because we also consider {{spark.executor.instances}}, the initial number > is the maximum of minExecutors, initialExecutors, and numExecutors. > This silently ignores the situation where initialExecutors < minExecutors. > So at least we should add a warning log to let the user know this is an > invalid configuration. > What do you think [~tgraves], [~rdblue]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
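To make the suggestion concrete, here is a hedged sketch of how the initial target could be computed while still warning when initialExecutors < minExecutors. The helper below is illustrative only, not the actual code path in Spark.
{code}
// Illustrative helper (not actual Spark code): pick the initial executor target
// as the max of the three settings, but warn about the invalid combination.
def initialExecutorTarget(minExecutors: Int, initialExecutors: Int, numExecutors: Int,
                          logWarning: String => Unit): Int = {
  if (initialExecutors < minExecutors) {
    logWarning(s"spark.dynamicAllocation.initialExecutors ($initialExecutors) is less than " +
      s"spark.dynamicAllocation.minExecutors ($minExecutors); the larger value will be used.")
  }
  Seq(minExecutors, initialExecutors, numExecutors).max
}
{code}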
[jira] [Created] (SPARK-16487) Some batches might not get marked as fully processed in JobGenerator
Ahmed Mahran created SPARK-16487: Summary: Some batches might not get marked as fully processed in JobGenerator Key: SPARK-16487 URL: https://issues.apache.org/jira/browse/SPARK-16487 Project: Spark Issue Type: Bug Components: Streaming Reporter: Ahmed Mahran Priority: Trivial In JobGenerator, the code reads like that some batches might not get marked as fully processed. In the following flowchart, the batch should get marked fully processed before endpoint C however it is not. Currently, this does not actually cause an issue, as the condition {code}(time - zeroTime) is multiple of checkpoint duration?{code} always evaluates to true as the checkpoint duration is always set to be equal to the batch duration. !https://s31.postimg.org/udy9lti2j/spark_streaming_job_generator.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
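Since the checkpoint condition is central to the flowchart, here is a hedged, simplified sketch of that check (plain Longs in milliseconds, not the actual Time/Duration classes used in JobGenerator). With checkpointDuration equal to batchDuration, every batch time passes the check, which is why the issue is currently benign.
{code}
// Simplified stand-in for the "(time - zeroTime) is multiple of checkpoint duration?" check.
def isMultipleOfCheckpointDuration(timeMs: Long, zeroTimeMs: Long, checkpointMs: Long): Boolean =
  (timeMs - zeroTimeMs) % checkpointMs == 0

// e.g. zeroTime = 0 and batchDuration = checkpointDuration = 1000 ms:
// every batch time (1000, 2000, 3000, ...) satisfies the condition.
assert(isMultipleOfCheckpointDuration(3000L, 0L, 1000L))
{code}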
[jira] [Commented] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371952#comment-15371952 ] Takeshi Yamamuro commented on SPARK-15816: -- okay, I first take time on this prototype. thanks. > SQL server based on Postgres protocol > - > > Key: SPARK-15816 > URL: https://issues.apache.org/jira/browse/SPARK-15816 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > At Spark Summit today this idea came up from a discussion: it would be great > to investigate the possibility of implementing a new SQL server using > Postgres' protocol, in lieu of Hive ThriftServer 2. I'm creating this ticket > to track this idea, in case others have feedback. > This server can have a simpler architecture, and allows users to leverage a > wide range of tools that are already available for Postgres (and many > commercial database systems based on Postgres). > Some of the problems we'd need to figure out are: > 1. What is the Postgres protocol? Is there an official documentation for it? > 2. How difficult would it be to implement that protocol in Spark (JVM in > particular). > 3. How does data type mapping work? > 4. How does system commands work? Would Spark need to support all of > Postgres' commands? > 5. Any restrictions in supporting nested data? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-15816: - Comment: was deleted (was: okay, I first take time on this prototype. thanks.) > SQL server based on Postgres protocol > - > > Key: SPARK-15816 > URL: https://issues.apache.org/jira/browse/SPARK-15816 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > At Spark Summit today this idea came up from a discussion: it would be great > to investigate the possibility of implementing a new SQL server using > Postgres' protocol, in lieu of Hive ThriftServer 2. I'm creating this ticket > to track this idea, in case others have feedback. > This server can have a simpler architecture, and allows users to leverage a > wide range of tools that are already available for Postgres (and many > commercial database systems based on Postgres). > Some of the problems we'd need to figure out are: > 1. What is the Postgres protocol? Is there an official documentation for it? > 2. How difficult would it be to implement that protocol in Spark (JVM in > particular). > 3. How does data type mapping work? > 4. How does system commands work? Would Spark need to support all of > Postgres' commands? > 5. Any restrictions in supporting nested data? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15816) SQL server based on Postgres protocol
[ https://issues.apache.org/jira/browse/SPARK-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371950#comment-15371950 ] Takeshi Yamamuro commented on SPARK-15816: -- okay, I first take time on this prototype. thanks. > SQL server based on Postgres protocol > - > > Key: SPARK-15816 > URL: https://issues.apache.org/jira/browse/SPARK-15816 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > At Spark Summit today this idea came up from a discussion: it would be great > to investigate the possibility of implementing a new SQL server using > Postgres' protocol, in lieu of Hive ThriftServer 2. I'm creating this ticket > to track this idea, in case others have feedback. > This server can have a simpler architecture, and allows users to leverage a > wide range of tools that are already available for Postgres (and many > commercial database systems based on Postgres). > Some of the problems we'd need to figure out are: > 1. What is the Postgres protocol? Is there an official documentation for it? > 2. How difficult would it be to implement that protocol in Spark (JVM in > particular). > 3. How does data type mapping work? > 4. How does system commands work? Would Spark need to support all of > Postgres' commands? > 5. Any restrictions in supporting nested data? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16430) Add an option in file stream source to read 1 file at a time
[ https://issues.apache.org/jira/browse/SPARK-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371887#comment-15371887 ] Apache Spark commented on SPARK-16430: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/14143 > Add an option in file stream source to read 1 file at a time > > > Key: SPARK-16430 > URL: https://issues.apache.org/jira/browse/SPARK-16430 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > An option that limits the file stream source to read 1 file at a time enables > rate limiting. It has the additional convenience that a static set of files > can be used like a stream for testing as this will allows those files to be > considered one at a time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
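As a hedged sketch of how such an option would be used from the file stream source (the option name {{maxFilesPerTrigger}} below reflects what the implementation ended up exposing, but treat the exact name here as an assumption; the input path is a placeholder):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Treat a static directory of files as a stream, consuming one file per trigger.
val lines = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "1")   // assumed option name
  .load("/tmp/static-input")

val query = lines.writeStream
  .format("console")
  .start()
{code}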
[jira] [Commented] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371860#comment-15371860 ] Joseph K. Bradley commented on SPARK-14812: --- I'll start on a PR, but please let me know if you have suggestions. > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14087) PySpark ML JavaModel does not properly own params after being fit
[ https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-14087. -- Resolution: Resolved Fix Version/s: 2.0.0 This is no longer an issue as the PySpark wrapper class {{JavaModel}} calls {{_resetUid}} to brute force update all UIDs in the model to that of the Java Object. This is slightly different than how the Scala side works by overriding the UID value on construction. I think it would be better to mimic that, but I'll close this since it's working now. > PySpark ML JavaModel does not properly own params after being fit > - > > Key: SPARK-14087 > URL: https://issues.apache.org/jira/browse/SPARK-14087 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.0.0 > > Attachments: feature.py > > > When a PySpark model is created after fitting data, its UID is initialized to > the parent estimator's value. Before this assignment, any params defined in > the model are copied from the object to the class in > {{Params._copy_params()}} and assigned a different parent UID. This causes > PySpark to think the params are not owned by the model and can lead to a > {{ValueError}} raised from {{Params._shouldOwn()}}, such as: > {noformat} > ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', > name='outputCol', doc='output column name.') does not belong to > CountVectorizer_4c8e9fd539542d783e66. > {noformat} > I encountered this problem while working on SPARK-13967 where I tried to add > the shared params {{HasInputCol}} and {{HasOutputCol}} to > {{CountVectorizerModel}}. See the attached file feature.py for the WIP. > Using the modified 'feature.py', this sample code shows the mixup in UIDs and > produces the error above. > {noformat} > sc = SparkContext(appName="count_vec_test") > sqlContext = SQLContext(sc) > df = sqlContext.createDataFrame( > [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", > "raw"]) > cv = CountVectorizer(inputCol="raw", outputCol="vectors") > model = cv.fit(df) > print(model.uid) > for p in model.params: > print(str(p)) > model.transform(df).show(truncate=False) > {noformat} > output (the UIDs should match): > {noformat} > CountVectorizer_4c8e9fd539542d783e66 > CountVectorizerModel_4336a81ba742b2593fef__binary > CountVectorizerModel_4336a81ba742b2593fef__inputCol > CountVectorizerModel_4336a81ba742b2593fef__outputCol > {noformat} > In the Scala implementation of this, the model overrides the UID value, which > the Params use when they are constructed, so they all end up with the parent > estimator UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16439: Assignee: Apache Spark > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16439: Assignee: (was: Apache Spark) > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371845#comment-15371845 ] Apache Spark commented on SPARK-16439: -- User 'maver1ck' has created a pull request for this issue: https://github.com/apache/spark/pull/14142 > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371841#comment-15371841 ] Joseph K. Bradley edited comment on SPARK-14812 at 7/11/16 10:58 PM: - [~thunterdb] has reviewed the public API for things which should be private. Issues noted in [SPARK-16485] General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Treat Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * While writing and reviewing a PR, we should be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. spark.ml, pyspark.ml * classification ** LogisticRegression*Summary classes remain Experimental * MLWriter, MLReader, MLWritable, MLReadable remain Experimental * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi How does this sound [~mlnick]? was (Author: josephkb): [~thunterdb] has reviewed the public API for things which should be private. Issues noted in [SPARK-16485] General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Treat Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * While writing and reviewing a PR, we should be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. * DeveloperApi annotations are left alone, except where noted. * No changes to sealed types. spark.ml, pyspark.ml * classification ** LogisticRegression*Summary classes remain Experimental * MLWriter, MLReader, MLWritable, MLReadable remain Experimental * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi How does this sound [~mlnick]? > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371841#comment-15371841 ] Joseph K. Bradley edited comment on SPARK-14812 at 7/11/16 10:57 PM: - [~thunterdb] has reviewed the public API for things which should be private. Issues noted in [SPARK-16485] General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Treat Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * While writing and reviewing a PR, we should be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. * DeveloperApi annotations are left alone, except where noted. * No changes to sealed types. spark.ml, pyspark.ml * classification ** LogisticRegression*Summary classes remain Experimental * MLWriter, MLReader, MLWritable, MLReadable remain Experimental * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi How does this sound [~mlnick]? was (Author: josephkb): [~thunterdb] has reviewed the public API for things which should be private. Issues noted in [SPARK-16485] General rules to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Treat Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * While writing and reviewing a PR, we should be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. * DeveloperApi annotations are left alone, except where noted. spark.ml, pyspark.ml * classification ** LogisticRegression*Summary classes remain Experimental * MLWriter, MLReader, MLWritable, MLReadable remain Experimental * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi How does this sound [~mlnick]? > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371841#comment-15371841 ] Joseph K. Bradley commented on SPARK-14812: --- [~thunterdb] has reviewed the public API for things which should be private. Issues noted in [SPARK-16485] General rules to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Treat Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * While writing and reviewing a PR, we should be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. * DeveloperApi annotations are left alone, except where noted. spark.ml, pyspark.ml * classification ** LogisticRegression*Summary classes remain Experimental * MLWriter, MLReader, MLWritable, MLReadable remain Experimental * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi How does this sound [~mlnick]? > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey
[ https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371816#comment-15371816 ] Ben McCann commented on SPARK-6567: --- [~hucheng] can you share your code for this? > Large linear model parallelism via a join and reduceByKey > - > > Key: SPARK-6567 > URL: https://issues.apache.org/jira/browse/SPARK-6567 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Reza Zadeh > Attachments: model-parallelism.pptx > > > To train a linear model, each training point in the training set needs its > dot product computed against the model, per iteration. If the model is large > (too large to fit in memory on a single machine) then SPARK-4590 proposes > using parameter server. > There is an easier way to achieve this without parameter servers. In > particular, if the data is held as a BlockMatrix and the model as an RDD, > then each block can be joined with the relevant part of the model, followed > by a reduceByKey to compute the dot products. > This obviates the need for a parameter server, at least for linear models. > However, it's unclear how it compares performance-wise to parameter servers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
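Since the description is essentially an algorithm sketch, here is a hedged illustration of the join-plus-reduceByKey dot product it describes (the schematic types are my own choosing, not code from the attached slides):
{code}
import org.apache.spark.rdd.RDD

// Per-slice dot product.
def dot(x: Array[Double], w: Array[Double]): Double =
  x.zip(w).map { case (xi, wi) => xi * wi }.sum

// data:  (colBlockId, (rowId, featureSlice))  -- feature blocks of the training points
// model: (colBlockId, modelSlice)             -- the matching slice of the large model
// Join each block with its model slice, then sum the partial dot products per row.
def blockDot(data: RDD[(Int, (Long, Array[Double]))],
             model: RDD[(Int, Array[Double])]): RDD[(Long, Double)] =
  data.join(model)
    .map { case (_, ((rowId, xSlice), wSlice)) => (rowId, dot(xSlice, wSlice)) }
    .reduceByKey(_ + _)
{code}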
[jira] [Commented] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371804#comment-15371804 ] Joseph K. Bradley commented on SPARK-14812: --- First, I'll comment that I think we can remove Experimental from anything in spark.mllib and pyspark.mllib. We can leave DeveloperApi annotation there, though. I'll just check spark.ml and pyspark.ml. > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371791#comment-15371791 ] Joseph K. Bradley commented on SPARK-14812: --- I'll make a pass over the docs now to review what might be graduated. > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai >Priority: Blocker > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371780#comment-15371780 ] Maciej Bryński commented on SPARK-16439: I found that the problem is locale-dependent. The \u00A0 character is added by the Java NumberFormat class. It can be avoided by calling NumberFormat.setGroupingUsed(false). > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
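A small, self-contained illustration of the locale behavior described above (the French locale is just one example of a locale whose grouping separator is a non-breaking space):
{code}
import java.text.NumberFormat
import java.util.Locale

val nf = NumberFormat.getInstance(Locale.FRANCE)
println(nf.format(1234567L))   // grouping separator is \u00A0 in this locale

nf.setGroupingUsed(false)      // disables grouping, so no locale-dependent separator
println(nf.format(1234567L))   // "1234567"
{code}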
[jira] [Updated] (SPARK-16349) IsolatedClientLoader ignores needed Hadoop classes not present in Spark's loader
[ https://issues.apache.org/jira/browse/SPARK-16349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16349: - Assignee: Marcelo Vanzin > IsolatedClientLoader ignores needed Hadoop classes not present in Spark's > loader > > > Key: SPARK-16349 > URL: https://issues.apache.org/jira/browse/SPARK-16349 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.1.0 > > > While trying to use a custom classpath for metastore jars > (spark.sql.hive.metastore.jars pointing at some filesystem path), I ran into > the following issue: > {noformat} > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > org/apache/hadoop/mapred/MRVersion when creating Hive client using classpath > {noformat} > The issue here is that {{MRVersion}} is not packaged anywhere with Spark, and > the code in {{IsolatedClientLoader}} only ever tries the parent class loader > when loading hadoop classes in this configuration. So even though I had the > class in the list of files in {{spark.sql.hive.metastore.jars}}, Spark never > tries to load it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16349) IsolatedClientLoader ignores needed Hadoop classes not present in Spark's loader
[ https://issues.apache.org/jira/browse/SPARK-16349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16349. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14020 [https://github.com/apache/spark/pull/14020] > IsolatedClientLoader ignores needed Hadoop classes not present in Spark's > loader > > > Key: SPARK-16349 > URL: https://issues.apache.org/jira/browse/SPARK-16349 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Fix For: 2.1.0 > > > While trying to use a custom classpath for metastore jars > (spark.sql.hive.metastore.jars pointing at some filesystem path), I ran into > the following issue: > {noformat} > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > org/apache/hadoop/mapred/MRVersion when creating Hive client using classpath > {noformat} > The issue here is that {{MRVersion}} is not packaged anywhere with Spark, and > the code in {{IsolatedClientLoader}} only ever tries the parent class loader > when loading hadoop classes in this configuration. So even though I had the > class in the list of files in {{spark.sql.hive.metastore.jars}}, Spark never > tries to load it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
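For context, a hedged example of the configuration that exercises this code path; the metastore version and jar paths below are placeholders, not values from the report:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.2.1")
  // A colon-separated classpath pointing at the metastore (and, per this issue,
  // the Hadoop) jars, instead of the built-in Hive client classes.
  .config("spark.sql.hive.metastore.jars",
    "/opt/hive/lib/*:/opt/hadoop/share/hadoop/mapreduce/*")
  .getOrCreate()
{code}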
[jira] [Assigned] (SPARK-16375) [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the variable numSkippedTasks
[ https://issues.apache.org/jira/browse/SPARK-16375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16375: Assignee: (was: Apache Spark) > [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the > variable numSkippedTasks > --- > > Key: SPARK-16375 > URL: https://issues.apache.org/jira/browse/SPARK-16375 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu > Attachments: numSkippedTasksWrongValue.png > > > [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the > variable numSkippedTasks > See attachment for reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16375) [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the variable numSkippedTasks
[ https://issues.apache.org/jira/browse/SPARK-16375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371734#comment-15371734 ] Apache Spark commented on SPARK-16375: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/14141 > [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the > variable numSkippedTasks > --- > > Key: SPARK-16375 > URL: https://issues.apache.org/jira/browse/SPARK-16375 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu > Attachments: numSkippedTasksWrongValue.png > > > [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the > variable numSkippedTasks > See attachment for reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16375) [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the variable numSkippedTasks
[ https://issues.apache.org/jira/browse/SPARK-16375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16375: Assignee: Apache Spark > [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the > variable numSkippedTasks > --- > > Key: SPARK-16375 > URL: https://issues.apache.org/jira/browse/SPARK-16375 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: marymwu >Assignee: Apache Spark > Attachments: numSkippedTasksWrongValue.png > > > [Spark web UI]:The wrong value(numCompletedTasks) has been assigned to the > variable numSkippedTasks > See attachment for reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16486) Python API parity issues from 2.0 QA
[ https://issues.apache.org/jira/browse/SPARK-16486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16486: -- Target Version/s: (was: 2.0.0) > Python API parity issues from 2.0 QA > > > Key: SPARK-16486 > URL: https://issues.apache.org/jira/browse/SPARK-16486 > Project: Spark > Issue Type: Umbrella > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Blocker > > This is an umbrella for Python API parity issues for MLlib found during 2.0 > QA [SPARK-14813] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16486) Python API parity issues from 2.0 QA
[ https://issues.apache.org/jira/browse/SPARK-16486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16486: -- Priority: Major (was: Blocker) > Python API parity issues from 2.0 QA > > > Key: SPARK-16486 > URL: https://issues.apache.org/jira/browse/SPARK-16486 > Project: Spark > Issue Type: Umbrella > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk > > This is an umbrella for Python API parity issues for MLlib found during 2.0 > QA [SPARK-14813] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14813. --- Resolution: Done Fix Version/s: 2.0.0 Closing. All open tasks have been copied to [SPARK-16486] Thanks [~holdenk] and others! > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Blocker > Fix For: 2.0.0 > > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below as "requires") for this list of > to-do items. > ** *NOTE: These missing features should be added in the next release. This > work is just to generate a list of to-do items for the future.* > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15630) 2.0 python coverage ml root module
[ https://issues.apache.org/jira/browse/SPARK-15630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-15630. - Resolution: Won't Fix > 2.0 python coverage ml root module > -- > > Key: SPARK-15630 > URL: https://issues.apache.org/jira/browse/SPARK-15630 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk > > Audit the root pipeline components in PySpark ML for API compatibility. See > parent SPARK-14813 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15630) 2.0 python coverage ml root module
[ https://issues.apache.org/jira/browse/SPARK-15630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15630: -- Priority: Major (was: Blocker) > 2.0 python coverage ml root module > -- > > Key: SPARK-15630 > URL: https://issues.apache.org/jira/browse/SPARK-15630 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk > > Audit the root pipeline components in PySpark ML for API compatibility. See > parent SPARK-14813 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15630) 2.0 python coverage ml root module
[ https://issues.apache.org/jira/browse/SPARK-15630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371725#comment-15371725 ] Joseph K. Bradley commented on SPARK-15630: --- I'm going to go ahead and close this. We need to remove blockers from 2.0, and the python check is not really a blocker. > 2.0 python coverage ml root module > -- > > Key: SPARK-15630 > URL: https://issues.apache.org/jira/browse/SPARK-15630 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Blocker > > Audit the root pipeline components in PySpark ML for API compatibility. See > parent SPARK-14813 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15623) 2.0 python coverage ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371721#comment-15371721 ] Joseph K. Bradley commented on SPARK-15623: --- I'll go ahead and close this. Thanks! > 2.0 python coverage ml.feature > -- > > Key: SPARK-15623 > URL: https://issues.apache.org/jira/browse/SPARK-15623 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Blocker > Fix For: 2.0.0 > > > See parent task SPARK-14813. > [~bryanc] did this component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15623) 2.0 python coverage ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-15623. --- Resolution: Done Assignee: Bryan Cutler Fix Version/s: 2.0.0 Target Version/s: (was: 2.0.0) > 2.0 python coverage ml.feature > -- > > Key: SPARK-15623 > URL: https://issues.apache.org/jira/browse/SPARK-15623 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Bryan Cutler >Priority: Blocker > Fix For: 2.0.0 > > > See parent task SPARK-14813. > [~bryanc] did this component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371715#comment-15371715 ] DB Tsai commented on SPARK-3181: I prefer option 1) as well. We can extend it to elastic net later when we know how to do OWLQNB. Thanks. > Add Robust Regression Algorithm with Huber Estimator > > > Key: SPARK-3181 > URL: https://issues.apache.org/jira/browse/SPARK-3181 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Fan Jiang >Assignee: Yanbo Liang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least squares estimates assume the errors have a normal distribution and > can behave badly when the errors are heavy-tailed. In practice we encounter > various types of data, so we need to include robust regression to employ a > fitting criterion that is not as vulnerable as least squares. > In 1973, Huber introduced M-estimation for regression, which stands for > "maximum likelihood type". The method is resistant to outliers in the > response variable and has been widely used. > The new feature for MLlib will contain three new files: > /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala > /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala > /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala > and one new class HuberRobustGradient in > /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
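For reference, the textbook Huber loss that the M-estimator minimizes, with residual r and tuning constant \delta (background only, not code from the proposed files):
{noformat}
L_\delta(r) =
  \begin{cases}
    \tfrac{1}{2} r^2                                & |r| \le \delta \\
    \delta \left( |r| - \tfrac{1}{2} \delta \right) & |r| > \delta
  \end{cases}
{noformat}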
[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371709#comment-15371709 ] Yongjia Wang commented on SPARK-16484: -- Yes, I agree all the building blocks are there and it is easy enough to put together a solution now. What I did is the second approach you mentioned - saving the HLL++ "buffer" as a byte-array column, with a custom UDAF to merge the buffers using a SQL expression. What I was asking is whether it is worth extending Spark SQL to include those extra UDAFs, making this more accessible to regular Spark users. Also, computing the intersection of multiple sets can be tricky; wouldn't it be nice to have that as part of Spark SQL's standard set of functions? > Incremental Cardinality estimation operations with Hyperloglog > -- > > Key: SPARK-16484 > URL: https://issues.apache.org/jira/browse/SPARK-16484 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yongjia Wang > > Efficient cardinality estimation is very important, and SparkSQL has had > approxCountDistinct based on HyperLogLog for quite some time. However, there > isn't a way to do incremental estimation. For example, if we want to get > updated distinct counts of the last 90 days, we need to do the aggregation > for the entire window over and over again. The more efficient way involves > serializing the counter for smaller time windows (such as hourly) so the > counts can be efficiently updated in an incremental fashion for any time > window. > With the support of custom UDAFs, the Binary DataType and the HyperLogLogPlusPlus > implementation in the current Spark version, it's easy enough to extend the > functionality to include incremental counting, and even other general set > operations such as intersection and set difference. The Spark API is already as > elegant as it can be, but it still takes quite some effort to do a custom > implementation of the aforementioned operations, which are supposed to be in > high demand. I have been searching but failed to find a usable existing > solution or any ongoing effort for this. The closest I got is the following, > but it does not work with Spark 1.6 due to API changes. > https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala > I wonder whether it is worth integrating such operations into SparkSQL. The only > problem I see is that it depends on the serialization of a specific HLL implementation > and introduces compatibility issues. But as long as the user is aware of this > issue, it should be fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
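To make the shape of such an extension concrete, here is a minimal sketch of a merge-style UDAF over serialized sketches. {{Hll}} is a hypothetical sketch type (with {{empty}}, {{fromBytes}}, {{merge}}, {{toBytes}}); any concrete HLL++ implementation with serialization, for example the one in the repository linked above, would slot in. This is an illustration, not the Spark-internal HyperLogLogPlusPlus API.
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Merges pre-aggregated (e.g. hourly) HLL sketches stored in a binary column
// into a single sketch covering an arbitrary window.
class HllMergeUDAF extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("sketch", BinaryType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("acc", BinaryType) :: Nil)
  override def dataType: DataType = BinaryType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Hll.empty.toBytes                     // Hll is a hypothetical sketch type
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = Hll.fromBytes(buffer.getAs[Array[Byte]](0))
        .merge(Hll.fromBytes(input.getAs[Array[Byte]](0)))
        .toBytes
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = Hll.fromBytes(buffer1.getAs[Array[Byte]](0))
      .merge(Hll.fromBytes(buffer2.getAs[Array[Byte]](0)))
      .toBytes
  }

  override def evaluate(buffer: Row): Any = buffer.getAs[Array[Byte]](0)
}
{code}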
[jira] [Resolved] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
[ https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-16144. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13993 [https://github.com/apache/spark/pull/13993] > Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict > - > > Key: SPARK-16144 > URL: https://issues.apache.org/jira/browse/SPARK-16144 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Fix For: 2.0.0 > > > After we grouped generic methods by the algorithm, it would be nice to add a > separate Rd for each ML generic method, in particular write.ml, read.ml, > summary, and predict, and link the implementations with seealso. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16426) IsotonicRegression produces NaNs with certain data
[ https://issues.apache.org/jira/browse/SPARK-16426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371664#comment-15371664 ] Apache Spark commented on SPARK-16426: -- User 'neggert' has created a pull request for this issue: https://github.com/apache/spark/pull/14140 > IsotonicRegression produces NaNs with certain data > -- > > Key: SPARK-16426 > URL: https://issues.apache.org/jira/browse/SPARK-16426 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2 >Reporter: Nic Eggert > > {code} > val r = sc.parallelize(Seq[(Double, Double, Double)]((2, 1, 1), (1, 1, 1), > (0, 2, 1), (1, 2, 1), (0.5, 3, 1), (0, 3, 1)), 2) > val i = new IsotonicRegression().run(r) > scala> i.predict(3.0) > res12: Double = NaN > scala> i.predictions > res13: Array[Double] = Array(0.75, 0.75, NaN, NaN) > {code} > I believe I understand the problem so I'll submit a PR shortly. > The problem happens when rows with the same feature value but different > labels end up on different partitions. The merge function in > poolAdjacentViolators introduces 0-weight points to be used for linear > interpolation. This works fine, as long as they are always next to a > non-0-weight point, but in the above case, you can end up with two 0-weight > points with the same feature value, which end up next to each other in the > final PAV step. If these points are pooled, it creates a NaN. > One solution to this is to ensure that the all points with identical feature > values end up on the same partition. This is the solution I intend to submit > a PR for. Another option would be to try to get rid of the 0-weight points, > but that seems trickier to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
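For readers following the explanation above, the idea behind the proposed fix can be sketched roughly as follows; this is only an illustration of the partitioning idea, not the code in the linked pull request, and the helper name is made up.

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Records are (label, feature, weight). Keying by the feature value and
// hash-partitioning guarantees that all rows sharing a feature value land in
// the same partition, so the 0-weight interpolation points introduced by the
// per-partition merge step cannot collide in the final PAV pass.
def groupByFeatureValue(
    input: RDD[(Double, Double, Double)],
    numPartitions: Int): RDD[(Double, Double, Double)] = {
  input
    .keyBy { case (_, feature, _) => feature }
    .partitionBy(new HashPartitioner(numPartitions))
    .values
}
{code}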
[jira] [Assigned] (SPARK-16426) IsotonicRegression produces NaNs with certain data
[ https://issues.apache.org/jira/browse/SPARK-16426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16426: Assignee: Apache Spark > IsotonicRegression produces NaNs with certain data > -- > > Key: SPARK-16426 > URL: https://issues.apache.org/jira/browse/SPARK-16426 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2 >Reporter: Nic Eggert >Assignee: Apache Spark > > {code} > val r = sc.parallelize(Seq[(Double, Double, Double)]((2, 1, 1), (1, 1, 1), > (0, 2, 1), (1, 2, 1), (0.5, 3, 1), (0, 3, 1)), 2) > val i = new IsotonicRegression().run(r) > scala> i.predict(3.0) > res12: Double = NaN > scala> i.predictions > res13: Array[Double] = Array(0.75, 0.75, NaN, NaN) > {code} > I believe I understand the problem so I'll submit a PR shortly. > The problem happens when rows with the same feature value but different > labels end up on different partitions. The merge function in > poolAdjacentViolators introduces 0-weight points to be used for linear > interpolation. This works fine, as long as they are always next to a > non-0-weight point, but in the above case, you can end up with two 0-weight > points with the same feature value, which end up next to each other in the > final PAV step. If these points are pooled, it creates a NaN. > One solution to this is to ensure that the all points with identical feature > values end up on the same partition. This is the solution I intend to submit > a PR for. Another option would be to try to get rid of the 0-weight points, > but that seems trickier to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16426) IsotonicRegression produces NaNs with certain data
[ https://issues.apache.org/jira/browse/SPARK-16426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16426: Assignee: (was: Apache Spark) > IsotonicRegression produces NaNs with certain data > -- > > Key: SPARK-16426 > URL: https://issues.apache.org/jira/browse/SPARK-16426 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2 >Reporter: Nic Eggert > > {code} > val r = sc.parallelize(Seq[(Double, Double, Double)]((2, 1, 1), (1, 1, 1), > (0, 2, 1), (1, 2, 1), (0.5, 3, 1), (0, 3, 1)), 2) > val i = new IsotonicRegression().run(r) > scala> i.predict(3.0) > res12: Double = NaN > scala> i.predictions > res13: Array[Double] = Array(0.75, 0.75, NaN, NaN) > {code} > I believe I understand the problem so I'll submit a PR shortly. > The problem happens when rows with the same feature value but different > labels end up on different partitions. The merge function in > poolAdjacentViolators introduces 0-weight points to be used for linear > interpolation. This works fine, as long as they are always next to a > non-0-weight point, but in the above case, you can end up with two 0-weight > points with the same feature value, which end up next to each other in the > final PAV step. If these points are pooled, it creates a NaN. > One solution to this is to ensure that the all points with identical feature > values end up on the same partition. This is the solution I intend to submit > a PR for. Another option would be to try to get rid of the 0-weight points, > but that seems trickier to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371653#comment-15371653 ] YangyangLiu commented on SPARK-16455: - Oh, we are implementing a new feature in an internal tool. When the cluster is restarting, we let the driver wait until the service is available again. So during the restart, we try to stop scheduling new tasks. > Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling > new tasks when cluster is restarting > > > Key: SPARK-16455 > URL: https://issues.apache.org/jira/browse/SPARK-16455 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Reporter: YangyangLiu >Priority: Minor > > In our case, we are implementing a new mechanism which will let the driver > survive when the cluster is temporarily down and restarting. So when the service > provided by the cluster is not available, the scheduler should stop scheduling new > tasks. I added a hook inside the CoarseGrainedSchedulerBackend class, in order to > avoid scheduling new tasks when necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16482) If a table's schema is inferred at runtime, describe table command does not show the schema
[ https://issues.apache.org/jira/browse/SPARK-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371648#comment-15371648 ] Xiao Li commented on SPARK-16482: - Let me try it. : ) > If a table's schema is inferred at runtime, describe table command does not > show the schema > --- > > Key: SPARK-16482 > URL: https://issues.apache.org/jira/browse/SPARK-16482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Priority: Critical > > If we create a table pointing to a parquet/json datasets without specifying > the schema, describe table command does not show the schema at all. It only > shows {{# Schema of this table is inferred at runtime}}. In 1.6, describe > table does show the schema of such a table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-16483: - Target Version/s: 2.1.0 > Unifying struct fields and columns > -- > > Key: SPARK-16483 > URL: https://issues.apache.org/jira/browse/SPARK-16483 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Simeon Simeonov > Labels: sql > > This issue comes as a result of an exchange with Michael Armbrust outside of > the usual JIRA/dev list channels. > DataFrame provides a full set of manipulation operations for top-level > columns. They can be added, removed, modified and renamed. The same is not > true about fields inside structs yet; from a logical standpoint, Spark users > may very well want to perform the same operations on struct fields, > especially since automatic schema discovery from JSON input tends to create > deeply nested structs. > Common use-cases include: > - Remove and/or rename struct field(s) to adjust the schema > - Fix a data quality issue with a struct field (update/rewrite) > To do this with the existing API by hand requires manually calling > {{named_struct}} and listing all fields, including ones we don't want to > manipulate. This leads to complex, fragile code that cannot survive schema > evolution. > It would be far better if the various APIs that can now manipulate top-level > columns were extended to handle struct fields at arbitrary locations or, > alternatively, if we introduced new APIs for modifying any field in a > dataframe, whether it is a top-level one or one nested inside a struct. > Purely for discussion purposes, here is the skeleton implementation of an > update() implicit that we've used to modify any existing field in a dataframe. > (Note that it depends on various other utilities and implicits that are not > included). https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371633#comment-15371633 ] Herman van Hovell commented on SPARK-16334: --- It would also be great if we can reproduce this. Could share an example? > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16486) Python API parity issues from 2.0 QA
[ https://issues.apache.org/jira/browse/SPARK-16486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16486: -- Description: This is an umbrella for Python API parity issues for MLlib found during 2.0 QA [SPARK-14813] (was: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below as "requires") for this list of to-do items. ** *NOTE: These missing features should be added in the next release. This work is just to generate a list of to-do items for the future.* UPDATE: This only needs to cover spark.ml since spark.mllib is going into maintenance mode.) > Python API parity issues from 2.0 QA > > > Key: SPARK-16486 > URL: https://issues.apache.org/jira/browse/SPARK-16486 > Project: Spark > Issue Type: Umbrella > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Blocker > > This is an umbrella for Python API parity issues for MLlib found during 2.0 > QA [SPARK-14813] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371624#comment-15371624 ] holdenk commented on SPARK-14813: - Yup, auditing is done and once 2.0 is out we will go back in and finish up fixing the issues we found. > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Blocker > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below as "requires") for this list of > to-do items. > ** *NOTE: These missing features should be added in the next release. This > work is just to generate a list of to-do items for the future.* > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16486) Python API parity issues from 2.0 QA
[ https://issues.apache.org/jira/browse/SPARK-16486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16486: -- Issue Type: Umbrella (was: Sub-task) Parent: (was: SPARK-14808) > Python API parity issues from 2.0 QA > > > Key: SPARK-16486 > URL: https://issues.apache.org/jira/browse/SPARK-16486 > Project: Spark > Issue Type: Umbrella > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Blocker > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below as "requires") for this list of > to-do items. > ** *NOTE: These missing features should be added in the next release. This > work is just to generate a list of to-do items for the future.* > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16486) Python API parity issues from 2.0 QA
Joseph K. Bradley created SPARK-16486: - Summary: Python API parity issues from 2.0 QA Key: SPARK-16486 URL: https://issues.apache.org/jira/browse/SPARK-16486 Project: Spark Issue Type: Sub-task Components: Documentation, ML, PySpark Reporter: Joseph K. Bradley Assignee: holdenk Priority: Blocker For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below as "requires") for this list of to-do items. ** *NOTE: These missing features should be added in the next release. This work is just to generate a list of to-do items for the future.* UPDATE: This only needs to cover spark.ml since spark.mllib is going into maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16385) NoSuchMethodException thrown by Utils.waitForProcess
[ https://issues.apache.org/jira/browse/SPARK-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16385: -- Fix Version/s: 1.6.3 > NoSuchMethodException thrown by Utils.waitForProcess > > > Key: SPARK-16385 > URL: https://issues.apache.org/jira/browse/SPARK-16385 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.6.3, 2.0.0 > > > The code in Utils.waitForProcess catches the wrong exception: when using > reflection, {{NoSuchMethodException}} is thrown, but the code catches > {{NoSuchMethodError}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
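As background on the fix, a failed reflective lookup raises the checked NoSuchMethodException, while NoSuchMethodError is a LinkageError thrown for direct calls compiled against a method that no longer exists, so the original catch clause could never fire. A rough illustration (not the actual Spark code) of the kind of probe involved:

{code}
// Probe for Java 8's Process.waitFor(long, TimeUnit) via reflection.
try {
  classOf[Process].getMethod("waitFor", classOf[Long], classOf[java.util.concurrent.TimeUnit])
} catch {
  case _: NoSuchMethodException => // what reflection actually throws when the overload is missing
  case _: NoSuchMethodError => // never thrown by getMethod, so catching only this misses the failure
}
{code}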
[jira] [Commented] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371615#comment-15371615 ] Joseph K. Bradley commented on SPARK-14813: --- Yes, this was supposed to be an audit JIRA, with the missing APIs noted for targeting to future releases. I'm going to venture that auditing is done, but [~holdenk] please let me know if not. I'll copy remaining tasks to a new non-2.0 JIRA and then close this once done. > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk >Priority: Blocker > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below as "requires") for this list of > to-do items. > ** *NOTE: These missing features should be added in the next release. This > work is just to generate a list of to-do items for the future.* > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371606#comment-15371606 ] Herman van Hovell commented on SPARK-16334: --- Could you try to disable the vectorized parquet reader. You can do this issuing the following SQL statement {{SET spark.sql.parquet.enableVectorizedReader = FALSE}}, or by issuing {{spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")}} in the REPL. > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
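For anyone hitting the same stack trace, the two equivalent forms of the workaround suggested in the comment above look like this in a 2.0 spark-shell; note that this only bypasses the vectorized reader, it does not fix the underlying bug.

{code}
// Disable the vectorized Parquet reader via SQL ...
spark.sql("SET spark.sql.parquet.enableVectorizedReader=false")
// ... or via the runtime configuration API.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
{code}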
[jira] [Updated] (SPARK-16403) Example cleanup and fix minor issues
[ https://issues.apache.org/jira/browse/SPARK-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-16403: - Description: General cleanup of examples, focused on PySpark ML, to remove unused imports, sync with Scala examples, improve consistency and fix minor issues such as arg checks etc. * consistent appNames, most are camel case * fix formatting, add newlines if difficult to read - many examples are just solid blocks of code * should use __future__ print function * simple_text_classification_pipeline is a duplicate of pipeline_example * simple_params_example is a duplicate of estimator_transformer_param_example * some spelling errors was: General cleanup of examples, focused on PySpark ML, to remove unused imports, sync with Scala examples, improve consistency and fix minor issues such as arg checks etc. * consistent appNames, most are camel case * fix formatting, add newlines if difficult to read - many examples are just solid blocks of code * should use __future__ print function * pipeline_example is a duplicate of simple_text_classification_pipeline * some spelling errors > Example cleanup and fix minor issues > > > Key: SPARK-16403 > URL: https://issues.apache.org/jira/browse/SPARK-16403 > Project: Spark > Issue Type: Sub-task > Components: Examples, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > General cleanup of examples, focused on PySpark ML, to remove unused imports, > sync with Scala examples, improve consistency and fix minor issues such as > arg checks etc. > * consistent appNames, most are camel case > * fix formatting, add newlines if difficult to read - many examples are just > solid blocks of code > * should use __future__ print function > * simple_text_classification_pipeline is a duplicate of pipeline_example > * simple_params_example is a duplicate of estimator_transformer_param_example > * some spelling errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16458) SessionCatalog should support `listColumns` for temporary tables
[ https://issues.apache.org/jira/browse/SPARK-16458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-16458. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > SessionCatalog should support `listColumns` for temporary tables > > > Key: SPARK-16458 > URL: https://issues.apache.org/jira/browse/SPARK-16458 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > Temporary tables are used frequently, but `spark.catalog.listColumns` does > not support those tables. > {code} > scala> spark.range(10).createOrReplaceTempView("t1") > scala> spark.catalog.listTables().collect() > res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name='t1', > tableType='TEMPORARY', isTemporary='true']) > scala> spark.catalog.listColumns("t1").collect() > org.apache.spark.sql.AnalysisException: Table 't1' does not exist in database > 'default'.; > {code} > This issue makes `SessionCatalog` support temporary table column listing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16240: -- Assignee: Gayathri Murali > model loading backward compatibility for ml.clustering.LDA > -- > > Key: SPARK-16240 > URL: https://issues.apache.org/jira/browse/SPARK-16240 > Project: Spark > Issue Type: Bug >Reporter: yuhao yang >Assignee: Gayathri Murali > > After resolving the matrix conversion issue, the LDA model still cannot load 1.6 > models because one of the parameter names was changed. > https://github.com/apache/spark/pull/12065 > We can perhaps add some special logic in the loading code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog
[ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371571#comment-15371571 ] Herman van Hovell commented on SPARK-16484: --- I do think much of the machinery is already in place. The dense HyperLogLog++ implementation supports merging and partial results. There are a few things to consider: - Structured streaming can play a role in what you want to do (if you are only doing time-based aggregates, of course). - You could export the registers (after partial aggregation) and merge them on the fly. That could be done using a few HLL++-like aggregate expressions. - Datasets in combination with Aggregators might also work. > Incremental Cardinality estimation operations with Hyperloglog > -- > > Key: SPARK-16484 > URL: https://issues.apache.org/jira/browse/SPARK-16484 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yongjia Wang > > Efficient cardinality estimation is very important, and SparkSQL has had > approxCountDistinct based on Hyperloglog for quite some time. However, there > isn't a way to do incremental estimation. For example, if we want to get > updated distinct counts of the last 90 days, we need to do the aggregation > for the entire window over and over again. The more efficient way involves > serializing the counter for smaller time windows (such as hourly) so the > counts can be efficiently updated in an incremental fashion for any time > window. > With the support of custom UDAFs, the Binary DataType and the HyperloglogPlusPlus > implementation in the current Spark version, it's easy enough to extend the > functionality to include incremental counting, and even other general set > operations such as intersection and set difference. The Spark API is already as > elegant as it can be, but it still takes quite some effort to do a custom > implementation of the aforementioned operations, which are supposed to be in > high demand. I have been searching but failed to find a usable existing > solution or any ongoing effort for this. The closest I got is the following, > but it does not work with Spark 1.6 due to API changes. > https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala > I wonder if it is worth integrating such operations into SparkSQL. The only > problem I see is that it depends on the serialization of a specific HLL implementation > and introduces compatibility issues. But as long as the user is aware of such an > issue, it should be fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371572#comment-15371572 ] Vladimir Ivanov commented on SPARK-16334: - I believe it relates to the following change in Spark 2.0: "For example, we have implemented a new vectorized Parquet reader that does decompression and decoding in column batches. When decoding integer columns (on disk), this new reader is roughly 9 times faster than the non-vectorized one" taken from this link: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html Corresponding JIRA tickets: https://issues.apache.org/jira/browse/SPARK-12854 https://issues.apache.org/jira/browse/SPARK-14008 > [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Critical > Labels: sql > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Work on 1.6.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16482) If a table's schema is inferred at runtime, describe table command does not show the schema
[ https://issues.apache.org/jira/browse/SPARK-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371564#comment-15371564 ] Yin Huai commented on SPARK-16482: -- Yea. Thanks! Seems we can just use lookupRelation to get the schema. > If a table's schema is inferred at runtime, describe table command does not > show the schema > --- > > Key: SPARK-16482 > URL: https://issues.apache.org/jira/browse/SPARK-16482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Priority: Critical > > If we create a table pointing to a parquet/json datasets without specifying > the schema, describe table command does not show the schema at all. It only > shows {{# Schema of this table is inferred at runtime}}. In 1.6, describe > table does show the schema of such a table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
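To make the lookupRelation suggestion concrete, inside the describe-command code path in sql/core (where the session state is accessible) the idea would look roughly like the sketch below; the helper name is made up and this is not the actual patch.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.types.StructType

// Resolve the relation through the session catalog and read the schema from
// the resolved plan, instead of relying on the schema stored in the catalog
// metadata (which is empty when it is inferred at runtime).
def runtimeSchemaOf(sparkSession: SparkSession, table: String): StructType =
  sparkSession.sessionState.catalog.lookupRelation(TableIdentifier(table)).schema
{code}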
[jira] [Commented] (SPARK-16450) Build fails for Mesos 0.28.x
[ https://issues.apache.org/jira/browse/SPARK-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371562#comment-15371562 ] Michael Gummelt commented on SPARK-16450: - Once Mesos 1.0 is released, I'll submit a PR to upgrade. The long-term solution is to use the HTTP API, so we no longer have to deal with libmesos, but that's a large change. > Build fails for Mesos 0.28.x > - > > Key: SPARK-16450 > URL: https://issues.apache.org/jira/browse/SPARK-16450 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0 > Environment: Mesos 0.28.0 >Reporter: Niels Becker > > Build fails: > [error] > /usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82: > type mismatch; > [error] found : org.apache.mesos.protobuf.ByteString > [error] required: String > [error] credBuilder.setSecret(ByteString.copyFromUtf8(secret)) > Build cmd: > dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive > -DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8 > Spark Version: 2.0.0-rc2 > Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14 > Scala Version: 2.11.8 > Same error for mesos.version=0.28.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16482) If a table's schema is inferred at runtime, describe table command does not show the schema
[ https://issues.apache.org/jira/browse/SPARK-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371554#comment-15371554 ] Xiao Li commented on SPARK-16482: - Do you want me to do it? > If a table's schema is inferred at runtime, describe table command does not > show the schema > --- > > Key: SPARK-16482 > URL: https://issues.apache.org/jira/browse/SPARK-16482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Priority: Critical > > If we create a table pointing to a parquet/json datasets without specifying > the schema, describe table command does not show the schema at all. It only > shows {{# Schema of this table is inferred at runtime}}. In 1.6, describe > table does show the schema of such a table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
[ https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-16485: --- Description: While reviewing the documentation of MLlib, I found some additional issues. Important issues that affect the binary signatures: - GBTClassificationModel: all the setters should be overriden - LogisticRegressionModel: setThreshold(s) - RandomForestClassificationModel: all the setters should be overriden - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but most of the methods are private[ml] -> do we need to expose this class for now? - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed - sqlDataTypes: name does not follow conventions. Do we need to expose it? Issues that involve only documentation: - Evaluator: 1. inconsistent doc between evaluate and isLargerBetter - MinMaxScaler: math rendering - GeneralizedLinearRegressionSummary: aic doc is incorrect The reference documentation that was used was: http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ was: While reviewing the documentation of MLlib, I found some additional issues. Important issues that affect the binary signatures: - GBTClassificationModel: all the setters should be overriden - LogisticRegressionModel: setThreshold(s) - RandomForestClassificationModel: all the setters should be overriden - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but most of the methods are private[ml] -> do we need to expose this class for now? - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed - sqlDataTypes: name does not follow conventions. Do we need to expose it? Issues that involve only documentation: - Evaluator: 1. inconsistent doc between evaluate and isLargerBetter 2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the other method with the same name shows up). This may be a bug in scaladoc. - MinMaxScaler: math rendering - GeneralizedLinearRegressionSummary: aic doc is incorrect The reference documentation that was used was: http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ > Additional fixes to Mllib 2.0 documentation > --- > > Key: SPARK-16485 > URL: https://issues.apache.org/jira/browse/SPARK-16485 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Timothy Hunter > > While reviewing the documentation of MLlib, I found some additional issues. > Important issues that affect the binary signatures: > - GBTClassificationModel: all the setters should be overriden > - LogisticRegressionModel: setThreshold(s) > - RandomForestClassificationModel: all the setters should be overriden > - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but > most of the methods are private[ml] -> do we need to expose this class for > now? > - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should > not be exposed > - sqlDataTypes: name does not follow conventions. Do we need to expose it? > Issues that involve only documentation: > - Evaluator: > 1. 
inconsistent doc between evaluate and isLargerBetter > - MinMaxScaler: math rendering > - GeneralizedLinearRegressionSummary: aic doc is incorrect > The reference documentation that was used was: > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
[ https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371518#comment-15371518 ] Joseph K. Bradley edited comment on SPARK-16485 at 7/11/16 8:11 PM: sqlDataTypes: Yes, it is needed. See [SPARK-16074]. But the name could be corrected. was (Author: josephkb): sqlDataTypes: Yes, it is needed. See [SPARK-16074] > Additional fixes to Mllib 2.0 documentation > --- > > Key: SPARK-16485 > URL: https://issues.apache.org/jira/browse/SPARK-16485 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Timothy Hunter > > While reviewing the documentation of MLlib, I found some additional issues. > Important issues that affect the binary signatures: > - GBTClassificationModel: all the setters should be overriden > - LogisticRegressionModel: setThreshold(s) > - RandomForestClassificationModel: all the setters should be overriden > - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but > most of the methods are private[ml] -> do we need to expose this class for > now? > - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should > not be exposed > - sqlDataTypes: name does not follow conventions. Do we need to expose it? > Issues that involve only documentation: > - Evaluator: > 1. inconsistent doc between evaluate and isLargerBetter > 2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the > other method with the same name shows up). This may be a bug in scaladoc. > - MinMaxScaler: math rendering > - GeneralizedLinearRegressionSummary: aic doc is incorrect > The reference documentation that was used was: > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
[ https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371518#comment-15371518 ] Joseph K. Bradley commented on SPARK-16485: --- sqlDataTypes: Yes, it is needed. See [SPARK-16074] > Additional fixes to Mllib 2.0 documentation > --- > > Key: SPARK-16485 > URL: https://issues.apache.org/jira/browse/SPARK-16485 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Timothy Hunter > > While reviewing the documentation of MLlib, I found some additional issues. > Important issues that affect the binary signatures: > - GBTClassificationModel: all the setters should be overriden > - LogisticRegressionModel: setThreshold(s) > - RandomForestClassificationModel: all the setters should be overriden > - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but > most of the methods are private[ml] -> do we need to expose this class for > now? > - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should > not be exposed > - sqlDataTypes: name does not follow conventions. Do we need to expose it? > Issues that involve only documentation: > - Evaluator: > 1. inconsistent doc between evaluate and isLargerBetter > 2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the > other method with the same name shows up). This may be a bug in scaladoc. > - MinMaxScaler: math rendering > - GeneralizedLinearRegressionSummary: aic doc is incorrect > The reference documentation that was used was: > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
[ https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371517#comment-15371517 ] Joseph K. Bradley commented on SPARK-16485: --- MultivariateGaussian should stay IMO. It's small but useful. > Additional fixes to Mllib 2.0 documentation > --- > > Key: SPARK-16485 > URL: https://issues.apache.org/jira/browse/SPARK-16485 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Timothy Hunter > > While reviewing the documentation of MLlib, I found some additional issues. > Important issues that affect the binary signatures: > - GBTClassificationModel: all the setters should be overriden > - LogisticRegressionModel: setThreshold(s) > - RandomForestClassificationModel: all the setters should be overriden > - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but > most of the methods are private[ml] -> do we need to expose this class for > now? > - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should > not be exposed > - sqlDataTypes: name does not follow conventions. Do we need to expose it? > Issues that involve only documentation: > - Evaluator: > 1. inconsistent doc between evaluate and isLargerBetter > 2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the > other method with the same name shows up). This may be a bug in scaladoc. > - MinMaxScaler: math rendering > - GeneralizedLinearRegressionSummary: aic doc is incorrect > The reference documentation that was used was: > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16482) If a table's schema is inferred at runtime, describe table command does not show the schema
[ https://issues.apache.org/jira/browse/SPARK-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16482: - Affects Version/s: 2.0.0 > If a table's schema is inferred at runtime, describe table command does not > show the schema > --- > > Key: SPARK-16482 > URL: https://issues.apache.org/jira/browse/SPARK-16482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Priority: Critical > > If we create a table pointing to a parquet/json datasets without specifying > the schema, describe table command does not show the schema at all. It only > shows {{# Schema of this table is inferred at runtime}}. In 1.6, describe > table does show the schema of such a table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371513#comment-15371513 ] Nic Eggert commented on SPARK-15705: Double-checked just for fun. The problem still exists in RC2. > Spark won't read ORC schema from metastore for partitioned tables > - > > Key: SPARK-15705 > URL: https://issues.apache.org/jira/browse/SPARK-15705 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1) >Reporter: Nic Eggert >Priority: Critical > > Spark does not seem to read the schema from the Hive metastore for > partitioned tables stored as ORC files. It appears to read the schema from > the files themselves, which, if they were created with Hive, does not match > the metastore schema (at least not before Hive 2.0, see HIVE-4243). To > reproduce: > In Hive: > {code} > hive> create table default.test (id BIGINT, name STRING) partitioned by > (state STRING) stored as orc; > hive> insert into table default.test partition (state="CA") values (1, > "mike"), (2, "steve"), (3, "bill"); > {code} > In Spark: > {code} > scala> spark.table("default.test").printSchema > {code} > Expected result: Spark should preserve the column names that were defined in > Hive. > Actual Result: > {code} > root > |-- _col0: long (nullable = true) > |-- _col1: string (nullable = true) > |-- state: string (nullable = true) > {code} > Possibly related to SPARK-14959? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org