[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)
[ https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940654#comment-15940654 ]

Don Drake commented on SPARK-16745:
-----------------------------------

I just came across the same exception running Spark 2.1.0 on my Mac, running a few spark-shell commands on a tiny dataset that previously worked just fine. But I never got a result, just the timeout exceptions.

The issue is that today I'm running on a corporate VPN with proxy settings enabled, and the IP address the driver is using is my local (Wi-Fi) address, which the proxy server cannot connect to. This took a while to figure out, but I added {{--conf spark.driver.host=127.0.0.1}} to my command line, which forced all networking between the driver and executors to stay local (bypassing the proxy server), and the query came back in the expected amount of time.

> Spark job completed however have to wait for 13 mins (data size is small)
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16745
>                 URL: https://issues.apache.org/jira/browse/SPARK-16745
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 1.6.1
>        Environment: Mac OS X Yosemite, Terminal, MacBook Air Late 2014
>            Reporter: Joe Chong
>            Priority: Minor
>
> I submitted a job in the Scala spark-shell to show a DataFrame. The data size is about 43K. The job was successful in the end, but took more than 13 minutes to complete. Upon checking the log, there were multiple exceptions raised on "Failed to check existence of class", with a java.net.ConnectException message indicating a timeout trying to connect to port 52067, the REPL class-server port that Spark set up. Please assist to troubleshoot. Thanks.
>
> Started Spark in standalone mode:
> $ spark-shell --driver-memory 5g --master local[*]
> 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(joechong); users with modify permissions: Set(joechong)
> 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server
> 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:30 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52067
> 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class server' on port 52067.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
>
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(joechong); users with modify permissions: Set(joechong)
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' on port 52072.
> 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/07/26 21:05:35 INFO Remoting: Starting remoting
> 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074]
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 52074.
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6
> 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity 3.4 GB
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:36 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
> 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
> 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at http://10.199.29.218:4040
> 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host localhost
> 16/07/26 21:05:36 INFO
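Based on the comment above, the full invocation with the workaround would look roughly like the following. This is a sketch: the memory and master settings are simply the ones from the quoted session, and only the {{spark.driver.host}} override is the part the comment actually recommends.

```shell
# Force driver <-> executor traffic onto the loopback interface so it
# never reaches the corporate proxy (workaround described in the comment).
spark-shell --driver-memory 5g --master "local[*]" \
  --conf spark.driver.host=127.0.0.1
```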
[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862560#comment-15862560 ]

Don Drake commented on SPARK-19477:
-----------------------------------

How does lazy evaluation apply here? If I read/create a DataFrame with extra columns, then do {{ds = df.as[XYZ]}}, then immediately {{ds.write.parquet("file")}}, the write should trigger any lazy functionality, if I understand this correctly. Do you have a suggested workaround? I'm currently retrieving the encoder for the case class to get its schema, then calling {{ds.select()}} on the columns from that schema.

> [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19477
>                 URL: https://issues.apache.org/jira/browse/SPARK-19477
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset.
> For example, in 1.6 the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3: String)
> defined class F
>
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
>
> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]
>
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
>
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> +---+---+---+
> {code}
> This seems to have changed in Spark 2.0 and also 2.1. Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3: String)
> defined class F
>
> scala> import spark.implicits._
> import spark.implicits._
>
> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]
>
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]
>
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
>
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
>
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]
>
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
>
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))
>
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
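The workaround described in the comment above (retrieve the case class's encoder, then select only its columns) can be sketched roughly as follows, assuming a spark-shell session with the {{df}} and case class {{F}} from the quoted example. This is an untested sketch of the idea, not a confirmed recipe:

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col

// Get the column names the encoder for F expects, then project the
// DataFrame down to exactly those columns before converting to a Dataset,
// so the extra column c4 is dropped.
val expectedCols = Encoders.product[F].schema.fieldNames  // f1, f2, f3
val pruned = df.select(expectedCols.map(col): _*).as[F]
```

Whether Spark should do this projection implicitly in {{as[F]}} is exactly what this ticket debates.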
[jira] [Reopened] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Don Drake reopened SPARK-19477:
-------------------------------

I'm struggling with this answer. I thought the point of Datasets was to have a strongly typed definition, rather than the more loosely defined DataFrame. Why does it matter whether I use relational or typed methods to access it? It works if I call {{map()}} against it:

{code}
scala> ds.map(x => x).take(1)
res7: Array[F] = Array(F(a,b,c))
{code}

But the real problem I'm having is that when I attempt to save the Dataset, the schema is ignored:

{code}
scala> ds.write.parquet("a")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

scala> val ds2 = spark.read.parquet("a").as[F]
ds2: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds2.printSchema
root
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- f3: string (nullable = true)
 |-- c4: string (nullable = true)
{code}

IMHO, the c4 column should not have been saved.
[jira] [Created] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
Don Drake created SPARK-19477:
---------------------------------

             Summary: [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
                 Key: SPARK-19477
                 URL: https://issues.apache.org/jira/browse/SPARK-19477
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Don Drake

In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset.

For example, in 1.6 the column c4 is gone:

{code}
scala> case class F(f1: String, f2: String, f3: String)
defined class F

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]

scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
{code}

This seems to have changed in Spark 2.0 and also 2.1. Spark 2.1.0:

{code}
scala> case class F(f1: String, f2: String, f3: String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]

scala> fEncoder.schema == ds.schema
res2: Boolean = false

scala> ds.schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))

scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
{code}
[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659978#comment-15659978 ]

Don Drake commented on SPARK-18207:
-----------------------------------

Hi, I was able to download a nightly SNAPSHOT release and verify that it resolves the issue for my project. Thanks to everyone who contributed to this fix and got it merged in a timely manner.

> class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18207
>                 URL: https://issues.apache.org/jira/browse/SPARK-18207
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Don Drake
>            Assignee: Kazuaki Ishizaki
>             Fix For: 2.1.0
>
>         Attachments: spark-18207.txt
>
> I have 2 wide dataframes that contain nested data structures. When I explode one of the dataframes, it doesn't include records with an empty nested structure (outer explode is not supported), so I create a similar dataframe with null values and union them together. See SPARK-13721 for more details as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. I was asked by [~lwlin] to open this JIRA.
> I will attach a code snippet that can be pasted into spark-shell that duplicates my code and the exception. This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 (TID 812, somehost.mydomain.com, executor 8): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
> {code}
[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629177#comment-15629177 ]

Don Drake commented on SPARK-18207:
-----------------------------------

The difference between my case and the other test cases is that my scenario involves a wide dataframe (800+ columns) that also has multiple nested structures (arrays of classes) involved in a SQL query (a union). I have verified that [~lwlin]'s fix does not work for my case, though it does work for wide dataframes without nested structures. I agree it's similar to the others, but more complicated to reproduce.
[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629140#comment-15629140 ]

Don Drake commented on SPARK-18207:
-----------------------------------

I opened it based on [~lwlin]'s suggestion in the comments of SPARK-16845.
[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626664#comment-15626664 ]

Don Drake edited comment on SPARK-16845 at 11/2/16 12:32 AM:
-------------------------------------------------------------

I've been struggling to duplicate this and finally came up with a strategy that duplicates it in a spark-shell. It's a combination of a wide dataset with nested (array) structures and performing a union that seems to trigger it. I opened SPARK-18207.

was (Author: dondrake):
I've been struggling to duplicate this and finally came up with a strategy that duplicates it in a spark-shell. It's a combination of a wide dataset with nested (array) structures and performing a union that seems to trigger it. I'll open a new JIRA.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16845
>                 URL: https://issues.apache.org/jira/browse/SPARK-16845
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API, ML, MLlib
>    Affects Versions: 2.0.0
>            Reporter: hejie
>         Attachments: error.txt.zip
>
> I have a wide table (400 columns); when I try fitting the training data on all columns, the fatal error occurs.
> ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
[jira] [Updated] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Don Drake updated SPARK-18207:
------------------------------
    Description:
I have 2 wide dataframes that contain nested data structures. When I explode one of the dataframes, it doesn't include records with an empty nested structure (outer explode is not supported), so I create a similar dataframe with null values and union them together. See SPARK-13721 for more details as to why I have to do this.

I was hoping that SPARK-16845 was going to address my issue, but it does not. I was asked by [~lwlin] to open this JIRA.

I will attach a code snippet that can be pasted into spark-shell that duplicates my code and the exception. This worked just fine in Spark 1.6.x.

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 (TID 812, somehost.mydomain.com, executor 8): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
{code}

  was:
I have 2 wide dataframes that contain nested data structures. When I explode one of the dataframes, it doesn't include records with an empty nested structure (outer explode is not supported), so I create a similar dataframe with null values and union them together. See SPARK-13721 for more details as to why I have to do this.

I was hoping that SPARK-16845 was going to address my issue, but it does not. I was asked by [~lwlin] to open this JIRA.

I will attach a code snippet that can be pasted into spark-shell that duplicates my code and the exception. This worked just fine in Spark 1.6.x.
[jira] [Updated] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Don Drake updated SPARK-18207:
------------------------------
    Attachment: spark-18207.txt

Please read the comments at the top of the attachment; you need to :paste portions of it into spark-shell.
[jira] [Created] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
Don Drake created SPARK-18207:
---------------------------------

             Summary: class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
                 Key: SPARK-18207
                 URL: https://issues.apache.org/jira/browse/SPARK-18207
             Project: Spark
          Issue Type: Bug
          Components: Optimizer, SQL
    Affects Versions: 2.0.1, 2.1.0
            Reporter: Don Drake

I have 2 wide dataframes that contain nested data structures. When I explode one of the dataframes, it doesn't include records with an empty nested structure (outer explode is not supported), so I create a similar dataframe with null values and union them together. See SPARK-13721 for more details as to why I have to do this.

I was hoping that SPARK-16845 was going to address my issue, but it does not. I was asked by [~lwlin] to open this JIRA.

I will attach a code snippet that can be pasted into spark-shell that duplicates my code and the exception. This worked just fine in Spark 1.6.x.
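The shape of the reproduction described above can be sketched roughly as follows. This is a hypothetical outline only, with made-up names and a toy two-field case class (the real case has 800+ columns; the actual snippet is in the attachment spark-18207.txt), assuming a spark-shell session with {{spark.implicits._}} imported:

```scala
// Hypothetical sketch of the repro shape, NOT the attached snippet:
// a wide type with a nested array, exploded on one side and unioned
// with a null-filled counterpart.
import org.apache.spark.sql.functions.{col, explode, lit}

case class Inner(a: String, b: String)
case class Wide(id: String, items: Seq[Inner])  // real case: 800+ columns

val withItems  = Seq(Wide("1", Seq(Inner("x", "y")))).toDF
val emptyItems = Seq(Wide("2", Seq.empty[Inner])).toDF

// explode() drops rows whose array is empty (no outer explode; see SPARK-13721)...
val exploded = withItems.withColumn("item", explode(col("items")))

// ...so build a matching dataframe with a null "item" column and union them.
val nulled = emptyItems.withColumn("item",
  lit(null).cast(exploded.schema("item").dataType))
val combined = exploded.union(nulled)  // at real width, codegen exceeds 64 KB here
```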
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626664#comment-15626664 ]

Don Drake commented on SPARK-16845:
-----------------------------------

I've been struggling to duplicate this and finally came up with a strategy that duplicates it in a spark-shell. It's a combination of a wide dataset with nested (array) structures and performing a union that seems to trigger it. I'll open a new JIRA.
[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Don Drake updated SPARK-16845:
------------------------------
    Attachment: error.txt.zip

Does this generated code help in resolving this?
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614100#comment-15614100 ] Don Drake commented on SPARK-16845: --- I'm struggling to get a simple test case created. I'm curious, though: if I compile my .jar file using sbt with Spark 2.0.1 but use your compiled branch of Spark 2.1.0-SNAPSHOT as the runtime (spark-submit), would you expect it to work? When using your compiled branch of Spark 2.1.0-SNAPSHOT and executing a spark-shell, the test cases provided in this JIRA pass, but my code fails. Also, the compiler error says "grows beyond 64 KB", yet it generates over 400 KB of source code. I'll try to attach the generated Java code from the exact error message.
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598442#comment-15598442 ] Don Drake commented on SPARK-16845: --- Update: It turns out that I am still getting this exception. I'll try to create a test case to duplicate it. Basically, I'm exploding a nested data structure, then doing a union, and then saving to Parquet. The resulting table has over 400 columns. I verified in spark-shell that the exceptions do not occur with the test cases provided. Can you point me to your other solution? I can see if that works.
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592862#comment-15592862 ] Don Drake commented on SPARK-16845: --- I compiled your branch and ran my large job, and it finished successfully. Sorry for the confusion; I wasn't watching the PR, just this JIRA, and wasn't aware of the changes you were making. Can this get merged as well as backported to 2.0.x? Thanks so much. -Don
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590256#comment-15590256 ] Don Drake commented on SPARK-16845: --- [~lwlin] I saw your PR, but noticed it's failing some tests. Just curious if you will have some time to resolve this. If I can help, please let me know. -Don
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567358#comment-15567358 ] Don Drake commented on SPARK-16845: --- I can't at the moment; mine is not simple. But this JIRA has one: https://issues.apache.org/jira/browse/SPARK-17092
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565592#comment-15565592 ] Don Drake commented on SPARK-16845: --- Unfortunately, it does not work around it.
{code}
16/10/10 18:19:47 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
{code}
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543555#comment-15543555 ] Don Drake commented on SPARK-16845: --- I just hit this bug as well. Are there any suggested workarounds?
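[Editorial note] One mitigation commonly suggested for 64 KB codegen errors is disabling whole-stage code generation. A sketch only — as noted elsewhere in this thread, it does not help in every case, since the SpecificOrdering path can still exceed the limit:

{code}
// Possible (not guaranteed) mitigation: fall back from whole-stage codegen
// so Spark generates smaller, per-operator code.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Equivalent at submit time:
//   spark-shell --conf spark.sql.codegen.wholeStage=false
{code}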
[jira] [Created] (SPARK-17384) SQL - Running query with outer join from 1.6 fails
Don Drake created SPARK-17384: - Summary: SQL - Running query with outer join from 1.6 fails Key: SPARK-17384 URL: https://issues.apache.org/jira/browse/SPARK-17384 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Don Drake I have some complex (10-table joins) SQL queries that utilize outer joins that work fine in Spark 1.6.2, but fail under Spark 2.0. I was able to duplicate the problem using a simple test case. Here's the code for Spark 2.0 that doesn't run (this runs fine in Spark 1.6.2): {code} case class C1(f1: String, f2: String, f3: String, f4: String) case class C2(g1: String, g2: String, g3: String, g4: String) case class C3(h1: String, h2: String, h3: String, h4: String) val sqlContext = spark.sqlContext val c1 = sc.parallelize(Seq( C1("h1", "c1a1", "c1b1", "c1c1"), C1("h2", "c1a2", "c1b2", "c1c2"), C1(null, "c1a3", "c1b3", "c1c3") )).toDF c1.createOrReplaceTempView("c1") val c2 = sc.parallelize(Seq( C2("h1", "c2a1", "c2b1", "c2c1"), C2("h2", "c2a2", "c2b2", "c2c2"), C2(null, "c2a3", "c2b3", "c2c3"), C2(null, "c2a4", "c2b4", "c2c4"), C2("h333", "c2a333", "c2b333", "c2c333") )).toDF c2.createOrReplaceTempView("c2") val c3 = sc.parallelize(Seq( C3("h1", "c3a1", "c3b1", "c3c1"), C3("h2", "c3a2", "c3b2", "c3c2"), C3(null, "c3a3", "c3b3", "c3c3") )).toDF c3.createOrReplaceTempView("c3") // doesn't work in Spark 2.0, works in Spark 1.6 val bad_df = sqlContext.sql(""" select * from c1, c3 left outer join c2 on (c1.f1 = c2.g1) where c1.f1 = c3.h1 """).show() // works in both val works_df = sqlContext.sql(""" select * from c1 left outer join c2 on (c1.f1 = c2.g1), c3 where c1.f1 = c3.h1 """).show() {code} Here's the output after running bad_df in Spark 2.0: {code} scala> val bad_df = sqlContext.sql(""" | select * | from c1, c3 | left outer join c2 on (c1.f1 = c2.g1) | where c1.f1 = c3.h1 | """).show() org.apache.spark.sql.AnalysisException: cannot resolve '`c1.f1`' given input columns: [h3, g3, h4, g2, g4, h2, h1, g1]; line 4 pos 25 
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125) at
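[Editorial note] The analysis failure above is consistent with standard SQL precedence: an explicit JOIN binds more tightly than the comma (implicit cross join), so in {{from c1, c3 left outer join c2 on (c1.f1 = c2.g1)}} the ON clause sees only c3 and c2 — which matches the {{given input columns: [h3, g3, h4, g2, g4, h2, h1, g1]}} list in the error. Spark 2.0's analyzer enforces this where 1.6 was lenient. A hedged sketch of an explicit rewrite, same intent as the working {{works_df}} query in the report (on some 2.0.x builds the cross join may also require {{spark.sql.crossJoin.enabled=true}}):

{code}
// Sketch: spell out the join order so the ON clause can reference c1.
val rewritten = sqlContext.sql("""
  select *
  from c1
  left outer join c2 on (c1.f1 = c2.g1)
  cross join c3
  where c1.f1 = c3.h1
""")
rewritten.show()
{code}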
[jira] [Commented] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()
[ https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15459490#comment-15459490 ] Don Drake commented on SPARK-13721: --- My nested structures aren't simple types; they are structs (case classes), so this existing method works great for me. This ticket is about modifying the explode() call to support outer, not adding outer to the DataFrame API. > Add support for LATERAL VIEW OUTER explode() > > > Key: SPARK-13721 > URL: https://issues.apache.org/jira/browse/SPARK-13721 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Ian Hellstrom > > Hive supports the [LATERAL VIEW > OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews] > syntax to make sure that when an array is empty, the content from the outer > table is still returned. > Within Spark, this is currently only possible within the HiveContext and > executing HiveQL statements. It would be nice if the standard explode() > DataFrame method allows the same. A possible signature would be: > {code:scala} > explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = > false) > {code}
[jira] [Commented] (SPARK-17341) Can't read Parquet data with fields containing periods "."
[ https://issues.apache.org/jira/browse/SPARK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453806#comment-15453806 ] Don Drake commented on SPARK-17341: --- I just downloaded the nightly build from 8/31/2016 and gave it a try, and it worked:
{code}
scala> inSquare.take(2)
res2: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])

scala> inSquare.show(false)
+-----+-------------+
|value|squared.value|
+-----+-------------+
|1    |1            |
|2    |4            |
|3    |9            |
|4    |16           |
|5    |25           |
+-----+-------------+
{code}
Thanks. > Can't read Parquet data with fields containing periods "." > -- > > Key: SPARK-17341 > URL: https://issues.apache.org/jira/browse/SPARK-17341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Don Drake > > I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have > encountered a showstopper problem with Parquet datasets that have fields > containing a "." in a field name. This data comes from an external provider > (CSV) and we just pass through the field names. This has worked flawlessly > in Spark 1.5 and 1.6, but now Spark can't seem to read these parquet files. > {code} > Spark context available as 'sc' (master = local[*], app id = > local-1472664486578). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.0 > /_/ > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * > i)).toDF("value", "squared.value") > 16/08/31 12:28:44 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int] > scala> squaresDF.take(2) > res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4]) > scala> squaresDF.write.parquet("squares") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet page size to 1048576 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 
31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet page size to 1048576 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet page size to 1048576 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet page size to 1048576 > Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet page size to 1048576 >
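[Editorial note] Columns whose names contain a period are addressed in Spark SQL with backtick quoting, which distinguishes a literal {{squared.value}} column from access to field {{value}} of a struct {{squared}}. A sketch reusing the report's {{squaresDF}}; the registered view name is invented:

{code}
import org.apache.spark.sql.functions.col

// Backticks select the literal column named "squared.value".
squaresDF.select(col("`squared.value`")).show()

// Same idea through SQL (hypothetical registered view name):
squaresDF.createOrReplaceTempView("squares_view")
spark.sql("select `squared.value` from squares_view").show()
{code}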
[jira] [Created] (SPARK-17341) Can't read Parquet data with fields containing periods "."
Don Drake created SPARK-17341: - Summary: Can't read Parquet data with fields containing periods "." Key: SPARK-17341 URL: https://issues.apache.org/jira/browse/SPARK-17341 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Don Drake
[jira] [Commented] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()
[ https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452846#comment-15452846 ] Don Drake commented on SPARK-13721: --- Spark 2.0 has deprecated this function; what workarounds are suggested?
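[Editorial note] Two routes existed around the deprecation: the HiveQL syntax is still accepted through {{spark.sql}}, and later releases (Spark 2.2+, where this ticket's functionality landed) expose {{explode_outer}} on DataFrames. A sketch with an invented table {{events}} holding an array column {{tags}}:

{code}
// HiveQL route: LATERAL VIEW OUTER keeps rows whose array is empty or null.
spark.sql("""
  select id, tag
  from events
  lateral view outer explode(tags) t as tag
""").show()

// DataFrame route, available in Spark 2.2+:
//   import org.apache.spark.sql.functions.explode_outer
//   events.select($"id", explode_outer($"tags").as("tag"))
{code}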
[jira] [Commented] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)
[ https://issues.apache.org/jira/browse/SPARK-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295512#comment-15295512 ] Don Drake commented on SPARK-15467: --- Vishnu, the 22-field limitation applies to Scala 2.10.x; Spark 2.0 uses Scala 2.11.x, which raises the limit to 254 fields. > Getting stack overflow when attempting to query a wide Dataset (>200 fields) > > > Key: SPARK-15467 > URL: https://issues.apache.org/jira/browse/SPARK-15467 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Don Drake > > This can be duplicated in a spark-shell; I am running Spark 2.0.0-preview. > {code} > import spark.implicits._ > case class Wide( > val f0:String = "", > val f1:String = "", > val f2:String = "", > [... fields f3 through f173, each declared identically as String with default "", elided for brevity ...] > val f174:String = "", >
[jira] [Created] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)
Don Drake created SPARK-15467: - Summary: Getting stack overflow when attempting to query a wide Dataset (>200 fields) Key: SPARK-15467 URL: https://issues.apache.org/jira/browse/SPARK-15467 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Don Drake This can be duplicated in a spark-shell; I am running Spark 2.0.0-preview. {code} import spark.implicits._ case class Wide( val f0:String = "", val f1:String = "", val f2:String = "", [... fields f3 through f205, each declared identically as String with default "", elided for brevity ...] val f206:String = "", val
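As an aside for anyone reproducing this ticket: the 200+ identical fields of {{Wide}} need not be typed by hand. A small Python sketch (a hypothetical helper, not part of the ticket) can emit the Scala source to paste into spark-shell:

```python
# Hypothetical helper: generate the Scala source for a wide case class
# like the 207-field `Wide` used in this report (f0 through f206).
def wide_case_class(name, n_fields):
    fields = ",\n".join(f'  val f{i}: String = ""' for i in range(n_fields))
    return f"case class {name}(\n{fields}\n)"

source = wide_case_class("Wide", 207)
print(source.count("val f"))  # one field declaration per line
```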
[jira] [Commented] (SPARK-11085) Add support for HTTP proxy
[ https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954989#comment-14954989 ] Don Drake commented on SPARK-11085: --- Neither of the options work. > Add support for HTTP proxy > --- > > Key: SPARK-11085 > URL: https://issues.apache.org/jira/browse/SPARK-11085 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Reporter: Dustin Cote >Priority: Minor > > Add a way to update ivysettings.xml for the spark-shell and spark-submit to > support proxy settings for clusters that need to access a remote repository > through an http proxy. Typically this would be done like: > JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 > -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080" > Directly in the ivysettings.xml would look like: > > proxyport="8080" > nonproxyhosts="nonproxy.host"/> > > Even better would be a way to customize the ivysettings.xml with command > options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
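For reference, the JAVA_OPTS approach quoted in the issue is usually wired into the driver JVM via spark-submit's {{--driver-java-options}} flag. A sketch, using the ticket's placeholder host and port values (note the commenter above reports that neither option worked in their environment, so treat this as the commonly suggested wiring rather than a confirmed fix):

```shell
# Sketch only: pass JVM proxy properties to the driver so ivy dependency
# resolution goes through the corporate HTTP proxy.
# proxy.host / 8080 are the placeholder values from the issue description.
spark-shell \
  --driver-java-options "-Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 \
    -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080" \
  --packages com.databricks:spark-csv_2.10:1.0.3
```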
[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON
[ https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736986#comment-14736986 ] Don Drake commented on SPARK-10441: --- Got it, thanks for the clarification. > Cannot write timestamp to JSON > -- > > Key: SPARK-10441 > URL: https://issues.apache.org/jira/browse/SPARK-10441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Fix For: 1.6.0, 1.5.1 > >
[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON
[ https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735690#comment-14735690 ] Don Drake commented on SPARK-10441: --- I see that PR 8597 was merged into master. Does master represent 1.5.1? I'm curious whether this will be part of 1.5.0, as it's blocking me from upgrading at the moment. Thanks. > Cannot write timestamp to JSON > -- > > Key: SPARK-10441 > URL: https://issues.apache.org/jira/browse/SPARK-10441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > Fix For: 1.6.0, 1.5.1 > >
[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596001#comment-14596001 ] Don Drake commented on SPARK-8368: -- I've verified through a nightly build that this resolves my issue (SPARK-8365). Thanks! ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the project on Windows 7 and run in a spark standalone cluster(or local) mode on Centos 6.X. Reporter: CHEN Zhiwei Assignee: Yin Huai Priority: Blocker Fix For: 1.4.1, 1.5.0 After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the following exception: ==begin exception {quote} Exception in thread main java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at 
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplified the code that causes this issue, as follows: ==begin code== {noformat}
object Model extends Serializable {
  def main(args: Array[String]) {
    val Array(sql) = args
    val sparkConf = new SparkConf().setAppName("Mode Example")
    val sc = new SparkContext(sparkConf)
    val hive = new HiveContext(sc)
    // get data by hive sql
    val rows = hive.sql(sql)
    val data = rows.map(r => {
      val arr = r.toSeq.toArray
      val label = 1.0
      def fmap = (input: Any) => 1.0
      val feature = arr.map(_ => 1.0)
      LabeledPoint(label, Vectors.dense(feature))
    })
    data.count()
  }
}
{noformat} ===end code=== This code runs fine in spark-shell, but it errors when submitted to a spark cluster (standalone or local mode). I tried the same code on spark 1.3.0 (local mode), and no exception is encountered.
[jira] [Commented] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590290#comment-14590290 ] Don Drake commented on SPARK-8365: -- Is there a workaround that you are aware of? pyspark does not retain --packages or --jars passed on the command line as of 1.4.0 --- Key: SPARK-8365 URL: https://issues.apache.org/jira/browse/SPARK-8365 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Don Drake Priority: Blocker I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing Python Spark application against it and got the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o90.save. : java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv I pass the following on the command-line to my spark-submit: --packages com.databricks:spark-csv_2.10:1.0.3 This worked fine on 1.3.1, but not in 1.4. I was able to replicate it with the following pyspark: {code} a = {'a':1.0, 'b':'asdf'} rdd = sc.parallelize([a]) df = sqlContext.createDataFrame(rdd) df.save(/tmp/d.csv, com.databricks.spark.csv) {code} Even using the new df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same error. I see it was added in the web UI: file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar Added By User file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar Added By User http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar Added By User http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar Added By User Thoughts? 
*I also attempted using the Scala spark-shell to load a csv using the same package and it worked just fine, so this seems specific to pyspark.* -Don Gory details: {code} $ pyspark --packages com.databricks:spark-csv_2.10:1.0.3 Python 2.7.6 (default, Sep 9 2014, 15:04:36) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin Type help, copyright, credits or license for more information. Ivy Default Cache set to: /Users/drake/.ivy2/cache The jars for the packages stored in: /Users/drake/.ivy2/jars :: loading settings :: url = jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml com.databricks#spark-csv_2.10 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found com.databricks#spark-csv_2.10;1.0.3 in central found org.apache.commons#commons-csv;1.1 in central :: resolution report :: resolve 590ms :: artifacts dl 17ms :: modules in use: com.databricks#spark-csv_2.10;1.0.3 from central in [default] org.apache.commons#commons-csv;1.1 from central in [default] - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 2 | 0 | 0 | 0 || 2 | 0 | - :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 2 already retrieved (0kB/15ms) Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0 2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from SCDynamicStore 15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on interface en0) 15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake 15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake 15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(drake); users with modify permissions: Set(drake) 15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started 15/06/13 11:06:10 INFO Remoting: Starting remoting 15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.222:56870] 15/06/13 11:06:10 INFO Utils: Successfully started service 'sparkDriver' on port
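Regarding the workaround question above: one approach sometimes suggested for pyspark of this era (an assumption here, not confirmed anywhere in this thread) is to put the submit flags into the {{PYSPARK_SUBMIT_ARGS}} environment variable, which pyspark's launcher reads, so the JVM is started with the packages already resolved:

```shell
# Hypothetical workaround sketch, not a confirmed fix for SPARK-8365:
# export the submit flags so pyspark's launch script picks them up.
export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
pyspark
```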
[jira] [Created] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0
Don Drake created SPARK-8365: Summary: pyspark does not retain --packages or --jars passed on the command line as of 1.4.0 Key: SPARK-8365 URL: https://issues.apache.org/jira/browse/SPARK-8365 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Don Drake I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing Python Spark application against it and got the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o90.save. : java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv I pass the following on the command-line to my spark-submit: --packages com.databricks:spark-csv_2.10:1.0.3 This worked fine on 1.3.1, but not in 1.4. I was able to replicate it with the following pyspark: {code} a = {'a':1.0, 'b':'asdf'} rdd = sc.parallelize([a]) df = sqlContext.createDataFrame(rdd) df.save(/tmp/d.csv, com.databricks.spark.csv) {code} Even using the new df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same error. I see it was added in the web UI: file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jarAdded By User file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar Added By User http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jarAdded By User http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar Added By User Thoughts? *I also attempted using the Scala spark-shell to load a csv using the same package and it worked just fine, so this seems specific to pyspark.* -Don Gory details: {code} $ pyspark --packages com.databricks:spark-csv_2.10:1.0.3 Python 2.7.6 (default, Sep 9 2014, 15:04:36) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin Type help, copyright, credits or license for more information. 
Ivy Default Cache set to: /Users/drake/.ivy2/cache The jars for the packages stored in: /Users/drake/.ivy2/jars :: loading settings :: url = jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml com.databricks#spark-csv_2.10 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found com.databricks#spark-csv_2.10;1.0.3 in central found org.apache.commons#commons-csv;1.1 in central :: resolution report :: resolve 590ms :: artifacts dl 17ms :: modules in use: com.databricks#spark-csv_2.10;1.0.3 from central in [default] org.apache.commons#commons-csv;1.1 from central in [default] - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 2 | 0 | 0 | 0 || 2 | 0 | - :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 2 already retrieved (0kB/15ms) Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0 2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from SCDynamicStore 15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on interface en0) 15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake 15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake 15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(drake); users with modify permissions: Set(drake) 15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started 15/06/13 11:06:10 INFO Remoting: Starting remoting 15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.222:56870] 15/06/13 11:06:10 INFO Utils: Successfully started service 'sparkDriver' on port 56870. 15/06/13 11:06:10 INFO SparkEnv: Registering MapOutputTracker 15/06/13 11:06:10 INFO SparkEnv: Registering BlockManagerMaster 15/06/13 11:06:10 INFO DiskBlockManager: Created local directory at /private/var/folders/7_/k5h82ws97b95v5f5h8wf9j0hgn/T/spark-f36f39f5-7f82-42e0-b3e0-9eb1e1cc0816/blockmgr-a1412b71-fe56-429c-a193-ce3fb95d2ffd 15/06/13
[jira] [Created] (SPARK-7781) GradientBoostedTrees.trainRegressor is missing maxBins parameter in pyspark
Don Drake created SPARK-7781: Summary: GradientBoostedTrees.trainRegressor is missing maxBins parameter in pyspark Key: SPARK-7781 URL: https://issues.apache.org/jira/browse/SPARK-7781 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Don Drake I'm running Spark v1.3.1 and when I run the following against my dataset: {code} model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) The job will fail with the following message: Traceback (most recent call last): File /Users/drake/fd/spark/mltest.py, line 73, in module model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3) File /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py, line 553, in trainRegressor loss, numIterations, learningRate, maxDepth) File /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py, line 438, in _train loss, numIterations, learningRate, maxDepth) File /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py, line 120, in callMLlibFunc return callJavaFunc(sc, api, *args) File /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py, line 113, in callJavaFunc return _java2py(sc, func(*args)) File /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95 py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel. 
: java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895) at scala.Predef$.require(Predef.scala:233) at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128) at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138) at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60) at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150) at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63) at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96) at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595) {code} So, it's complaining about maxBins. If I provide maxBins=1900 and re-run it: {code} model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900) Traceback (most recent call last): File "/Users/drake/fd/spark/mltest.py", line 73, in <module> model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900) TypeError: trainRegressor() got an unexpected keyword argument 'maxBins' {code} It now says it knows nothing of maxBins. If I run the same command against DecisionTree or RandomForest (with maxBins=1900), it works just fine. Seems like a bug in GradientBoostedTrees.
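The first error in the report above is simple arithmetic: MLlib requires maxBins to be at least the largest category count among the categorical features (1895 here, versus the default of 32). A plain-Python sketch of that check (function and variable names are illustrative, not MLlib API):

```python
def required_max_bins(categorical_features_info, default_bins=32):
    """Return a maxBins value satisfying DecisionTree's requirement:
    maxBins >= max number of categories among categorical features."""
    if not categorical_features_info:
        return default_bins
    return max(default_bins, max(categorical_features_info.values()))

# e.g. a feature with 1895 categories, as in the traceback above:
cat_features = {0: 4, 3: 1895}
print(required_max_bins(cat_features))  # 1895
```

This explains why passing maxBins=1900 would have satisfied the requirement; the remaining bug reported here is that pyspark's trainRegressor did not accept the keyword at all.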
[jira] [Updated] (SPARK-7182) [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Don Drake updated SPARK-7182: - Summary: [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns (was: [SQL] Can't remove or save DataFrame from a join due to duplicate columns) [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns Key: SPARK-7182 URL: https://issues.apache.org/jira/browse/SPARK-7182 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Don Drake I'm having trouble saving a dataframe as parquet after performing a simple table join. Below is a trivial example that demonstrates the issue. The following is from a pyspark session: {code} d1=[{'a':1, 'b':2, 'c':3}] d2=[{'a':1, 'b':2, 'd':4}] t1 = sqlContext.createDataFrame(d1) t2 = sqlContext.createDataFrame(d2) j = t1.join(t2, t1.a==t2.a and t1.b==t2.b) j DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint] {code} Try to get a unique list of the columns: {code} u = sorted(list(set(j.columns))) nt = j.select(*u) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols)) File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229) {code} That didn't work. Next, save the file (that works), but reading it back in fails: {code} j.saveAsParquetFile('j') z = sqlContext.parquetFile('j') z.take(1) ... : An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) {code}
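The root problem in the report above is that the joined schema carries two columns for each shared name, so deduplicating the name list cannot say which 'a' or 'b' is meant. A plain-Python sketch of the collision (no Spark required), plus the usual workaround of renaming one side's columns before joining (the `prefixed` helper is illustrative, not pyspark API):

```python
# Joined schema from the ticket's example: both inputs contribute 'a' and 'b'.
joined_columns = ['a', 'b', 'c', 'a', 'b', 'd']

# Deduplicating names loses track of which side each 'a'/'b' came from,
# which is why select() then reports the reference as ambiguous.
unique = sorted(set(joined_columns))  # ['a', 'b', 'c', 'd']

# Workaround sketch: prefix the right side's column names before the join,
# so every name in the combined schema is distinct and unambiguous.
def prefixed(columns, prefix):
    return [f'{prefix}_{c}' for c in columns]

right = prefixed(['a', 'b', 'd'], 'r')  # ['r_a', 'r_b', 'r_d']
combined = ['a', 'b', 'c'] + right
assert len(combined) == len(set(combined))  # no duplicate names remain
```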
[jira] [Updated] (SPARK-7182) [SQL] Can't remove or save DataFrame from a join due to duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Don Drake updated SPARK-7182:
-----------------------------
Description:

I'm having trouble saving a DataFrame as Parquet after performing a simple table join. Below is a trivial example, from a pyspark session, that demonstrates the issue:

{code}
d1 = [{'a': 1, 'b': 2, 'c': 3}]
d2 = [{'a': 1, 'b': 2, 'd': 4}]
t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)
j = t1.join(t2, t1.a == t2.a and t1.b == t2.b)
j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]
{code}

Try to get a unique list of the columns:

{code}
u = sorted(list(set(j.columns)))
nt = j.select(*u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select
    jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)
{code}

That didn't work. Saving the DataFrame as Parquet succeeds, but reading it back in fails:

{code}
j.saveAsParquetFile('j')
z = sqlContext.parquetFile('j')
z.take(1)
...
: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}

> [SQL] Can't remove or save DataFrame from a join due to duplicate columns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-7182
>                 URL: https://issues.apache.org/jira/browse/SPARK-7182
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Don Drake
[jira] [Created] (SPARK-7182) [SQL] Can't remove or save DataFrame from a join due to duplicate columns
Don Drake created SPARK-7182:
Summary: [SQL] Can't remove or save DataFrame from a join due to duplicate columns
Key: SPARK-7182
URL: https://issues.apache.org/jira/browse/SPARK-7182
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Don Drake

I'm having trouble saving a DataFrame as Parquet after performing a simple table join. Below is a trivial example, from a pyspark session, that demonstrates the issue:

{code}
d1 = [{'a': 1, 'b': 2, 'c': 3}]
d2 = [{'a': 1, 'b': 2, 'd': 4}]
t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)
j = t1.join(t2, t1.a == t2.a and t1.b == t2.b)
j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]
{code}

Try to get a unique list of the columns:

{code}
u = sorted(list(set(j.columns)))
nt = j.select(*u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select
    jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)
{code}

That didn't work. Saving the DataFrame as Parquet succeeds, but reading it back in fails:

{code}
j.saveAsParquetFile('j')
z = sqlContext.parquetFile('j')
z.take(1)
...
: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
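An aside on the reproduction above (this note is not part of the original report): the join condition is written with Python's `and`, as in `t1.a == t2.a and t1.b == t2.b`. Since `and` evaluates truthiness and returns one of its operands rather than combining them, such a condition can silently collapse to only the second comparison; pyspark Columns need the `&` operator for a real conjunction. A minimal pure-Python sketch, using a hypothetical stand-in `Col` class purely to illustrate the language behavior:

```python
class Col:
    """Stand-in for a pyspark Column: `==` builds an expression object
    instead of returning a boolean (hypothetical, for illustration only)."""
    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        return Col("(%s = %s)" % (self.expr, other.expr))

a_match = Col("t1.a") == Col("t2.a")
b_match = Col("t1.b") == Col("t2.b")

# Python's `and` returns its second operand when the first is truthy,
# so only the b == b comparison survives as the "join condition".
cond = a_match and b_match
print(cond.expr)  # (t1.b = t2.b)

# The intended conjunction must be built explicitly; real pyspark Columns
# use `&`, here we just combine the two expressions by hand.
both = Col("(%s AND %s)" % (a_match.expr, b_match.expr))
print(both.expr)  # ((t1.a = t2.a) AND (t1.b = t2.b))
```

This doesn't change the duplicate-column bug being reported, but it means the example joins on fewer columns than intended.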
[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316940#comment-14316940 ] Don Drake commented on SPARK-5722:
----------------------------------
Hi, I've submitted 2 pull requests, for branch-1.2 and branch-1.3. Please approve.

> Infer_schema_type incorrect for Integers in pyspark
> ---------------------------------------------------
>
>                 Key: SPARK-5722
>                 URL: https://issues.apache.org/jira/browse/SPARK-5722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Don Drake
>
> The Integer datatype in Python does not match what a Scala/Java Integer is defined as. This causes inference of data types and schemas to fail when data is larger than 2^32 and it is incorrectly inferred as an Integer. Since the range of valid Python integers is wider than that of Java Integers, this causes problems when inferring Integer vs. Long datatypes, and will cause failures when attempting to save a SchemaRDD as Parquet or JSON.
>
> Here's an example:
> {code}
> sqlCtx = SQLContext(sc)
> from pyspark.sql import Row
> rdd = sc.parallelize([Row(f1='a', f2=100)])
> srdd = sqlCtx.inferSchema(rdd)
> srdd.schema()
> StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
> {code}
> That number is a LongType in Java, but an Integer in Python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred.
> More tests:
> {code}
> from pyspark.sql import _infer_type
> # OK
> print _infer_type(1)
> IntegerType
> # OK
> print _infer_type(2**31-1)
> IntegerType
> # WRONG
> print _infer_type(2**31)
> IntegerType
> # WRONG
> print _infer_type(2**61)
> IntegerType
> # OK
> print _infer_type(2**71)
> LongType
> {code}
> Java Primitive Types defined: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
> Python Built-in Types: https://docs.python.org/2/library/stdtypes.html#typesnumeric
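The fix the tests above point at can be captured directly: widen to LongType whenever the sampled value falls outside the 32-bit signed range a Java Integer can hold. A minimal sketch of that check (an illustration under these assumptions, not Spark's actual `_infer_type` code; the function name is hypothetical):

```python
# Range of a Java int / Spark SQL IntegerType: 32-bit signed.
JAVA_INT_MIN = -(2**31)
JAVA_INT_MAX = 2**31 - 1

def infer_integral_type(value):
    """Return the Spark SQL type name for a Python int, widening to
    LongType when the value cannot fit in a Java Integer."""
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return "IntegerType"
    return "LongType"

# Mirrors the cases from the bug report: 2**31 and 2**61 must now
# come back as LongType instead of a truncating IntegerType.
for v in (1, 2**31 - 1):
    assert infer_integral_type(v) == "IntegerType"
for v in (2**31, 2**61, 2**71):
    assert infer_integral_type(v) == "LongType"
```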
[jira] [Updated] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Don Drake updated SPARK-5722:
-----------------------------
Description:

The Integer datatype in Python does not match what a Scala/Java Integer is defined as. This causes inference of data types and schemas to fail when data is larger than 2^32 and it is incorrectly inferred as an Integer. Since the range of valid Python integers is wider than that of Java Integers, this causes problems when inferring Integer vs. Long datatypes, and will cause failures when attempting to save a SchemaRDD as Parquet or JSON.

Here's an example:

{code}
sqlCtx = SQLContext(sc)
from pyspark.sql import Row
rdd = sc.parallelize([Row(f1='a', f2=100)])
srdd = sqlCtx.inferSchema(rdd)
srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}

That number is a LongType in Java, but an Integer in Python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred.

More tests:

{code}
from pyspark.sql import _infer_type
# OK
print _infer_type(1)
IntegerType
# OK
print _infer_type(2**31-1)
IntegerType
# WRONG
print _infer_type(2**31)
IntegerType
# WRONG
print _infer_type(2**61)
IntegerType
# OK
print _infer_type(2**71)
LongType
{code}

Java Primitive Types defined: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Python Built-in Types: https://docs.python.org/2/library/stdtypes.html#typesnumeric

> Infer_schema_type incorrect for Integers in pyspark
> ---------------------------------------------------
>
>                 Key: SPARK-5722
>                 URL: https://issues.apache.org/jira/browse/SPARK-5722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Don Drake
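To see concretely why 2**31 is the first value that breaks, the standard-library `struct` module can enforce the same 4-byte signed range that a Java int uses (an illustration, not code from Spark):

```python
import struct

def fits_java_int(n):
    """True when n packs into a big-endian 32-bit signed int ('>i'),
    i.e. the range a Java int can represent."""
    try:
        struct.pack('>i', n)
        return True
    except struct.error:
        return False

assert fits_java_int(2**31 - 1)   # largest Java int: 2147483647
assert not fits_java_int(2**31)   # one past it must widen to a long
```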
[jira] [Updated] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Don Drake updated SPARK-5722:
-----------------------------
Summary: Infer_schema_type incorrect for Integers in pyspark (was: Infer_schma_type incorrect for Integers in pyspark)

> Infer_schema_type incorrect for Integers in pyspark
> ---------------------------------------------------
>
>                 Key: SPARK-5722
>                 URL: https://issues.apache.org/jira/browse/SPARK-5722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Don Drake
[jira] [Created] (SPARK-5722) Infer_schma_type incorrect for Integers in pyspark
Don Drake created SPARK-5722:
Summary: Infer_schma_type incorrect for Integers in pyspark
Key: SPARK-5722
URL: https://issues.apache.org/jira/browse/SPARK-5722
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake

The Integer datatype in Python does not match what a Scala/Java Integer is defined as. This causes inference of data types and schemas to fail when data is larger than 2^32 and it is incorrectly inferred as an Integer. Since the range of valid Python integers is wider than that of Java Integers, this causes problems when inferring Integer vs. Long datatypes, and will cause failures when attempting to save a SchemaRDD as Parquet or JSON.

Here's an example:

sqlCtx = SQLContext(sc)
from pyspark.sql import Row
rdd = sc.parallelize([Row(f1='a', f2=100)])
srdd = sqlCtx.inferSchema(rdd)
srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))

That number is a LongType in Java, but an Integer in Python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred.

More tests:

from pyspark.sql import _infer_type
# OK
print _infer_type(1)
IntegerType
# OK
print _infer_type(2**31-1)
IntegerType
# WRONG
print _infer_type(2**31)
IntegerType
# WRONG
print _infer_type(2**61)
IntegerType
# OK
print _infer_type(2**71)
LongType

Java Primitive Types defined: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Python Built-in Types: https://docs.python.org/2/library/stdtypes.html#typesnumeric