[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)

2017-03-24 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940654#comment-15940654
 ] 

Don Drake commented on SPARK-16745:
---

I just came across the same exception running Spark 2.1.0 on my Mac, running a 
few spark-shell commands on a tiny dataset that had previously worked just 
fine. But I never got a result, just the timeout exceptions.

The issue is that today I'm running them on a corporate VPN with proxy settings 
enabled, and the IP address the driver is using is my local (Wi-Fi) address, 
which the proxy server cannot connect to.

This took a while to figure out, but I added {{--conf 
spark.driver.host=127.0.0.1}} to my command line, which forced all networking 
between the driver and executors to stay local (bypassing the proxy server), 
and the query came back in the expected amount of time.
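
For what it's worth, a rough standalone-application equivalent of that 
command-line fix (a sketch only; the app name is made up) is to pin the driver 
host before the session starts, since spark.driver.host must be set before the 
SparkContext is created:

{code}
import org.apache.spark.sql.SparkSession

// Keep driver <-> executor networking on the loopback address so it never
// touches the corporate proxy; same effect as --conf spark.driver.host=127.0.0.1.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("driver-host-workaround")          // illustrative name
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()
{code}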

> Spark job completed however have to wait for 13 mins (data size is small)
> -
>
> Key: SPARK-16745
> URL: https://issues.apache.org/jira/browse/SPARK-16745
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: Mac OS X Yosemite, Terminal, MacBook Air Late 2014
>Reporter: Joe Chong
>Priority: Minor
>
> I submitted a job in the Scala spark-shell to show a DataFrame. The data size 
> is about 43K. The job was successful in the end, but took more than 13 minutes 
> to complete. Upon checking the log, there are multiple exceptions raised on 
> "Failed to check existence of class" with a java.net.ConnectException message 
> indicating a timeout trying to connect to port 52067, the REPL port that Spark 
> set up. Please assist in troubleshooting. Thanks. 
> Started Spark in standalone mode
> $ spark-shell --driver-memory 5g --master local[*]
> 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server
> 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:30 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:52067
> 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class 
> server' on port 52067.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' 
> on port 52072.
> 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/07/26 21:05:35 INFO Remoting: Starting remoting
> 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074]
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 52074.
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at 
> /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6
> 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity 
> 3.4 GB
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:36 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:4040
> 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on 
> port 4040.
> 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at 
> http://10.199.29.218:4040
> 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host 
> localhost
> 16/07/26 21:05:36 INFO 

[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-11 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862560#comment-15862560
 ] 

Don Drake commented on SPARK-19477:
---

How does laziness apply here? If I read/create a dataframe with extra columns, 
then do ds = df.as[XYZ], then immediately ds.write.parquet("file"), the write 
should trigger any lazy evaluation, if I understand this correctly.

Do you have a suggested workaround? I'm currently retrieving the encoder for 
the case class to get the schema, then calling ds.select() on the columns from 
the schema.
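
For reference, a minimal sketch of that workaround, assuming the case class F, 
the Dataset ds, and the {{import spark.implicits._}} from the spark-shell 
snippets in the issue description below:

{code}
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col

// Keep only the columns the encoder knows about, then re-type and write.
val fSchema = Encoders.product[F].schema
val trimmed = ds.select(fSchema.fieldNames.map(col): _*).as[F]
trimmed.write.parquet("file")   // "file" is the illustrative path from the comment above
{code}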

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-07 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake reopened SPARK-19477:
---

I'm struggling with this answer.

I thought the point of Datasets was to have a strongly typed definition, 
rather than the more loosely typed DataFrame.

Why does it matter if I use relational or typed methods to access it?

It works if I call a map() against it:

{code}
scala> ds.map(x => x).take(1)
res7: Array[F] = Array(F(a,b,c))
{code}

But the real problem I'm having is that when I attempt to save the Dataset, the 
schema is ignored:

{code}
scala> ds.write.parquet("a")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.

scala> val ds2 = spark.read.parquet("a").as[F]
ds2: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
fields]

scala> ds2.printSchema
root
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- f3: string (nullable = true)
 |-- c4: string (nullable = true)
{code}

IMHO, the c4 column should not have been saved.
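
Building on the map() observation above, a possible workaround (an untested 
sketch on my part, not something confirmed in this ticket) is to force the 
typed round-trip before writing, so the persisted schema matches F:

{code}
// Assumes ds and the spark-shell session from the snippets above
// (import spark.implicits._ provides the Encoder[F]).
val typedOnly = ds.map(identity)       // deserialize to F and back; schema is f1, f2, f3
typedOnly.write.parquet("a_typed")     // "a_typed" is an illustrative path
{code}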



> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-06 Thread Don Drake (JIRA)
Don Drake created SPARK-19477:
-

 Summary: [SQL] Datasets created from a Dataframe with extra 
columns retain the extra columns
 Key: SPARK-19477
 URL: https://issues.apache.org/jira/browse/SPARK-19477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Don Drake


In 1.6, when you created a Dataset from a Dataframe that had extra columns, the 
columns not in the case class were dropped from the Dataset.

For example in 1.6, the column c4 is gone:

{code}
scala> case class F(f1: String, f2: String, f3:String)
defined class F

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
"j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
string]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]

scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|

{code}

This seems to have changed in Spark 2.0 and also 2.1:

Spark 2.1.0:

{code}
scala> case class F(f1: String, f2: String, f3:String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
"j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, 
f3[0]: string]

scala> fEncoder.schema == ds.schema
res2: Boolean = false

scala> ds.schema
res3: org.apache.spark.sql.types.StructType = 
StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
StructField(f3,StringType,true), StructField(c4,StringType,true))

scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType = 
StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
StructField(f3,StringType,true))


{code}





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-12 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15659978#comment-15659978
 ] 

Don Drake commented on SPARK-18207:
---

Hi, I was able to download a nightly SNAPSHOT release and verify that it 
resolves the issue for my project. Thanks to everyone who contributed to this 
fix and got it merged in a timely manner.

> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.0
>
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures, when I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode not supported).  So, I create a similar dataframe 
> with null values and union them together.  See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. 
>  I was asked by [~lwlin] to open this JIRA.  
> I will attach a code snippet that can be pasted into spark-shell that 
> duplicates my code and the exception.  This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-02 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629177#comment-15629177
 ] 

Don Drake commented on SPARK-18207:
---

The difference between my case and the other test cases is that my scenario 
involves a wide dataframe (800+ columns) that also has multiple nested 
structures (arrays of classes) involved in a SQL query (union).

I have verified that [~lwlin]'s fix does not work for my case, but it does work 
for wide dataframes without nested structures.

I agree it's similar to the others, but more complicated to reproduce.


> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures, when I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode not supported).  So, I create a similar dataframe 
> with null values and union them together.  See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. 
>  I was asked by [~lwlin] to open this JIRA.  
> I will attach a code snippet that can be pasted into spark-shell that 
> duplicates my code and the exception.  This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-02 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629140#comment-15629140
 ] 

Don Drake commented on SPARK-18207:
---

I opened it based on [~lwlin]'s suggestion in the comments of SPARK-16845.  

> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures, when I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode not supported).  So, I create a similar dataframe 
> with null values and union them together.  See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. 
>  I was asked by [~lwlin] to open this JIRA.  
> I will attach a code snippet that can be pasted into spark-shell that 
> duplicates my code and the exception.  This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-11-01 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626664#comment-15626664
 ] 

Don Drake edited comment on SPARK-16845 at 11/2/16 12:32 AM:
-

I've been struggling to duplicate this and finally came up with a strategy that 
duplicates it in a spark-shell.  It's the combination of a wide dataset with 
nested (array) structures and performing a union that seems to trigger it.

I opened SPARK-18207.


was (Author: dondrake):
I've been struggling to duplicate this and finally came up with a strategy that 
duplicates it in a spark-shell.  It's a combination of a wide dataset with 
nested (array) structures and performing a union that seem to trigger it.

I'll open a new JIRA.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-01 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-18207:
--
Description: 
I have 2 wide dataframes that contain nested data structures. When I explode 
one of the dataframes, the result doesn't include records with an empty nested 
structure (outer explode is not supported), so I create a similar dataframe 
with null values and union them together. See SPARK-13721 for more details as 
to why I have to do this.

I was hoping that SPARK-16845 was going to address my issue, but it does not.  
I was asked by [~lwlin] to open this JIRA.  

I will attach a code snippet that can be pasted into spark-shell that 
duplicates my code and the exception.  This worked just fine in Spark 1.6.x.

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 (TID 
812, somehost.mydomain.com, executor 8): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
{code}


  was:
I have 2 wide dataframes that contain nested data structures, when I explode 
one of the dataframes, it doesn't include records with an empty nested 
structure (outer explode not supported).  So, I create a similar dataframe with 
null values and union them together.  See SPARK-13721 for more details as to 
why I have to do this.

I was hoping that SPARK-16845 was going to address my issue, but it does not.  
I was asked by [~lwlin] to open this JIRA.  

I will attach a code snippet that can be pasted into spark-shell that 
duplicates my code and the exception.  This worked just fine in Spark 1.6.x.




> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures, when I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode not supported).  So, I create a similar dataframe 
> with null values and union them together.  See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. 
>  I was asked by [~lwlin] to open this JIRA.  
> I will attach a code snippet that can be pasted into spark-shell that 
> duplicates my code and the exception.  This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-01 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-18207:
--
Attachment: spark-18207.txt

Please read the comments at the top of the attachment; you need to :paste 
portions of it into spark-shell.



> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures, when I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode not supported).  So, I create a similar dataframe 
> with null values and union them together.  See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. 
>  I was asked by [~lwlin] to open this JIRA.  
> I will attach a code snippet that can be pasted into spark-shell that 
> duplicates my code and the exception.  This worked just fine in Spark 1.6.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-01 Thread Don Drake (JIRA)
Don Drake created SPARK-18207:
-

 Summary: class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
 Key: SPARK-18207
 URL: https://issues.apache.org/jira/browse/SPARK-18207
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.0.1, 2.1.0
Reporter: Don Drake


I have 2 wide dataframes that contain nested data structures. When I explode 
one of the dataframes, the result doesn't include records with an empty nested 
structure (outer explode is not supported), so I create a similar dataframe 
with null values and union them together. See SPARK-13721 for more details as 
to why I have to do this.
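
For illustration only (not part of the original report), a hypothetical sketch 
of that explode-plus-union pattern with made-up names, assuming a spark-shell 
session with {{import spark.implicits._}}:

{code}
import org.apache.spark.sql.functions._

case class Item(a: String, b: String)
case class Rec(id: String, items: Seq[Item])

val df = Seq(
  Rec("r1", Seq(Item("x", "y"))),
  Rec("r2", Seq.empty)                                  // dropped by a plain explode
).toDF

val exploded = df.select(col("id"), explode(col("items")).as("item"))
val missing  = df.filter(size(col("items")) === 0)
  .select(col("id"), lit(null).cast(exploded.schema("item").dataType).as("item"))

val combined = exploded.union(missing)                  // keeps r2 with a null item
{code}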

I was hoping that SPARK-16845 was going to address my issue, but it does not.  
I was asked by [~lwlin] to open this JIRA.  

I will attach a code snippet that can be pasted into spark-shell that 
duplicates my code and the exception.  This worked just fine in Spark 1.6.x.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-11-01 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626664#comment-15626664
 ] 

Don Drake commented on SPARK-16845:
---

I've been struggling to duplicate this and finally came up with a strategy that 
duplicates it in a spark-shell.  It's the combination of a wide dataset with 
nested (array) structures and performing a union that seems to trigger it.

I'll open a new JIRA.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-27 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-16845:
--
Attachment: error.txt.zip

Does this generated code help in resolving this?

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-27 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614100#comment-15614100
 ] 

Don Drake commented on SPARK-16845:
---

I'm struggling to get a simple case created. 

I'm curious, though: if I compile my .jar file using sbt against Spark 2.0.1 
but use your compiled branch of Spark 2.1.0-SNAPSHOT as the runtime 
(spark-submit), would you expect it to work?
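
(For context on that setup, a hypothetical build.sbt fragment, not taken from 
this ticket: compiling against 2.0.1 while marking Spark as provided, so the 
2.1.0-SNAPSHOT runtime supplied by spark-submit is the one actually used.)

{code}
// Hypothetical build.sbt sketch; versions/modules are illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.0.1" % "provided"
)
{code}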

When using your compiled branch of Spark 2.1.0-SNAPSHOT and running 
spark-shell, the test cases provided in this JIRA pass. But my code fails.

Also, the error message says "grows beyond 64 KB" as the compiler error, but 
the generated output is over 400 KB of source code. I'll try to attach the 
exact error message and the generated Java code.


> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-22 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598442#comment-15598442
 ] 

Don Drake commented on SPARK-16845:
---

Update: 

It turns out that I am still getting this exception. I'll try to create a test 
case to duplicate it. Basically, I'm exploding a nested data structure, then 
doing a union, and then saving to Parquet. The resulting table has over 400 
columns.

I verified in spark-shell that the exceptions do not occur with the test cases 
provided.

Can you point me to your other solution? I can see if that works.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-20 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592862#comment-15592862
 ] 

Don Drake commented on SPARK-16845:
---

I compiled your branch and ran my large job, and it finished successfully.

Sorry for the confusion; I wasn't watching the PR, just this JIRA, and wasn't 
aware of the changes you were making.

Can this be merged as well as backported to 2.0.x?

Thanks so much.

-Don


> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-19 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590256#comment-15590256
 ] 

Don Drake commented on SPARK-16845:
---

[~lwlin] I saw your PR, but noticed it's failing some tests. Just curious if 
you will have some time to resolve this. If I can help, please let me know.

-Don

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567358#comment-15567358
 ] 

Don Drake commented on SPARK-16845:
---

I can't at the moment; mine is not simple.

But this JIRA has one: https://issues.apache.org/jira/browse/SPARK-17092

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565592#comment-15565592
 ] 

Don Drake commented on SPARK-16845:
---

Unfortunately, it does not work around it.


16/10/10 18:19:47 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-03 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543555#comment-15543555
 ] 

Don Drake commented on SPARK-16845:
---

I just hit this bug as well.  Are there any suggested workarounds?

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17384) SQL - Running query with outer join from 1.6 fails

2016-09-02 Thread Don Drake (JIRA)
Don Drake created SPARK-17384:
-

 Summary: SQL - Running query with outer join from 1.6 fails
 Key: SPARK-17384
 URL: https://issues.apache.org/jira/browse/SPARK-17384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Don Drake


I have some complex SQL queries (10-table joins) that use outer joins and work 
fine in Spark 1.6.2, but fail under Spark 2.0.  I was able to duplicate the 
problem with a simple test case.

Here's the code for Spark 2.0 that doesn't run (this runs fine in Spark 1.6.2):

{code}
case class C1(f1: String, f2: String, f3: String, f4: String)
case class C2(g1: String, g2: String, g3: String, g4: String)
case class C3(h1: String, h2: String, h3: String, h4: String)

val sqlContext = spark.sqlContext 

val c1 = sc.parallelize(Seq(
  C1("h1", "c1a1", "c1b1", "c1c1"),
  C1("h2", "c1a2", "c1b2", "c1c2"),
  C1(null, "c1a3", "c1b3", "c1c3")
  )).toDF
c1.createOrReplaceTempView("c1")

val c2 = sc.parallelize(Seq(
  C2("h1", "c2a1", "c2b1", "c2c1"),
  C2("h2", "c2a2", "c2b2", "c2c2"),
  C2(null, "c2a3", "c2b3", "c2c3"),
  C2(null, "c2a4", "c2b4", "c2c4"),
  C2("h333", "c2a333", "c2b333", "c2c333")
  )).toDF
c2.createOrReplaceTempView("c2")

val c3 = sc.parallelize(Seq(
  C3("h1", "c3a1", "c3b1", "c3c1"),
  C3("h2", "c3a2", "c3b2", "c3c2"),
  C3(null, "c3a3", "c3b3", "c3c3")
  )).toDF
c3.createOrReplaceTempView("c3")

// doesn't work in Spark 2.0, works in Spark 1.6
val bad_df = sqlContext.sql("""
  select * 
  from c1, c3
  left outer join c2 on (c1.f1 = c2.g1)
  where c1.f1 = c3.h1
""").show()

// works in both
val works_df = sqlContext.sql("""
  select * 
  from c1
  left outer join c2 on (c1.f1 = c2.g1), 
  c3
  where c1.f1 = c3.h1
""").show()
{code}

Here's the output after running bad_df in Spark 2.0:

{code}
scala> val bad_df = sqlContext.sql("""
 |   select *
 |   from c1, c3
 |   left outer join c2 on (c1.f1 = c2.g1)
 |   where c1.f1 = c3.h1
 | """).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`c1.f1`' given input 
columns: [h3, g3, h4, g2, g4, h2, h1, g1]; line 4 pos 25
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 

[jira] [Commented] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()

2016-09-02 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15459490#comment-15459490
 ] 

Don Drake commented on SPARK-13721:
---

My nested structures aren't simple types; they are structs (case classes), so 
this existing method works great for me.

This ticket is about modifying the explode() call to support outer, not about 
adding outer to the DataFrame API.
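
For context, a rough illustration (table and column names are made up) of the 
HiveQL syntax the ticket below refers to; with OUTER, rows whose array is 
empty still appear, with NULL in the exploded column:

{code}
// Assumes a registered table "records" with an array column "items".
val result = spark.sql("""
  SELECT r.id, i.item
  FROM records r
  LATERAL VIEW OUTER explode(r.items) i AS item
""")
result.show()
{code}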

> Add support for LATERAL VIEW OUTER explode()
> 
>
> Key: SPARK-13721
> URL: https://issues.apache.org/jira/browse/SPARK-13721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ian Hellstrom
>
> Hive supports the [LATERAL VIEW 
> OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews]
>  syntax to make sure that when an array is empty, the content from the outer 
> table is still returned. 
> Within Spark, this is currently only possible within the HiveContext and 
> executing HiveQL statements. It would be nice if the standard explode() 
> DataFrame method allows the same. A possible signature would be: 
> {code:scala}
> explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = 
> false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17341) Can't read Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453806#comment-15453806
 ] 

Don Drake commented on SPARK-17341:
---

I just downloaded the nightly build from 8/31/2016 and gave it a try.

And it worked:

{code}
scala> inSquare.take(2)
res2: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])

scala> inSquare.show(false)
+-----+-------------+
|value|squared.value|
+-----+-------------+
|1    |1            |
|2    |4            |
|3    |9            |
|4    |16           |
|5    |25           |
+-----+-------------+
{code}

Thanks.

> Can't read Parquet data with fields containing periods "."
> --
>
> Key: SPARK-17341
> URL: https://issues.apache.org/jira/browse/SPARK-17341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>
> I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have 
> encountered a showstopper problem with Parquet dataset that have fields 
> containing a "." in a field name.  This data comes from an external provider 
> (CSV) and we just pass through the field names.  This has worked flawlessly 
> in Spark 1.5 and 1.6, but now spark can't seem to read these parquet files.  
> {code}
> Spark context available as 'sc' (master = local[*], app id = 
> local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * 
> i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> 

[jira] [Created] (SPARK-17341) Can't read Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake (JIRA)
Don Drake created SPARK-17341:
-

 Summary: Can't read Parquet data with fields containing periods "."
 Key: SPARK-17341
 URL: https://issues.apache.org/jira/browse/SPARK-17341
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Don Drake


I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have 
encountered a showstopper problem with Parquet datasets that have fields 
containing a "." in the field name.  This data comes from an external provider 
(CSV) and we simply pass the field names through.  This has worked flawlessly in 
Spark 1.5 and 1.6, but now Spark can't seem to read these Parquet files.  
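
A minimal workaround sketch, assuming the squaresDF from the spark-shell 
reproduction that follows: rename any dotted columns before writing, so the 
Parquet files never contain "." in their field names. This only sidesteps the 
problem for newly written data; it is not a fix for reading existing files.

{code}
// Hypothetical sidestep (not a fix for this ticket): strip "." from column names
// before writing, so Parquet never sees dotted field names.
val sanitized = squaresDF.toDF(squaresDF.columns.map(_.replace(".", "_")): _*)
sanitized.write.parquet("squares_sanitized")
spark.read.parquet("squares_sanitized").show()
{code}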

{code}
Spark context available as 'sc' (master = local[*], app id = 
local-1472664486578).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * 
i)).toDF("value", "squared.value")
16/08/31 12:28:44 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]

scala> squaresDF.take(2)
res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])

scala> squaresDF.write.parquet("squares")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet dictionary page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet dictionary page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Dictionary is on
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet dictionary page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet dictionary page size to 1048576

[jira] [Commented] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()

2016-08-31 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452846#comment-15452846
 ] 

Don Drake commented on SPARK-13721:
---

Spark 2.0 has deprecated this function; what workarounds are suggested?
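
One hedged sketch, assuming Spark 2.2 or later, where 
org.apache.spark.sql.functions gained an explode_outer function; on earlier 2.x 
releases the HiveQL LATERAL VIEW OUTER route remains the fallback:

{code}
// Hedged sketch, assuming Spark 2.2+ (explode_outer in org.apache.spark.sql.functions).
// Rows whose array is empty or null are kept with a null element, matching
// LATERAL VIEW OUTER semantics.
import spark.implicits._
import org.apache.spark.sql.functions.{col, explode_outer}

val df = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "items")
df.select(col("id"), explode_outer(col("items")).as("item")).show()
{code}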

> Add support for LATERAL VIEW OUTER explode()
> 
>
> Key: SPARK-13721
> URL: https://issues.apache.org/jira/browse/SPARK-13721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ian Hellstrom
>
> Hive supports the [LATERAL VIEW 
> OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews]
>  syntax to make sure that when an array is empty, the content from the outer 
> table is still returned. 
> Within Spark, this is currently only possible within the HiveContext and 
> executing HiveQL statements. It would be nice if the standard explode() 
> DataFrame method allows the same. A possible signature would be: 
> {code:scala}
> explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = 
> false)
> {code}






[jira] [Commented] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)

2016-05-22 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295512#comment-15295512
 ] 

Don Drake commented on SPARK-15467:
---

Vishnu, the 22-field limitation is with Scala 2.10.x; Spark 2.0 uses Scala 
2.11.x, which raises the limit to 254 fields. 
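
For records wider than that, a schema built programmatically with StructType 
avoids the case-class arity ceiling altogether. A minimal sketch (generic column 
names assumed; this yields an untyped DataFrame rather than a typed Dataset):

{code}
// Minimal sketch: build a 250-column schema with StructType instead of a
// 250-field case class, sidestepping Scala's case-class arity limits.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType((0 until 250).map(i => StructField(s"f$i", StringType, nullable = true)))
val rows = spark.sparkContext.parallelize(Seq(Row.fromSeq(Seq.fill(250)(""))))
val wideDF = spark.createDataFrame(rows, schema)
wideDF.select("f0", "f249").show()
{code}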

> Getting stack overflow when attempting to query a wide Dataset (>200 fields)
> 
>
> Key: SPARK-15467
> URL: https://issues.apache.org/jira/browse/SPARK-15467
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>
> This can be duplicated in a spark-shell; I am running Spark 2.0.0-preview.
> {code}
> import spark.implicits._
> case class Wide(
> val f0:String = "",
> val f1:String = "",
> val f2:String = "",
> val f3:String = "",
> val f4:String = "",
> val f5:String = "",
> val f6:String = "",
> val f7:String = "",
> val f8:String = "",
> val f9:String = "",
> val f10:String = "",
> val f11:String = "",
> val f12:String = "",
> val f13:String = "",
> val f14:String = "",
> val f15:String = "",
> val f16:String = "",
> val f17:String = "",
> val f18:String = "",
> val f19:String = "",
> val f20:String = "",
> val f21:String = "",
> val f22:String = "",
> val f23:String = "",
> val f24:String = "",
> val f25:String = "",
> val f26:String = "",
> val f27:String = "",
> val f28:String = "",
> val f29:String = "",
> val f30:String = "",
> val f31:String = "",
> val f32:String = "",
> val f33:String = "",
> val f34:String = "",
> val f35:String = "",
> val f36:String = "",
> val f37:String = "",
> val f38:String = "",
> val f39:String = "",
> val f40:String = "",
> val f41:String = "",
> val f42:String = "",
> val f43:String = "",
> val f44:String = "",
> val f45:String = "",
> val f46:String = "",
> val f47:String = "",
> val f48:String = "",
> val f49:String = "",
> val f50:String = "",
> val f51:String = "",
> val f52:String = "",
> val f53:String = "",
> val f54:String = "",
> val f55:String = "",
> val f56:String = "",
> val f57:String = "",
> val f58:String = "",
> val f59:String = "",
> val f60:String = "",
> val f61:String = "",
> val f62:String = "",
> val f63:String = "",
> val f64:String = "",
> val f65:String = "",
> val f66:String = "",
> val f67:String = "",
> val f68:String = "",
> val f69:String = "",
> val f70:String = "",
> val f71:String = "",
> val f72:String = "",
> val f73:String = "",
> val f74:String = "",
> val f75:String = "",
> val f76:String = "",
> val f77:String = "",
> val f78:String = "",
> val f79:String = "",
> val f80:String = "",
> val f81:String = "",
> val f82:String = "",
> val f83:String = "",
> val f84:String = "",
> val f85:String = "",
> val f86:String = "",
> val f87:String = "",
> val f88:String = "",
> val f89:String = "",
> val f90:String = "",
> val f91:String = "",
> val f92:String = "",
> val f93:String = "",
> val f94:String = "",
> val f95:String = "",
> val f96:String = "",
> val f97:String = "",
> val f98:String = "",
> val f99:String = "",
> val f100:String = "",
> val f101:String = "",
> val f102:String = "",
> val f103:String = "",
> val f104:String = "",
> val f105:String = "",
> val f106:String = "",
> val f107:String = "",
> val f108:String = "",
> val f109:String = "",
> val f110:String = "",
> val f111:String = "",
> val f112:String = "",
> val f113:String = "",
> val f114:String = "",
> val f115:String = "",
> val f116:String = "",
> val f117:String = "",
> val f118:String = "",
> val f119:String = "",
> val f120:String = "",
> val f121:String = "",
> val f122:String = "",
> val f123:String = "",
> val f124:String = "",
> val f125:String = "",
> val f126:String = "",
> val f127:String = "",
> val f128:String = "",
> val f129:String = "",
> val f130:String = "",
> val f131:String = "",
> val f132:String = "",
> val f133:String = "",
> val f134:String = "",
> val f135:String = "",
> val f136:String = "",
> val f137:String = "",
> val f138:String = "",
> val f139:String = "",
> val f140:String = "",
> val f141:String = "",
> val f142:String = "",
> val f143:String = "",
> val f144:String = "",
> val f145:String = "",
> val f146:String = "",
> val f147:String = "",
> val f148:String = "",
> val f149:String = "",
> val f150:String = "",
> val f151:String = "",
> val f152:String = "",
> val f153:String = "",
> val f154:String = "",
> val f155:String = "",
> val f156:String = "",
> val f157:String = "",
> val f158:String = "",
> val f159:String = "",
> val f160:String = "",
> val f161:String = "",
> val f162:String = "",
> val f163:String = "",
> val f164:String = "",
> val f165:String = "",
> val f166:String = "",
> val f167:String = "",
> val f168:String = "",
> val f169:String = "",
> val f170:String = "",
> val f171:String = "",
> val f172:String = "",
> val f173:String = "",
> val f174:String = "",
> 

[jira] [Created] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)

2016-05-21 Thread Don Drake (JIRA)
Don Drake created SPARK-15467:
-

 Summary: Getting stack overflow when attempting to query a wide 
Dataset (>200 fields)
 Key: SPARK-15467
 URL: https://issues.apache.org/jira/browse/SPARK-15467
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Don Drake


This can be duplicated in a spark-shell; I am running Spark 2.0.0-preview.

{code}
import spark.implicits._


case class Wide(
val f0:String = "",
val f1:String = "",
val f2:String = "",
val f3:String = "",
val f4:String = "",
val f5:String = "",
val f6:String = "",
val f7:String = "",
val f8:String = "",
val f9:String = "",
val f10:String = "",
val f11:String = "",
val f12:String = "",
val f13:String = "",
val f14:String = "",
val f15:String = "",
val f16:String = "",
val f17:String = "",
val f18:String = "",
val f19:String = "",
val f20:String = "",
val f21:String = "",
val f22:String = "",
val f23:String = "",
val f24:String = "",
val f25:String = "",
val f26:String = "",
val f27:String = "",
val f28:String = "",
val f29:String = "",
val f30:String = "",
val f31:String = "",
val f32:String = "",
val f33:String = "",
val f34:String = "",
val f35:String = "",
val f36:String = "",
val f37:String = "",
val f38:String = "",
val f39:String = "",
val f40:String = "",
val f41:String = "",
val f42:String = "",
val f43:String = "",
val f44:String = "",
val f45:String = "",
val f46:String = "",
val f47:String = "",
val f48:String = "",
val f49:String = "",
val f50:String = "",
val f51:String = "",
val f52:String = "",
val f53:String = "",
val f54:String = "",
val f55:String = "",
val f56:String = "",
val f57:String = "",
val f58:String = "",
val f59:String = "",
val f60:String = "",
val f61:String = "",
val f62:String = "",
val f63:String = "",
val f64:String = "",
val f65:String = "",
val f66:String = "",
val f67:String = "",
val f68:String = "",
val f69:String = "",
val f70:String = "",
val f71:String = "",
val f72:String = "",
val f73:String = "",
val f74:String = "",
val f75:String = "",
val f76:String = "",
val f77:String = "",
val f78:String = "",
val f79:String = "",
val f80:String = "",
val f81:String = "",
val f82:String = "",
val f83:String = "",
val f84:String = "",
val f85:String = "",
val f86:String = "",
val f87:String = "",
val f88:String = "",
val f89:String = "",
val f90:String = "",
val f91:String = "",
val f92:String = "",
val f93:String = "",
val f94:String = "",
val f95:String = "",
val f96:String = "",
val f97:String = "",
val f98:String = "",
val f99:String = "",
val f100:String = "",
val f101:String = "",
val f102:String = "",
val f103:String = "",
val f104:String = "",
val f105:String = "",
val f106:String = "",
val f107:String = "",
val f108:String = "",
val f109:String = "",
val f110:String = "",
val f111:String = "",
val f112:String = "",
val f113:String = "",
val f114:String = "",
val f115:String = "",
val f116:String = "",
val f117:String = "",
val f118:String = "",
val f119:String = "",
val f120:String = "",
val f121:String = "",
val f122:String = "",
val f123:String = "",
val f124:String = "",
val f125:String = "",
val f126:String = "",
val f127:String = "",
val f128:String = "",
val f129:String = "",
val f130:String = "",
val f131:String = "",
val f132:String = "",
val f133:String = "",
val f134:String = "",
val f135:String = "",
val f136:String = "",
val f137:String = "",
val f138:String = "",
val f139:String = "",
val f140:String = "",
val f141:String = "",
val f142:String = "",
val f143:String = "",
val f144:String = "",
val f145:String = "",
val f146:String = "",
val f147:String = "",
val f148:String = "",
val f149:String = "",
val f150:String = "",
val f151:String = "",
val f152:String = "",
val f153:String = "",
val f154:String = "",
val f155:String = "",
val f156:String = "",
val f157:String = "",
val f158:String = "",
val f159:String = "",
val f160:String = "",
val f161:String = "",
val f162:String = "",
val f163:String = "",
val f164:String = "",
val f165:String = "",
val f166:String = "",
val f167:String = "",
val f168:String = "",
val f169:String = "",
val f170:String = "",
val f171:String = "",
val f172:String = "",
val f173:String = "",
val f174:String = "",
val f175:String = "",
val f176:String = "",
val f177:String = "",
val f178:String = "",
val f179:String = "",
val f180:String = "",
val f181:String = "",
val f182:String = "",
val f183:String = "",
val f184:String = "",
val f185:String = "",
val f186:String = "",
val f187:String = "",
val f188:String = "",
val f189:String = "",
val f190:String = "",
val f191:String = "",
val f192:String = "",
val f193:String = "",
val f194:String = "",
val f195:String = "",
val f196:String = "",
val f197:String = "",
val f198:String = "",
val f199:String = "",
val f200:String = "",
val f201:String = "",
val f202:String = "",
val f203:String = "",
val f204:String = "",
val f205:String = "",
val f206:String = "",
val 

[jira] [Commented] (SPARK-11085) Add support for HTTP proxy

2015-10-13 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954989#comment-14954989
 ] 

Don Drake commented on SPARK-11085:
---

Neither of the options works.

> Add support for HTTP proxy 
> ---
>
> Key: SPARK-11085
> URL: https://issues.apache.org/jira/browse/SPARK-11085
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> Add a way to update ivysettings.xml for the spark-shell and spark-submit to 
> support proxy settings for clusters that need to access a remote repository 
> through an http proxy.  Typically this would be done like:
> JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080"
> Directly in the ivysettings.xml would look like:
>  <setproxy proxyhost="proxy.host" proxyport="8080" nonproxyhosts="nonproxy.host"/> 
>  
> Even better would be a way to customize the ivysettings.xml with command 
> options.  






[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON

2015-09-09 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736986#comment-14736986
 ] 

Don Drake commented on SPARK-10441:
---

Got it, thanks for the clarification.

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>







[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON

2015-09-08 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735690#comment-14735690
 ] 

Don Drake commented on SPARK-10441:
---

I see that PR 8597 was merged into master.  Does master represent 1.5.1?  I'm 
curious whether this will be part of 1.5.0, as it's blocking me from upgrading 
at the moment.

Thanks.

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>







[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-22 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596001#comment-14596001
 ] 

Don Drake commented on SPARK-8368:
--

I've verified through a nightly build that this resolves my issue (SPARK-8365). 
 Thanks!

 ClassNotFoundException in closure for map 
 --

 Key: SPARK-8368
 URL: https://issues.apache.org/jira/browse/SPARK-8368
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the 
 project on Windows 7 and run in a spark standalone cluster(or local) mode on 
 Centos 6.X. 
Reporter: CHEN Zhiwei
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.4.1, 1.5.0


 After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the 
 following exception:
 ==begin exception
 {quote}
 Exception in thread main java.lang.ClassNotFoundException: 
 com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:278)
   at 
 org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
   at org.apache.spark.rdd.RDD.map(RDD.scala:293)
   at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
   at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
   at com.yhd.ycache.magic.Model.main(SSExample.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {quote}
 ===end exception===
 I simplify the code that cause this issue, as following:
 ==begin code==
 {noformat}
 object Model extends Serializable {
   def main(args: Array[String]) {
     val Array(sql) = args
     val sparkConf = new SparkConf().setAppName("Mode Example")
     val sc = new SparkContext(sparkConf)
     val hive = new HiveContext(sc)
     // get data by hive sql
     val rows = hive.sql(sql)
     val data = rows.map(r => {
       val arr = r.toSeq.toArray
       val label = 1.0
       def fmap = (input: Any) => 1.0
       val feature = arr.map(_ => 1.0)
       LabeledPoint(label, Vectors.dense(feature))
     })
     data.count()
   }
 }
 {noformat}
 =end code===
 This code can run pretty well on spark-shell, but error when submit it to 
 spark cluster (standalone or local mode).  I try the same code on spark 
 1.3.0(local mode), and no exception is encountered.




[jira] [Commented] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0

2015-06-17 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590290#comment-14590290
 ] 

Don Drake commented on SPARK-8365:
--

Is there a workaround that you are aware of?

 pyspark does not retain --packages or --jars passed on the command line as of 
 1.4.0
 ---

 Key: SPARK-8365
 URL: https://issues.apache.org/jira/browse/SPARK-8365
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Don Drake
Priority: Blocker

 I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing 
 Python Spark application against it and got the following error:
 py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
 : java.lang.RuntimeException: Failed to load class for data source: 
 com.databricks.spark.csv
 I pass the following on the command-line to my spark-submit:
 --packages com.databricks:spark-csv_2.10:1.0.3
 This worked fine on 1.3.1, but not in 1.4.
 I was able to replicate it with the following pyspark:
 {code}
 a = {'a':1.0, 'b':'asdf'}
 rdd = sc.parallelize([a])
 df = sqlContext.createDataFrame(rdd)
 df.save("/tmp/d.csv", "com.databricks.spark.csv")
 {code}
 Even using the new 
 df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same 
 error. 
 I see it was added in the web UI:
 file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
 By User
 file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar   Added 
 By User
 http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar  Added 
 By User
 http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar   Added 
 By User
 Thoughts?
 *I also attempted using the Scala spark-shell to load a csv using the same 
 package and it worked just fine, so this seems specific to pyspark.*
 -Don
 Gory details:
 {code}
 $ pyspark --packages com.databricks:spark-csv_2.10:1.0.3
 Python 2.7.6 (default, Sep  9 2014, 15:04:36)
 [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 Ivy Default Cache set to: /Users/drake/.ivy2/cache
 The jars for the packages stored in: /Users/drake/.ivy2/jars
 :: loading settings :: url = 
 jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
 com.databricks#spark-csv_2.10 added as a dependency
 :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
   confs: [default]
   found com.databricks#spark-csv_2.10;1.0.3 in central
   found org.apache.commons#commons-csv;1.1 in central
 :: resolution report :: resolve 590ms :: artifacts dl 17ms
   :: modules in use:
   com.databricks#spark-csv_2.10;1.0.3 from central in [default]
   org.apache.commons#commons-csv;1.1 from central in [default]
   -
   |  |modules||   artifacts   |
   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
   -
   |  default |   2   |   0   |   0   |   0   ||   2   |   0   |
   -
 :: retrieving :: org.apache.spark#spark-submit-parent
   confs: [default]
   0 artifacts copied, 2 already retrieved (0kB/15ms)
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
 2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from 
 SCDynamicStore
 15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local 
 resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on 
 interface en0)
 15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
 15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
 15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(drake); users 
 with modify permissions: Set(drake)
 15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
 15/06/13 11:06:10 INFO Remoting: Starting remoting
 15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@10.0.0.222:56870]
 15/06/13 11:06:10 INFO Utils: Successfully started service 'sparkDriver' on 
 port 

[jira] [Created] (SPARK-8365) pyspark does not retain --packages or --jars passed on the command line as of 1.4.0

2015-06-14 Thread Don Drake (JIRA)
Don Drake created SPARK-8365:


 Summary: pyspark does not retain --packages or --jars passed on 
the command line as of 1.4.0
 Key: SPARK-8365
 URL: https://issues.apache.org/jira/browse/SPARK-8365
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Don Drake


I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing 
Python Spark application against it and got the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
: java.lang.RuntimeException: Failed to load class for data source: 
com.databricks.spark.csv

I pass the following on the command-line to my spark-submit:
--packages com.databricks:spark-csv_2.10:1.0.3

This worked fine on 1.3.1, but not in 1.4.

I was able to replicate it with the following pyspark:

{code}
a = {'a':1.0, 'b':'asdf'}
rdd = sc.parallelize([a])
df = sqlContext.createDataFrame(rdd)
df.save("/tmp/d.csv", "com.databricks.spark.csv")
{code}

Even using the new 
df.write.format('com.databricks.spark.csv').save('/tmp/d.csv') gives the same 
error. 

I see it was added in the web UI:
file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar   Added 
By User
file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar    Added 
By User
http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar   Added 
By User
http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar    Added 
By User

Thoughts?

*I also attempted using the Scala spark-shell to load a csv using the same 
package and it worked just fine, so this seems specific to pyspark.*

-Don



Gory details:
{code}
$ pyspark --packages com.databricks:spark-csv_2.10:1.0.3
Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /Users/drake/.ivy2/cache
The jars for the packages stored in: /Users/drake/.ivy2/jars
:: loading settings :: url = 
jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central
:: resolution report :: resolve 590ms :: artifacts dl 17ms
:: modules in use:
com.databricks#spark-csv_2.10;1.0.3 from central in [default]
org.apache.commons#commons-csv;1.1 from central in [default]
-
|  |modules||   artifacts   |
|   conf   | number| search|dwnlded|evicted|| number|dwnlded|
-
|  default |   2   |   0   |   0   |   0   ||   2   |   0   |
-
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/15ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from 
SCDynamicStore
15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local resolves 
to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on interface en0)
15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(drake); users with 
modify permissions: Set(drake)
15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
15/06/13 11:06:10 INFO Remoting: Starting remoting
15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@10.0.0.222:56870]
15/06/13 11:06:10 INFO Utils: Successfully started service 'sparkDriver' on 
port 56870.
15/06/13 11:06:10 INFO SparkEnv: Registering MapOutputTracker
15/06/13 11:06:10 INFO SparkEnv: Registering BlockManagerMaster
15/06/13 11:06:10 INFO DiskBlockManager: Created local directory at 
/private/var/folders/7_/k5h82ws97b95v5f5h8wf9j0hgn/T/spark-f36f39f5-7f82-42e0-b3e0-9eb1e1cc0816/blockmgr-a1412b71-fe56-429c-a193-ce3fb95d2ffd
15/06/13 

[jira] [Created] (SPARK-7781) GradientBoostedTrees.trainRegressor is missing maxBins parameter in pyspark

2015-05-20 Thread Don Drake (JIRA)
Don Drake created SPARK-7781:


 Summary: GradientBoostedTrees.trainRegressor is missing maxBins 
parameter in pyspark
 Key: SPARK-7781
 URL: https://issues.apache.org/jira/browse/SPARK-7781
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Don Drake


I'm running Spark v1.3.1 and when I run the following against my dataset:

{code}
model = GradientBoostedTrees.trainRegressor(trainingData, 
categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)

The job will fail with the following message:
Traceback (most recent call last):
  File /Users/drake/fd/spark/mltest.py, line 73, in module
model = GradientBoostedTrees.trainRegressor(trainingData, 
categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py, 
line 553, in trainRegressor
loss, numIterations, learningRate, maxDepth)
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py, 
line 438, in _train
loss, numIterations, learningRate, maxDepth)
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py, 
line 120, in callMLlibFunc
return callJavaFunc(sc, api, *args)
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py, 
line 113, in callJavaFunc
return _java2py(sc, func(*args))
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
 line 538, in __call__
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
 line 300, in get_return_value
15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
py4j.protocol.Py4JJavaError: An error occurred while calling 
o69.trainGradientBoostedTreesModel.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree requires 
maxBins (= 32) >= max categories in categorical features (= 1895)
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
at 
org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
at 
org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
at 
org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
at 
org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)
{code}

So, it's complaining about maxBins; if I provide maxBins=1900 and re-run it:

{code}
model = GradientBoostedTrees.trainRegressor(trainingData, 
categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)

Traceback (most recent call last):
  File /Users/drake/fd/spark/mltest.py, line 73, in module
model = GradientBoostedTrees.trainRegressor(trainingData, 
categoricalFeaturesInfo=catF
eatures, maxDepth=6, numIterations=3, maxBins=1900)
TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'
{code}

It now says it knows nothing of maxBins.

If I run the same command against DecisionTree or RandomForest (with 
maxBins=1900) it works just fine.

Seems like a bug in GradientBoostedTrees. 
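
The Scala MLlib API does accept maxBins through the boosting strategy, so a 
hedged interim workaround is to run the same training from Scala. A sketch, 
assuming an RDD[LabeledPoint] called trainingData and a Map[Int, Int] called 
catFeatures analogous to the pyspark variables above:

{code}
// Hedged sketch of the Scala MLlib path, which exposes maxBins via the tree strategy.
// trainingData: RDD[LabeledPoint] and catFeatures: Map[Int, Int] are assumed to
// mirror the pyspark variables in this report.
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.maxBins = 1900
boostingStrategy.treeStrategy.categoricalFeaturesInfo = catFeatures
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
{code}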






[jira] [Updated] (SPARK-7182) [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-27 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-7182:
-
Summary: [SQL] Can't remove columns from DataFrame or save DataFrame from a 
join due to duplicate columns  (was: [SQL] Can't remove or save DataFrame from 
a join due to duplicate columns)

 [SQL] Can't remove columns from DataFrame or save DataFrame from a join due 
 to duplicate columns
 

 Key: SPARK-7182
 URL: https://issues.apache.org/jira/browse/SPARK-7182
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Don Drake

 I'm having trouble saving a dataframe as parquet after performing a simple 
 table join.
 Below is a trivial example that demonstrates the issue.
 The following is from a pyspark session:
 {code}
 d1=[{'a':1, 'b':2, 'c':3}]
 d2=[{'a':1, 'b':2, 'd':4}]
 t1 = sqlContext.createDataFrame(d1)
 t2 = sqlContext.createDataFrame(d2)
 j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)
  j
 DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]
 {code}
 Try to get a unique list of the columns:
 {code}
 u = sorted(list(set(j.columns)))
  nt = j.select(*u)
 Traceback (most recent call last):
   File stdin, line 1, in module
   File 
 /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py,
  lin
 e 586, in select
 jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
   File 
 /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
 java_gateway.py, line 538, in __call__
   File 
 /Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
 protocol.py, line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
 : org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could 
 be: a#0L, a#3L
 .;
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:2
 29)
 {code}
 That didn't work. Saving the file works, but reading it back in fails:
 {code}
 j.saveAsParquetFile('j')
  z = sqlContext.parquetFile('j')
  z.take(1)
 ...
 : An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 
 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 
 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not 
 read value at 0 in block -1 in file 
 file:/Users/drake/fd/spark/j/part-r-00172.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
 {code}
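
A hedged workaround sketch, written in Scala here: alias each side before the 
join and project the columns explicitly, so duplicate names never reach the 
saved schema. t1 and t2 are assumed to mirror the pyspark frames above (Spark 
1.3.x API). Note in passing that in pyspark, Python's "and" does not combine 
Column expressions; "&" with parentheses is needed for both join keys to apply.

{code}
// Hedged sketch: alias both sides so the duplicated join keys can be referenced
// unambiguously and projected away before saving. t1/t2 mirror the frames in this report.
import org.apache.spark.sql.functions.col

val joined = t1.as("l").join(t2.as("r"),
  col("l.a") === col("r.a") && col("l.b") === col("r.b"))
val deduped = joined.select(col("l.a").as("a"), col("l.b").as("b"),
  col("l.c").as("c"), col("r.d").as("d"))
deduped.saveAsParquetFile("j_dedup")
{code}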






[jira] [Updated] (SPARK-7182) [SQL] Can't remove or save DataFrame from a join due to duplicate columns

2015-04-27 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-7182:
-
Description: 
I'm having trouble saving a dataframe as parquet after performing a simple 
table join.

Below is a trivial example that demonstrates the issue.


The following is from a pyspark session:

{code}
d1=[{'a':1, 'b':2, 'c':3}]
d2=[{'a':1, 'b':2, 'd':4}]

t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)

j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)

 j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]


{code}

Try to get a unique list of the columns:
{code}
u = sorted(list(set(j.columns)))

 nt = j.select(*u)
Traceback (most recent call last):
  File stdin, line 1, in module
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py, 
lin
e 586, in select
jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
java_gateway.py, line 538, in __call__
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
protocol.py, line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#0L, a#3L
.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:2
29)

{code}

That didn't work. Saving the file works, but reading it back in fails:
{code}
j.saveAsParquetFile('j')

 z = sqlContext.parquetFile('j')
 z.take(1)
...
: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 
in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 
104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read 
value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}

  was:

I'm having trouble saving a dataframe as parquet after performing a simple 
table join.

Below is a trivial example that demonstrates the issue.


The following is from a pyspark session:

{code}
d1=[{'a':1, 'b':2, 'c':3}]
d2=[{'a':1, 'b':2, 'd':4}]

t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)

j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)

 j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]



u = sorted(list(set(j.columns)))

 nt = j.select(*u)
Traceback (most recent call last):
  File stdin, line 1, in module
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py, 
lin
e 586, in select
jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
java_gateway.py, line 538, in __call__
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
protocol.py, line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#0L, a#3L
.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:2
29)

j.saveAsParquetFile('j')

 z = sqlContext.parquetFile('j')
 z.take(1)
...
: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 
in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 
104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read 
value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}


 [SQL] Can't remove or save DataFrame from a join due to duplicate columns
 -

 Key: SPARK-7182
 URL: https://issues.apache.org/jira/browse/SPARK-7182
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Don Drake

 I'm having trouble saving a dataframe as parquet after performing a simple 
 table join.
 Below is a trivial example that demonstrates the issue.
 The following is from a pyspark session:
 {code}
 d1=[{'a':1, 'b':2, 'c':3}]
 d2=[{'a':1, 'b':2, 'd':4}]
 t1 = sqlContext.createDataFrame(d1)
 t2 = sqlContext.createDataFrame(d2)
 j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)
  j
 DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]
 {code}
 Try to get a 

[jira] [Created] (SPARK-7182) [SQL] Can't remove or save DataFrame from a join due to duplicate columns

2015-04-27 Thread Don Drake (JIRA)
Don Drake created SPARK-7182:


 Summary: [SQL] Can't remove or save DataFrame from a join due to 
duplicate columns
 Key: SPARK-7182
 URL: https://issues.apache.org/jira/browse/SPARK-7182
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Don Drake



I'm having trouble saving a dataframe as parquet after performing a simple 
table join.

Below is a trivial example that demonstrates the issue.


The following is from a pyspark session:

{code}
d1=[{'a':1, 'b':2, 'c':3}]
d2=[{'a':1, 'b':2, 'd':4}]

t1 = sqlContext.createDataFrame(d1)
t2 = sqlContext.createDataFrame(d2)

j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)

 j
DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]



u = sorted(list(set(j.columns)))

 nt = j.select(*u)
Traceback (most recent call last):
  File stdin, line 1, in module
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py, 
lin
e 586, in select
jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
java_gateway.py, line 538, in __call__
  File 
/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/
protocol.py, line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
: org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#0L, a#3L
.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:2
29)

j.saveAsParquetFile('j')

 z = sqlContext.parquetFile('j')
 z.take(1)
...
: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 
in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 
104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read 
value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
{code}






[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark

2015-02-11 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316940#comment-14316940
 ] 

Don Drake commented on SPARK-5722:
--

Hi, I've submitted 2 pull requests for branch-1.2 and branch-1.3.

Please approve.

 Infer_schema_type incorrect for Integers in pyspark
 ---

 Key: SPARK-5722
 URL: https://issues.apache.org/jira/browse/SPARK-5722
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake

 The Integers datatype in Python does not match what a Scala/Java integer is 
 defined as.   This causes inference of data types and schemas to fail when 
>  data is larger than 2^31 - 1 (the maximum Java Integer) and it is inferred 
>  incorrectly as an Integer.
 Since the range of valid Python integers is wider than Java Integers, this 
 causes problems when inferring Integer vs. Long datatypes.  This will cause 
 problems when attempting to save SchemaRDD as Parquet or JSON.
 Here's an example:
 {code}
  sqlCtx = SQLContext(sc)
  from pyspark.sql import Row
  rdd = sc.parallelize([Row(f1='a', f2=100)])
  srdd = sqlCtx.inferSchema(rdd)
  srdd.schema()
 StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
 {code}
 That number is a LongType in Java, but an Integer in python.  We need to 
>  check the value to see if it should really be a LongType when an IntegerType 
 is initially inferred.
 More tests:
 {code}
  from pyspark.sql import _infer_type
 # OK
  print _infer_type(1)
 IntegerType
 # OK
  print _infer_type(2**31-1)
 IntegerType
 #WRONG
  print _infer_type(2**31)
 #WRONG
 IntegerType
  print _infer_type(2**61 )
 #OK
 IntegerType
  print _infer_type(2**71 )
 LongType
 {code}
 Java Primitive Types defined:
 http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
 Python Built-in Types:
 https://docs.python.org/2/library/stdtypes.html#typesnumeric






[jira] [Updated] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark

2015-02-10 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-5722:
-
Description: 
The Integers datatype in Python does not match what a Scala/Java integer is 
defined as.   This causes inference of data types and schemas to fail when data 
is larger than 2^31 - 1 (the maximum Java Integer) and it is inferred 
incorrectly as an Integer.

Since the range of valid Python integers is wider than Java Integers, this 
causes problems when inferring Integer vs. Long datatypes.  This will cause 
problems when attempting to save SchemaRDD as Parquet or JSON.

Here's an example:
{code}
 sqlCtx = SQLContext(sc)
 from pyspark.sql import Row
 rdd = sc.parallelize([Row(f1='a', f2=100)])
 srdd = sqlCtx.inferSchema(rdd)
 srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}
That number is a LongType in Java, but an Integer in python.  We need to check 
the value to see if it should really be a LongType when an IntegerType is 
initially inferred.

More tests:
{code}
 from pyspark.sql import _infer_type
# OK
 print _infer_type(1)
IntegerType
# OK
 print _infer_type(2**31-1)
IntegerType
#WRONG
 print _infer_type(2**31)
#WRONG
IntegerType
 print _infer_type(2**61 )
#OK
IntegerType
 print _infer_type(2**71 )
LongType
{code}

Java Primitive Types defined:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

Python Built-in Types:
https://docs.python.org/2/library/stdtypes.html#typesnumeric


  was:

The Integers datatype in Python does not match what a Scala/Java integer is 
defined as.   This causes inference of data types and schemas to fail when data 
is larger than 2^31 - 1 (the maximum Java Integer) and it is inferred 
incorrectly as an Integer.

Since the range of valid Python integers is wider than Java Integers, this 
causes problems when inferring Integer vs. Long datatypes.  This will cause 
problems when attempting to save SchemaRDD as Parquet or JSON.

Here's an example:

 sqlCtx = SQLContext(sc)
 from pyspark.sql import Row
 rdd = sc.parallelize([Row(f1='a', f2=100)])
 srdd = sqlCtx.inferSchema(rdd)
 srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))

That number is a LongType in Java, but an Integer in python.  We need to check 
the value to see if it should really be a LongType when an IntegerType is 
initially inferred.

More tests:
 from pyspark.sql import _infer_type
# OK
 print _infer_type(1)
IntegerType
# OK
 print _infer_type(2**31-1)
IntegerType
#WRONG
 print _infer_type(2**31)
#WRONG
IntegerType
 print _infer_type(2**61 )
#OK
IntegerType
 print _infer_type(2**71 )
LongType

Java Primitive Types defined:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

Python Built-in Types:
https://docs.python.org/2/library/stdtypes.html#typesnumeric



 Infer_schema_type incorrect for Integers in pyspark
 ---

 Key: SPARK-5722
 URL: https://issues.apache.org/jira/browse/SPARK-5722
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake

 The Integers datatype in Python does not match what a Scala/Java integer is 
 defined as.   This causes inference of data types and schemas to fail when 
>  data is larger than 2^31 - 1 (the maximum Java Integer) and it is inferred 
>  incorrectly as an Integer.
 Since the range of valid Python integers is wider than Java Integers, this 
 causes problems when inferring Integer vs. Long datatypes.  This will cause 
 problems when attempting to save SchemaRDD as Parquet or JSON.
 Here's an example:
 {code}
  sqlCtx = SQLContext(sc)
  from pyspark.sql import Row
  rdd = sc.parallelize([Row(f1='a', f2=100)])
  srdd = sqlCtx.inferSchema(rdd)
  srdd.schema()
 StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
 {code}
 That number is a LongType in Java, but an Integer in python.  We need to 
>  check the value to see if it should really be a LongType when an IntegerType 
 is initially inferred.
 More tests:
 {code}
  from pyspark.sql import _infer_type
 # OK
  print _infer_type(1)
 IntegerType
 # OK
  print _infer_type(2**31-1)
 IntegerType
 #WRONG
  print _infer_type(2**31)
 #WRONG
 IntegerType
  print _infer_type(2**61 )
 #OK
 IntegerType
  print _infer_type(2**71 )
 LongType
 {code}
 Java Primitive Types defined:
 http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
 Python Built-in Types:
 https://docs.python.org/2/library/stdtypes.html#typesnumeric






[jira] [Updated] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark

2015-02-10 Thread Don Drake (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-5722:
-
Summary: Infer_schema_type incorrect for Integers in pyspark  (was: 
Infer_schma_type incorrect for Integers in pyspark)

 Infer_schema_type incorrect for Integers in pyspark
 ---

 Key: SPARK-5722
 URL: https://issues.apache.org/jira/browse/SPARK-5722
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake

 The Integers datatype in Python does not match what a Scala/Java integer is 
 defined as.   This causes inference of data types and schemas to fail when 
>  data is larger than 2^31 - 1 (the maximum Java Integer) and it is inferred 
>  incorrectly as an Integer.
 Since the range of valid Python integers is wider than Java Integers, this 
 causes problems when inferring Integer vs. Long datatypes.  This will cause 
 problems when attempting to save SchemaRDD as Parquet or JSON.
 Here's an example:
  sqlCtx = SQLContext(sc)
  from pyspark.sql import Row
  rdd = sc.parallelize([Row(f1='a', f2=100)])
  srdd = sqlCtx.inferSchema(rdd)
  srdd.schema()
 StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
 That number is a LongType in Java, but an Integer in python.  We need to 
>  check the value to see if it should really be a LongType when an IntegerType 
 is initially inferred.
 More tests:
  from pyspark.sql import _infer_type
 # OK
  print _infer_type(1)
 IntegerType
 # OK
  print _infer_type(2**31-1)
 IntegerType
 #WRONG
  print _infer_type(2**31)
 #WRONG
 IntegerType
  print _infer_type(2**61 )
 #OK
 IntegerType
  print _infer_type(2**71 )
 LongType
 Java Primitive Types defined:
 http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
 Python Built-in Types:
 https://docs.python.org/2/library/stdtypes.html#typesnumeric






[jira] [Created] (SPARK-5722) Infer_schma_type incorrect for Integers in pyspark

2015-02-10 Thread Don Drake (JIRA)
Don Drake created SPARK-5722:


 Summary: Infer_schma_type incorrect for Integers in pyspark
 Key: SPARK-5722
 URL: https://issues.apache.org/jira/browse/SPARK-5722
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake



The Integers datatype in Python does not match what a Scala/Java integer is 
defined as.   This causes inference of data types and schemas to fail when data 
is larger than 2^31 - 1 (the maximum Java Integer) and it is inferred 
incorrectly as an Integer.

Since the range of valid Python integers is wider than Java Integers, this 
causes problems when inferring Integer vs. Long datatypes.  This will cause 
problems when attempting to save SchemaRDD as Parquet or JSON.

Here's an example:

 sqlCtx = SQLContext(sc)
 from pyspark.sql import Row
 rdd = sc.parallelize([Row(f1='a', f2=100)])
 srdd = sqlCtx.inferSchema(rdd)
 srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))

That number is a LongType in Java, but an Integer in python.  We need to check 
the value to see if it should really be a LongType when an IntegerType is 
initially inferred.

More tests:
 from pyspark.sql import _infer_type
# OK
 print _infer_type(1)
IntegerType
# OK
 print _infer_type(2**31-1)
IntegerType
#WRONG
 print _infer_type(2**31)
#WRONG
IntegerType
 print _infer_type(2**61 )
#OK
IntegerType
 print _infer_type(2**71 )
LongType

Java Primitive Types defined:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

Python Built-in Types:
https://docs.python.org/2/library/stdtypes.html#typesnumeric



