[jira] [Created] (SPARK-17286) The fetched data of a shuffle read is stored in a ManagedBuffer; is its underlying data kept in off-heap memory or in a file?
song fengfei created SPARK-17286: Summary: The fetched data of a shuffle read is stored in a ManagedBuffer; is its underlying data kept in off-heap memory or in a file? Key: SPARK-17286 URL: https://issues.apache.org/jira/browse/SPARK-17286 Project: Spark Issue Type: Question Components: Block Manager, Input/Output, Shuffle Reporter: song fengfei The fetched data of a shuffle read is stored in a ManagedBuffer, so is its underlying data kept in off-heap memory or in a file? If it is kept in off-heap memory, is it also managed by Spark's MemoryManager? And if a map output is too big, wouldn't placing it directly in memory easily lead to an OOM? I did not understand this piece and am hoping to get an answer here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
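The OOM concern in the question can be pictured with a plain-Python sketch of a buffer that keeps fetched blocks in memory only up to a threshold and spills the rest to disk. This is purely illustrative: the class, names, and threshold below are hypothetical and do not describe Spark's actual ManagedBuffer or MemoryManager behavior.

```python
import os
import tempfile

class SpillableFetchBuffer:
    """Conceptual sketch (NOT Spark's actual ManagedBuffer): hold fetched
    blocks in memory until a configured limit, then spill them to a file."""

    def __init__(self, memory_limit_bytes):
        self.memory_limit = memory_limit_bytes
        self.in_memory = []        # blocks currently held in memory
        self.bytes_in_memory = 0
        self.spill_file = None     # lazily created on first spill

    def add_block(self, block):
        if self.bytes_in_memory + len(block) <= self.memory_limit:
            self.in_memory.append(block)
            self.bytes_in_memory += len(block)
        else:
            # Over the limit: write this block to disk instead of risking OOM.
            if self.spill_file is None:
                fd, self.spill_file = tempfile.mkstemp()
                os.close(fd)
            with open(self.spill_file, "ab") as f:
                f.write(block)

    def spilled_bytes(self):
        return 0 if self.spill_file is None else os.path.getsize(self.spill_file)

buf = SpillableFetchBuffer(memory_limit_bytes=8)
buf.add_block(b"abcd")  # fits in memory
buf.add_block(b"efgh")  # fills the limit exactly
buf.add_block(b"ijkl")  # over the limit, so it is spilled to the file
```

The point of the sketch is only that a bounded in-memory buffer plus a spill path is the standard way to avoid the "map output too big for memory" failure the reporter is worried about.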
[jira] [Commented] (SPARK-17198) ORC fixed char literal filter does not work
[ https://issues.apache.org/jira/browse/SPARK-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444781#comment-15444781 ] Dongjoon Hyun commented on SPARK-17198: --- Yes. Spark 2.0 improves SQL features greatly. I think that Spark 2.0.1 will add more stability fixes. > ORC fixed char literal filter does not work > --- > > Key: SPARK-17198 > URL: https://issues.apache.org/jira/browse/SPARK-17198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: tuming > > I get a wrong result when I run the following query in Spark SQL: > select * from orc_table where char_col = '5LZS'; > Table orc_table is an ORC-format table. > Column char_col is defined as char(6). > The Hive record reader returns a char(6) string to Spark, and Spark has no > fixed-char type: all fixed-char attributes are converted to String by default. > Meanwhile, the constant literal is parsed as a string Literal, so the equality > comparison never returns true. For instance: '5LZS' == '5LZS  '. > But I get the correct result in Hive using the same data and SQL string, > because Hive appends spaces to such constant literals. Please refer to: > https://issues.apache.org/jira/browse/HIVE-11312 > I found there is no such patch for Spark. >
[jira] [Commented] (SPARK-17198) ORC fixed char literal filter does not work
[ https://issues.apache.org/jira/browse/SPARK-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444755#comment-15444755 ] tuming commented on SPARK-17198: Yes, it works fine on Spark 2.0; I am using Spark 1.5.1. I can get the column types with the "desc" command, and they differ between Spark 2.0 and Spark 1.5.1. Spark 2.0: spark-sql> desc orc_test; col1 string NULL col2 string NULL spark-sql> Spark 1.5.1: spark-sql> desc orc_test; col1 string NULL col2 char(10) NULL I have looked into the source code and found that Spark 1.5.1 invoked the native Hive command to execute the CREATE TABLE SQL (except in CTAS cases). I have no idea whether this changed in Hive or in Spark; the Spark 2.0 parser is much different from 1.5.1's. > ORC fixed char literal filter does not work > --- > > Key: SPARK-17198 > URL: https://issues.apache.org/jira/browse/SPARK-17198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: tuming > > I get a wrong result when I run the following query in Spark SQL: > select * from orc_table where char_col = '5LZS'; > Table orc_table is an ORC-format table. > Column char_col is defined as char(6). > The Hive record reader returns a char(6) string to Spark, and Spark has no > fixed-char type: all fixed-char attributes are converted to String by default. > Meanwhile, the constant literal is parsed as a string Literal, so the equality > comparison never returns true. For instance: '5LZS' == '5LZS  '. > But I get the correct result in Hive using the same data and SQL string, > because Hive appends spaces to such constant literals. Please refer to: > https://issues.apache.org/jira/browse/HIVE-11312 > I found there is no such patch for Spark. >
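The failing comparison described in SPARK-17198 can be reproduced outside Spark in a few lines. This sketch also shows the Hive-style fix from HIVE-11312: pad the literal to the declared char width before comparing.

```python
CHAR_WIDTH = 6  # the column is declared char(6)

# What the Hive record reader hands back: the value blank-padded to width 6.
stored = "5LZS".ljust(CHAR_WIDTH)   # '5LZS  '
# What the query parser produces: an unpadded string literal.
literal = "5LZS"

# Spark 1.5.x behaviour: the padded value never equals the unpadded literal.
naive_match = (stored == literal)

# Hive-style fix (HIVE-11312): pad the literal to the column width first.
hive_style_match = (stored == literal.ljust(CHAR_WIDTH))
```

The first comparison is always false for any non-full-width value, which is exactly why the filter silently returns no rows instead of erroring.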
[jira] [Commented] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"
[ https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444694#comment-15444694 ] Jeff Zhang commented on SPARK-17261: [~dongjoon] spark-shell works well for me. It seems your case is due to something else. > Using HiveContext after re-creating SparkContext in Spark 2.0 throws > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > - > > Key: SPARK-17261 > URL: https://issues.apache.org/jira/browse/SPARK-17261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Amazon AWS EMR 5.0 >Reporter: Rahul Jain > Fix For: 2.0.0 > > > After stopping a SparkSession, recreating it and then using HiveContext in it > throws an error. > Steps to reproduce: > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > spark.stop() > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > The above error occurs only in PySpark, not in spark-shell
[jira] [Assigned] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"
[ https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17261: Assignee: Apache Spark > Using HiveContext after re-creating SparkContext in Spark 2.0 throws > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > - > > Key: SPARK-17261 > URL: https://issues.apache.org/jira/browse/SPARK-17261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Amazon AWS EMR 5.0 >Reporter: Rahul Jain >Assignee: Apache Spark > Fix For: 2.0.0 > > > After stopping a SparkSession, recreating it and then using HiveContext in it > throws an error. > Steps to reproduce: > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > spark.stop() > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > The above error occurs only in PySpark, not in spark-shell
[jira] [Assigned] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"
[ https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17261: Assignee: (was: Apache Spark) > Using HiveContext after re-creating SparkContext in Spark 2.0 throws > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > - > > Key: SPARK-17261 > URL: https://issues.apache.org/jira/browse/SPARK-17261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Amazon AWS EMR 5.0 >Reporter: Rahul Jain > Fix For: 2.0.0 > > > After stopping a SparkSession, recreating it and then using HiveContext in it > throws an error. > Steps to reproduce: > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > spark.stop() > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > The above error occurs only in PySpark, not in spark-shell
[jira] [Commented] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"
[ https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444690#comment-15444690 ] Apache Spark commented on SPARK-17261: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/14857 > Using HiveContext after re-creating SparkContext in Spark 2.0 throws > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > - > > Key: SPARK-17261 > URL: https://issues.apache.org/jira/browse/SPARK-17261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Amazon AWS EMR 5.0 >Reporter: Rahul Jain > Fix For: 2.0.0 > > > After stopping a SparkSession, recreating it and then using HiveContext in it > throws an error. > Steps to reproduce: > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > spark.stop() > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > spark.sql("show databases") > "Java.lang.illegalStateException: Cannot call methods on a stopped > sparkContext" > The above error occurs only in PySpark, not in spark-shell
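The lifecycle failure reported in SPARK-17261 can be mimicked in plain Python without Spark: a getOrCreate-style builder that caches a singleton but never invalidates the cache hands back a stopped instance after stop(). The classes below are hypothetical stand-ins, not the actual PySpark code; the essential fix is the `stopped` check inside `get_or_create`.

```python
class Context:
    """Hypothetical stand-in for a SparkContext-like object."""

    def __init__(self):
        self.stopped = False

    def sql(self, query):
        if self.stopped:
            raise RuntimeError("Cannot call methods on a stopped context")
        return f"ran: {query}"

class Builder:
    _cached = None  # process-wide singleton, as in a getOrCreate pattern

    @classmethod
    def get_or_create(cls):
        # The bug shape: if this check omitted `cls._cached.stopped`, a
        # stopped instance would be returned forever. Checking it rebuilds
        # a fresh context after stop().
        if cls._cached is None or cls._cached.stopped:
            cls._cached = Context()
        return cls._cached

ctx = Builder.get_or_create()
ctx.sql("show databases")
ctx.stopped = True                   # simulate spark.stop()
ctx2 = Builder.get_or_create()       # must return a NEW, usable context
result = ctx2.sql("show databases")
```

Dropping the `stopped` condition from `get_or_create` reproduces the reported "methods on a stopped context" error on the second `sql` call.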
[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100
[ https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444614#comment-15444614 ] Liwei Lin commented on SPARK-17061: --- Oh cool! Thank you for the well-formed reproducer! > Incorrect results returned following a join of two datasets and a map step > where total number of columns >100 > - > > Key: SPARK-17061 > URL: https://issues.apache.org/jira/browse/SPARK-17061 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jamie Hutton >Assignee: Liwei Lin >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > We have hit a consistent bug with datasets that have more than 100 columns. I am raising it as a blocker because Spark returns the WRONG results rather than erroring, leading to data-integrity issues. > I have put together the following test case, which shows the issue (it will run in spark-shell). In this example I join a dataset with many fields onto another dataset. > The join works fine, and if you show the dataset you get the expected result. However, if you run a map step over the dataset, you end up with a strange error where the sequence in the right dataset now contains only the last value. > Whilst this test may seem a rather contrived example, what we are doing here is a very standard analytical pattern. My original code was designed to: > - take a dataset of child records > - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children]) > - join the children onto the parent by parentID: giving ((Parent),(ParentID,Seq[Children])) > - map over the result to give a tuple of (Parent,Seq[Children]) > Notes: > - The issue is resolved by having fewer fields: as soon as we go <= 100, the integrity issue goes away. Try removing one of the fields from BigCaseClass below. > - The issue arises based on the total number of fields in the resulting dataset. Below I have a small case class and a big case class, but two case classes of 50 fields each would give the same issue. > - The issue occurs where the case class being joined on (on the right) has a case-class type. It doesn't occur if you have a Seq[String]. > - If I go back to an RDD for the map step after the join, I can work around the issue, but I lose all the benefits of datasets. > Scala code test case: > case class Name(name: String) > case class SmallCaseClass (joinkey: Integer, names: Seq[Name]) > case class BigCaseClass (field1: Integer, field2: Integer, field3: Integer, field4: Integer, field5: Integer, field6: Integer, field7: Integer, field8: Integer, field9: Integer, field10: Integer, field11: Integer, field12: Integer, field13: Integer, field14: Integer, field15: Integer, field16: Integer, field17: Integer, field18: Integer, field19: Integer, field20: Integer, field21: Integer, field22: Integer, field23: Integer, field24: Integer, field25: Integer, field26: Integer, field27: Integer, field28: Integer, field29: Integer, field30: Integer, field31: Integer, field32: Integer, field33: Integer, field34: Integer, field35: Integer, field36: Integer, field37: Integer, field38: Integer, field39: Integer, field40: Integer, field41: Integer, field42: Integer, field43: Integer, field44: Integer, field45: Integer, field46: Integer, field47: Integer, field48: Integer, field49: Integer, field50: Integer, field51: Integer, field52: Integer, field53: Integer, field54: Integer, field55: Integer, field56: Integer, field57: Integer, field58: Integer, field59: Integer, field60: Integer, field61: Integer, field62: Integer, field63: Integer, field64: Integer, field65: Integer, field66: Integer, field67: Integer, field68: Integer, field69: Integer, field70: Integer, field71: Integer, field72: Integer, field73: Integer, field74: Integer, field75: Integer, field76: Integer, field77: Integer, field78: Integer, field79: Integer, field80: Integer, field81: Integer, field82: Integer, field83: Integer, field84: Integer, field85: Integer, field86: Integer, field87: Integer, field88: Integer, field89: Integer, field90: Integer, field91: Integer, field92: Integer, field93: Integer, field94: Integer, field95: Integer, field96: Integer, field97: Integer, field98: Integer, field99: Integer) > val bigCC = Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99)) > >
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444579#comment-15444579 ] Sun Rui commented on SPARK-13525: - Could you help do some debugging by modifying the R code? The steps are: 1. Modify /R/lib/SparkR/worker/daemon.R, adding some checkpoints like 'cat("xxx", file="/tmp/sparkr-debug")' around some key points: {code} p <- parallel:::mcfork() ... source(script) ... {code} 2. Zip the /R/lib/SparkR directory to overwrite the existing zip file: /R/lib/sparkr.zip 3. Run your Spark application, and check the local file "/tmp/sparkr-debug" on all your Yarn worker nodes (you can limit the number of executors for debugging). Hope this can help find where the code is broken. > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. The head, summary, and filter methods are not overridden by Spark, hence > I need to call them using the `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. 
> Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In FUN(X[[i]], ...) : > Use Petal_Width instead of Petal.Width as column name > > training <- filter(df, df$Species != "setosa") > Error in filter(df, df$Species != "setosa") : > no method for coercing this S4 class to a vector > > training <- SparkR::filter(df, df$Species != "setosa") > > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, > > family = "binomial") > 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) > at java.net.ServerSocket.implAccept(ServerSocket.java:530) > at java.net.ServerSocket.accept(ServerSocket.java:498) > at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Created] (SPARK-17285) ZeroOutPaddingBytes Causing Fatal JVM Error
Dean Chen created SPARK-17285: - Summary: ZeroOutPaddingBytes Causing Fatal JVM Error Key: SPARK-17285 URL: https://issues.apache.org/jira/browse/SPARK-17285 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Dean Chen Log below. Only happens with certain datasets. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f16d55157ac, pid=10860, tid=139687214692096 # # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14) # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 ) # Problematic frame: # J 4968 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 bytes) @ 0x7f16d55157ac [0x7f16d55157a0+0xc] # # Core dump written. Default location: /spark/scratch/app-20160828232538-0049/3/core or core.10860 # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # --- T H R E A D --- Current thread (0x7f0bd41c1000): JavaThread "Executor task launch worker-8" daemon [_thread_in_Java, id=10957, stack(0x7f0b76bfb000,0x7f0b76cfc000)] siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x7f0ca8cf6518 Registers: RAX=0x7f0b1badd608, RBX=0x7f0b1badd608, RCX=0x, RDX=0xc28f5ea0 RSP=0x7f0b76cf8ba0, RBP=0x7f0b76cf8c60, RSI=0x7f0ce6400678, RDI=0x7f16d5007fd4 R8 =0x7f0b76cf8af0, R9 =0x0268, R10=0x7f16ebcf8560, R11=0x7f16d55157a0 R12=0x7f1396e553b0, R13=0x7f0b76cf8bf8, R14=0x7f0b76cf8c78, R15=0x7f0bd41c1000 RIP=0x7f16d55157ac, EFLAGS=0x00010206, CSGSFS=0xd6390033, ERR=0x0006 TRAPNO=0x000e Top of Stack: (sp=0x7f0b76cf8ba0) 0x7f0b76cf8ba0: 0018 0278 0x7f0b76cf8bb0: 7f0b76cf8bb0 7f0b1b7d62e8 0x7f0b76cf8bc0: 7f0b76cf8c00 7f0b1b7d6560 0x7f0b76cf8bd0: 7f0b1b7d6308 0x7f0b76cf8be0: 7f0b76cf8c60 7f16d5007fd4 0x7f0b76cf8bf0: 7f16d5007fd4 0x7f0b76cf8c00: 0358 c28f5ea0 0x7f0b76cf8c10: 0278 7f0ce6400678 0x7f0b76cf8c20: 7f0b76cf8c20 7f0b1b7d89cd 0x7f0b76cf8c30: 
7f0b76cf8c78 7f0b1b7d9ff0 0x7f0b76cf8c40: 7f0b1b7d89f8 0x7f0b76cf8c50: 7f0b76cf8bf8 7f0b76cf8c70 0x7f0b76cf8c60: 7f0b76cf8cc0 7f16d5007fd4 0x7f0b76cf8c70: c28f5c29 7f0ce6400a10 0x7f0b76cf8c80: 7f0b76cf8c80 7f0b1b7d98de 0x7f0b76cf8c90: 7f0b76cf8cf0 7f0b1b7d9ff0 0x7f0b76cf8ca0: 7f0b1b7d9958 0x7f0b76cf8cb0: 7f0b76cf8c70 7f0b76cf8ce0 0x7f0b76cf8cc0: 7f0b76cf8d38 7f16d5007fd4 0x7f0b76cf8cd0: c28f5c30 c28f5c29 0x7f0b76cf8ce0: 7f1396e55a48 0038 0x7f0b76cf8cf0: 7f0ce6400a10 7f0b76cf8cf8 0x7f0b76cf8d00: 7f0b1b483605 7f0b76cf9988 0x7f0b76cf8d10: 7f0b1b483b18 0x7f0b76cf8d20: 7f0b1b4837e0 7f0b76cf8ce0 0x7f0b76cf8d30: 7f0b76cf9988 7f0b76cf99d0 0x7f0b76cf8d40: 7f16d5007fd4 0x7f0b76cf8d50: 0x7f0b76cf8d60: bff0 0x7f0b76cf8d70: 0001 0x7f0b76cf8d80: 0x7f0b76cf8d90: bff0 Instructions: (pc=0x7f16d55157ac) 0x7f16d551578c: 0a 80 11 64 01 f8 12 fe 06 90 0c 64 50 c0 39 c0 0x7f16d551579c: 4d c0 12 c0 89 84 24 00 c0 fe ff 55 48 83 ec 40 0x7f16d55157ac: 48 89 0c 16 48 83 c4 40 5d 85 05 45 29 62 17 c3 0x7f16d55157bc: 90 90 49 8b 87 90 02 00 00 49 ba 00 00 00 00 00 Register to memory mapping: RAX={method} {0x7f0b1badd608} 'putLong' '(Ljava/lang/Object;JJ)V' in 'org/apache/spark/unsafe/Platform' RBX={method} {0x7f0b1badd608} 'putLong' '(Ljava/lang/Object;JJ)V' in 'org/apache/spark/unsafe/Platform' RCX=0x is an unknown value RDX=0xc28f5ea0 is an unknown value RSP=0x7f0b76cf8ba0 is pointing into the stack for thread: 0x7f0bd41c1000 RBP=0x7f0b76cf8c60 is pointing into the stack for thread: 0x7f0bd41c1000 RSI=0x7f0ce6400678 is an oop [B - klass: {type array byte} - length: 856 RDI=0x7f16d5007fd4 is at code_begin+2292 in an Interpreter codelet invoke return entry points [0x7f16d50076e0, 0x7f16d50080c0] 2528 bytes R8 =0x7f0b76cf8af0 is pointing into the stack for thread: 0x7f0bd41c1000 R9 =0x0268 is an unknown value R10=0x7f16ebcf8560: in
[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444279#comment-15444279 ] Apache Spark commented on SPARK-17241: -- User 'keypointt' has created a pull request for this issue: https://github.com/apache/spark/pull/14856 > SparkR spark.glm should have configurable regularization parameter > -- > > Key: SPARK-17241 > URL: https://issues.apache.org/jira/browse/SPARK-17241 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian > > Spark has a configurable L2 regularization parameter for generalized linear > regression. It is very important to have it in SparkR so that users can run > ridge regression.
[jira] [Assigned] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17241: Assignee: (was: Apache Spark) > SparkR spark.glm should have configurable regularization parameter > -- > > Key: SPARK-17241 > URL: https://issues.apache.org/jira/browse/SPARK-17241 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian > > Spark has a configurable L2 regularization parameter for generalized linear > regression. It is very important to have it in SparkR so that users can run > ridge regression.
[jira] [Assigned] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17241: Assignee: Apache Spark > SparkR spark.glm should have configurable regularization parameter > -- > > Key: SPARK-17241 > URL: https://issues.apache.org/jira/browse/SPARK-17241 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian >Assignee: Apache Spark > > Spark has a configurable L2 regularization parameter for generalized linear > regression. It is very important to have it in SparkR so that users can run > ridge regression.
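For context on the feature request: ridge regression is ordinary least squares plus an L2 penalty λ on the weights. For a single feature with no intercept the closed form is w = Σxy / (Σx² + λ), which a plain-Python sketch (not the SparkR spark.glm API) makes concrete:

```python
def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge for one feature, no intercept:
    minimizes sum((y - w*x)^2) + lam * w^2, which gives
    w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # exact relation y = 2x
w_ols = ridge_fit_1d(xs, ys, lam=0.0)      # lam=0 recovers OLS: w = 2.0
w_ridge = ridge_fit_1d(xs, ys, lam=14.0)   # the L2 penalty shrinks w toward 0
```

Exposing λ (the `regParam` knob in Spark's GeneralizedLinearRegression) through SparkR is exactly what lets an R user dial in this shrinkage.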
[jira] [Assigned] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17284: Assignee: (was: Apache Spark) > Remove statistics-related table properties from SHOW CREATE TABLE > - > > Key: SPARK-17284 > URL: https://issues.apache.org/jira/browse/SPARK-17284 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > CREATE TABLE t1 ( > c1 INT COMMENT 'bla', > c2 STRING > ) > LOCATION '$dir' > TBLPROPERTIES ( > 'prop1' = 'value1', > 'prop2' = 'value2' > ) > {noformat} > The output of {{SHOW CREATE TABLE t1}} is > {noformat} > CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75' > TBLPROPERTIES ( > 'rawDataSize' = '-1', > 'numFiles' = '0', > 'transient_lastDdlTime' = '1472424052', > 'totalSize' = '0', > 'prop1' = 'value1', > 'prop2' = 'value2', > 'COLUMN_STATS_ACCURATE' = 'false', > 'numRows' = '-1' > ) > {noformat} > The statistics-related table properties should be skipped by {{SHOW CREATE > TABLE}}, since they could be incorrect for the newly created table. See the Hive > JIRA: https://issues.apache.org/jira/browse/HIVE-13792
[jira] [Commented] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444259#comment-15444259 ] Apache Spark commented on SPARK-17284: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14855 > Remove statistics-related table properties from SHOW CREATE TABLE > - > > Key: SPARK-17284 > URL: https://issues.apache.org/jira/browse/SPARK-17284 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > CREATE TABLE t1 ( > c1 INT COMMENT 'bla', > c2 STRING > ) > LOCATION '$dir' > TBLPROPERTIES ( > 'prop1' = 'value1', > 'prop2' = 'value2' > ) > {noformat} > The output of {{SHOW CREATE TABLE t1}} is > {noformat} > CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75' > TBLPROPERTIES ( > 'rawDataSize' = '-1', > 'numFiles' = '0', > 'transient_lastDdlTime' = '1472424052', > 'totalSize' = '0', > 'prop1' = 'value1', > 'prop2' = 'value2', > 'COLUMN_STATS_ACCURATE' = 'false', > 'numRows' = '-1' > ) > {noformat} > The statistics-related table properties should be skipped by {{SHOW CREATE > TABLE}}, since they could be incorrect for the newly created table. See the Hive > JIRA: https://issues.apache.org/jira/browse/HIVE-13792
[jira] [Assigned] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17284: Assignee: Apache Spark > Remove statistics-related table properties from SHOW CREATE TABLE > - > > Key: SPARK-17284 > URL: https://issues.apache.org/jira/browse/SPARK-17284 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {noformat} > CREATE TABLE t1 ( > c1 INT COMMENT 'bla', > c2 STRING > ) > LOCATION '$dir' > TBLPROPERTIES ( > 'prop1' = 'value1', > 'prop2' = 'value2' > ) > {noformat} > The output of {{SHOW CREATE TABLE t1}} is > {noformat} > CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75' > TBLPROPERTIES ( > 'rawDataSize' = '-1', > 'numFiles' = '0', > 'transient_lastDdlTime' = '1472424052', > 'totalSize' = '0', > 'prop1' = 'value1', > 'prop2' = 'value2', > 'COLUMN_STATS_ACCURATE' = 'false', > 'numRows' = '-1' > ) > {noformat} > The statistics-related table properties should be skipped by {{SHOW CREATE > TABLE}}, since they could be incorrect for the newly created table. See the Hive > JIRA: https://issues.apache.org/jira/browse/HIVE-13792
[jira] [Created] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE
Xiao Li created SPARK-17284: --- Summary: Remove statistics-related table properties from SHOW CREATE TABLE Key: SPARK-17284 URL: https://issues.apache.org/jira/browse/SPARK-17284 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li {noformat} CREATE TABLE t1 ( c1 INT COMMENT 'bla', c2 STRING ) LOCATION '$dir' TBLPROPERTIES ( 'prop1' = 'value1', 'prop2' = 'value2' ) {noformat} The output of {{SHOW CREATE TABLE t1}} is {noformat} CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75' TBLPROPERTIES ( 'rawDataSize' = '-1', 'numFiles' = '0', 'transient_lastDdlTime' = '1472424052', 'totalSize' = '0', 'prop1' = 'value1', 'prop2' = 'value2', 'COLUMN_STATS_ACCURATE' = 'false', 'numRows' = '-1' ) {noformat} The statistics-related table properties should be skipped by {{SHOW CREATE TABLE}}, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
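The proposed fix can be illustrated with a minimal stand-in sketch (plain Python, not Spark's actual implementation): filter the statistics-related keys out of the table-properties map before emitting DDL. The property names come from the example output above.

```python
# Hypothetical sketch: drop Hive statistics properties before emitting
# SHOW CREATE TABLE output. Key names are taken from the example above;
# this is NOT the actual Spark implementation.
STATS_PROPERTIES = {
    "rawDataSize", "numFiles", "totalSize", "numRows",
    "COLUMN_STATS_ACCURATE", "transient_lastDdlTime",
}

def user_visible_properties(tblproperties):
    """Return only the properties the user set explicitly."""
    return {k: v for k, v in tblproperties.items()
            if k not in STATS_PROPERTIES}

props = {
    "rawDataSize": "-1", "numFiles": "0",
    "transient_lastDdlTime": "1472424052", "totalSize": "0",
    "prop1": "value1", "prop2": "value2",
    "COLUMN_STATS_ACCURATE": "false", "numRows": "-1",
}
print(user_visible_properties(props))  # keeps only prop1 and prop2
```

The point of skipping these keys is that values such as `numRows` describe the *old* table's data and would be wrong if carried into a table created from the generated DDL.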
[jira] [Commented] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received
[ https://issues.apache.org/jira/browse/SPARK-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15443975#comment-15443975 ] Apache Spark commented on SPARK-17283: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/14854 > Cancel job in RDD.take() as soon as enough output is received > -- > > Key: SPARK-17283 > URL: https://issues.apache.org/jira/browse/SPARK-17283 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > The current implementation of RDD.take() waits until all partitions of each > job have been computed before checking whether enough rows have been > received. If take() were to perform this check on-the-fly as individual > partitions were completed then it could stop early, offering large speedups > for certain interactive queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received
[ https://issues.apache.org/jira/browse/SPARK-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17283: Assignee: Apache Spark (was: Josh Rosen) > Cancel job in RDD.take() as soon as enough output is received > -- > > Key: SPARK-17283 > URL: https://issues.apache.org/jira/browse/SPARK-17283 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark > > The current implementation of RDD.take() waits until all partitions of each > job have been computed before checking whether enough rows have been > received. If take() were to perform this check on-the-fly as individual > partitions were completed then it could stop early, offering large speedups > for certain interactive queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received
[ https://issues.apache.org/jira/browse/SPARK-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17283: Assignee: Josh Rosen (was: Apache Spark) > Cancel job in RDD.take() as soon as enough output is received > -- > > Key: SPARK-17283 > URL: https://issues.apache.org/jira/browse/SPARK-17283 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > The current implementation of RDD.take() waits until all partitions of each > job have been computed before checking whether enough rows have been > received. If take() were to perform this check on-the-fly as individual > partitions were completed then it could stop early, offering large speedups > for certain interactive queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received
Josh Rosen created SPARK-17283: -- Summary: Cancel job in RDD.take() as soon as enough output is received Key: SPARK-17283 URL: https://issues.apache.org/jira/browse/SPARK-17283 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen The current implementation of RDD.take() waits until all partitions of each job have been computed before checking whether enough rows have been received. If take() were to perform this check on-the-fly as individual partitions were completed then it could stop early, offering large speedups for certain interactive queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
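The improvement described above can be illustrated with a small stand-in sketch (plain Python, not Spark code): instead of waiting for all partitions of a job, check the running row count as each partition's result arrives and stop as soon as enough rows are in hand.

```python
# Illustrative sketch of on-the-fly take(): stop consuming partition
# results as soon as n rows have been collected. This is a plain-Python
# stand-in, not the actual RDD.take() implementation.
def take(partition_results, n):
    """partition_results: iterable yielding each partition's rows
    in completion order."""
    rows = []
    for partition in partition_results:
        rows.extend(partition)
        if len(rows) >= n:
            break  # cancel the job: remaining partitions are never needed
    return rows[:n]

partitions = iter([[1, 2], [3, 4], [5, 6]])
print(take(partitions, 3))   # [1, 2, 3]
print(next(partitions))      # [5, 6] -- the third partition was never consumed
```

The early `break` is where a real implementation would cancel the running Spark job, which is what yields the speedup for interactive queries.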
[jira] [Resolved] (SPARK-17271) Planner adds unnecessary Sort even if child ordering is semantically the same as the required ordering
[ https://issues.apache.org/jira/browse/SPARK-17271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17271. --- Resolution: Fixed Assignee: Tejas Patil Fix Version/s: 2.1.0 > Planner adds un-necessary Sort even if child ordering is semantically same as > required ordering > --- > > Key: SPARK-17271 > URL: https://issues.apache.org/jira/browse/SPARK-17271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.1.0 > > > Found a case when the planner is adding un-needed SORT operation due to bug > in the way comparison for `SortOrder` is done at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253 > `SortOrder` needs to be compared semantically because `Expression` within two > `SortOrder` can be "semantically equal" but not literally equal objects. > eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")` > Expression in required SortOrder: > {code} > AttributeReference( > name = "col1", > dataType = LongType, > nullable = false > ) (exprId = exprId, > qualifier = Some("a") > ) > {code} > Expression in child SortOrder: > {code} > AttributeReference( > name = "col1", > dataType = LongType, > nullable = false > ) (exprId = exprId) > {code} > Notice that the output column has a qualifier but the child attribute does > not but the inherent expression is the same and hence in this case we can say > that the child satisfies the required sort order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
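The fix amounts to comparing sort orders by the semantics of their expressions rather than by object equality. A minimal stand-in sketch (plain Python; the attribute and qualifier values mirror the example in the report, and the class here is a hypothetical simplification of Spark's `AttributeReference`, not the real one):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AttributeReference:
    name: str
    data_type: str
    nullable: bool
    expr_id: int
    qualifier: Optional[str] = None

    def semantic_equals(self, other):
        # The qualifier ("a" vs. none) is cosmetic; two references denote
        # the same attribute if name/type/nullability/exprId all match.
        return (self.name, self.data_type, self.nullable, self.expr_id) == \
               (other.name, other.data_type, other.nullable, other.expr_id)

required = AttributeReference("col1", "long", False, expr_id=7, qualifier="a")
child = AttributeReference("col1", "long", False, expr_id=7)

print(required == child)                 # False: literal equality fails
print(required.semantic_equals(child))   # True: semantically the same column
```

Under literal comparison the planner concludes the child ordering does not satisfy the requirement and inserts a redundant Sort; semantic comparison avoids it.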
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15443779#comment-15443779 ] koert kuipers commented on SPARK-17041: --- note that the docs say: "It is highly discouraged to turn on case sensitive mode." which seems to suggest one shouldn't have files like this. that is not very realistic. i feel like this is a case where a bad idea (sql lack of case sensitivity) is being pushed upon us. if only we could ignore the bad sql ideas a bit more and be inspired by R data.frame or pandas. > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
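The ambiguity Spark reports can be reproduced without Spark: under case-insensitive resolution, two column names that differ only in case fold to the same key. A hypothetical sketch of the resolution step (not Spark's actual resolver):

```python
def resolve(name, columns, case_sensitive=False):
    """Return indices of columns matching `name`.
    More than one match means the reference is ambiguous."""
    if case_sensitive:
        return [i for i, c in enumerate(columns) if c == name]
    return [i for i, c in enumerate(columns) if c.lower() == name.lower()]

columns = ["Output", "output"]
print(resolve("Output", columns, case_sensitive=True))  # [0] -- unambiguous
print(resolve("Output", columns))  # [0, 1] -- ambiguous, AnalysisException in Spark
```

In Spark itself this behavior is controlled by the configuration key `spark.sql.caseSensitive` (false by default in 2.0), e.g. `spark.conf.set("spark.sql.caseSensitive", "true")`, which is the setting the quoted documentation discourages.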
[jira] [Updated] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET
[ https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17282: Issue Type: Sub-task (was: Improvement) Parent: SPARK-16026 > Implement ALTER TABLE UPDATE STATISTICS SET > --- > > Key: SPARK-17282 > URL: https://issues.apache.org/jira/browse/SPARK-17282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > Users can change the statistics by the DDL statement: > {noformat} > ALTER TABLE UPDATE STATISTICS SET > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET
Xiao Li created SPARK-17282: --- Summary: Implement ALTER TABLE UPDATE STATISTICS SET Key: SPARK-17282 URL: https://issues.apache.org/jira/browse/SPARK-17282 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Users can change the statistics by the DDL statement: {noformat} ALTER TABLE UPDATE STATISTICS SET {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12619) Combine small files in a hadoop directory into single split
[ https://issues.apache.org/jira/browse/SPARK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12619. --- Resolution: Duplicate > Combine small files in a hadoop directory into single split > > > Key: SPARK-12619 > URL: https://issues.apache.org/jira/browse/SPARK-12619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Navis >Priority: Trivial > > When a directory contains too many (small) files, whole spark cluster will be > exhausted scheduling tasks created for each file. Custom input format can > handle that but if you're using hive metastore, it could hardly be an option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
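The combining this issue asks for can be sketched as greedily packing files into splits up to a target size (plain Python stand-in; Hadoop's `CombineFileInputFormat` does this for real, with locality taken into account):

```python
def combine_into_splits(file_sizes, target_bytes):
    """Greedily pack files into splits of at most target_bytes
    (a single file larger than the target gets its own split)."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > target_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

files = [("a", 10), ("b", 20), ("c", 90), ("d", 5)]
print(combine_into_splits(files, 100))  # [['a', 'b'], ['c', 'd']]
```

Packing many small files into one split keeps the number of scheduled tasks proportional to data volume rather than to file count, which is the scheduling pressure the report describes.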
[jira] [Resolved] (SPARK-17214) How to deal with dots (.) present in column names in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Bansal resolved SPARK-17214. -- Resolution: Later > How to deal with dots (.) present in column names in SparkR > --- > > Key: SPARK-17214 > URL: https://issues.apache.org/jira/browse/SPARK-17214 > Project: Spark > Issue Type: Bug >Reporter: Mohit Bansal > > I am trying to load a local csv file into SparkR, which contains dots in > column names. After reading the file I tried to change the names and replaced > "." with "_". Still I am not able to do any operation on the created SDF. > Here is the reproducible code: > --- > #writing iris dataset to local > write.csv(iris,"iris.csv",row.names=F) > #reading it back using read.df > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > #changing column names > names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species") > #selecting required columna > head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width)) > - > 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on > org.apache.spark.sql.api.r.SQLUtils failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: > org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length > given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$cl > What should I do to get it work? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16647) sparksql1.6.2 on yarn with hive metastore1.0.0 throws "alter_table_with_cascade" exception
[ https://issues.apache.org/jira/browse/SPARK-16647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15443140#comment-15443140 ] fengchaoge commented on SPARK-16647: Do you have resolve this problem? I have the same problem. my hive's version is 0.13.1 > sparksql1.6.2 on yarn with hive metastore1.0.0 thows > "alter_table_with_cascade" exception > - > > Key: SPARK-16647 > URL: https://issues.apache.org/jira/browse/SPARK-16647 > Project: Spark > Issue Type: Bug >Reporter: zhangshuxin > > my spark version is 1.6.2(1.5.2,1.5.0) and hive version is 1.0.0 > when i execute some sql like 'create table tbl1 as select * from tbl2' or > 'insert overwrite table tabl1 select * from tbl2',i get the following > exception > 16/07/20 10:14:13 WARN metastore.RetryingMetaStoreClient: MetaStoreClient > lost connection. Attempting to reconnect. > org.apache.thrift.TApplicationException: Invalid method name: > 'alter_table_with_cascade' > at > org.apache.thrift.TApplicationException.read(TApplicationException.java:111) > at > org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_cascade(ThriftHiveMetastore.java:1374) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.alter_table_with_cascade(ThriftHiveMetastore.java:1358) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table(HiveMetaStoreClient.java:340) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table(SessionHiveMetaStoreClient.java:251) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) > at 
com.sun.proxy.$Proxy27.alter_table(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:496) > at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484) > at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:441) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) > at > org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at >
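A likely cause of the trace above is a client/server version mismatch: Spark's bundled Hive client invokes a Thrift method (`alter_table_with_cascade`) that older metastores (0.13.x, 1.0.x) do not expose. The usual workaround is to point Spark at a matching metastore client. The configuration keys below are real Spark SQL settings; the version value is illustrative for the 0.13.1 setup described in the comment:

```
# spark-defaults.conf -- pin the Hive metastore client Spark uses
spark.sql.hive.metastore.version  0.13.1
spark.sql.hive.metastore.jars     maven
```

With `maven`, Spark downloads Hive client jars of the requested version; alternatively a classpath of matching jars can be supplied.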
[jira] [Closed] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangyu closed SPARK-15044. --- > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. "hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in spark 1.6.1, if I use spark 1.4.0, It is OK > I think spark-sql should ignore the path, just like hive or it dose in early > versions, rather than throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
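What the reporter asks for can be sketched as a pre-check that drops partition locations that no longer exist before input splits are listed (plain Python stand-in using the local filesystem; Spark/Hive would consult HDFS instead):

```python
# Hypothetical sketch: filter out partition directories removed out-of-band,
# instead of letting input-split listing fail on the first missing path.
import os
import tempfile

def existing_partitions(paths):
    """Keep only partition locations that are still present."""
    return [p for p in paths if os.path.isdir(p)]

root = tempfile.mkdtemp()
p1 = os.path.join(root, "p=1")
os.mkdir(p1)
p2 = os.path.join(root, "p=2")  # still registered in the metastore, but deleted
print(existing_partitions([p1, p2]))  # only p=1 survives
```

This mirrors the behavior the reporter attributes to Hive and to earlier Spark versions: a partition whose directory was deleted manually is silently skipped rather than aborting the query.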
[jira] [Comment Edited] (SPARK-17214) How to deal with dots (.) present in column names in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442870#comment-15442870 ] Felix Cheung edited comment on SPARK-17214 at 8/28/16 6:14 AM: --- I think the underlining issue is that we should either handle column names with `.` correctly (preferred) or translate them uniformly as in other cases (eg. `as.DataFrame`) As of now a DataFrame from csv source can have `.` in column names and it is unoperable until renamed: {code} > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > iris_sdf SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, Petal.Width:double, Species:string] > head(select(iris_sdf,iris_sdf$Sepal.Length)) 16/08/28 06:11:16 ERROR RBackendHandler: col on 46 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: Cannot resolve column name "Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species); {code} was (Author: felixcheung): I think the underlining issue is that we should either handle column names with `.` correctly (preferred) or translate them uniformly as in other cases (eg. `as.DataFrame`) As of now a DataFrame from csv source can have `.` in column names and it is unoperable until renamed: {code} > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > iris_sdf SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, Petal.Width:double, Species:string] {code} > How to deal with dots (.) present in column names in SparkR > --- > > Key: SPARK-17214 > URL: https://issues.apache.org/jira/browse/SPARK-17214 > Project: Spark > Issue Type: Bug >Reporter: Mohit Bansal > > I am trying to load a local csv file into SparkR, which contains dots in > column names. After reading the file I tried to change the names and replaced > "." with "_". Still I am not able to do any operation on the created SDF. 
> Here is the reproducible code: > --- > #writing iris dataset to local > write.csv(iris,"iris.csv",row.names=F) > #reading it back using read.df > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > #changing column names > names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species") > #selecting required columna > head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width)) > - > 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on > org.apache.spark.sql.api.r.SQLUtils failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length > given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$cl > What should I do to get it work? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17214) How to deal with dots (.) present in column names in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442870#comment-15442870 ] Felix Cheung edited comment on SPARK-17214 at 8/28/16 6:15 AM: --- I think the underlining issue is that we should either handle column names with `.` correctly (preferred) or translate them uniformly as in other cases (eg. `as.DataFrame`) As of now a DataFrame from csv source can have `.` in column names and it is unoperable until renamed (which is a known issue): {code} > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > iris_sdf SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, Petal.Width:double, Species:string] > head(select(iris_sdf,iris_sdf$Sepal.Length)) 16/08/28 06:11:16 ERROR RBackendHandler: col on 46 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: Cannot resolve column name "Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species); {code} was (Author: felixcheung): I think the underlining issue is that we should either handle column names with `.` correctly (preferred) or translate them uniformly as in other cases (eg. `as.DataFrame`) As of now a DataFrame from csv source can have `.` in column names and it is unoperable until renamed: {code} > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > iris_sdf SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, Petal.Width:double, Species:string] > head(select(iris_sdf,iris_sdf$Sepal.Length)) 16/08/28 06:11:16 ERROR RBackendHandler: col on 46 failed Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: Cannot resolve column name "Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species); {code} > How to deal with dots (.) 
present in column names in SparkR > --- > > Key: SPARK-17214 > URL: https://issues.apache.org/jira/browse/SPARK-17214 > Project: Spark > Issue Type: Bug >Reporter: Mohit Bansal > > I am trying to load a local csv file into SparkR, which contains dots in > column names. After reading the file I tried to change the names and replaced > "." with "_". Still I am not able to do any operation on the created SDF. > Here is the reproducible code: > --- > #writing iris dataset to local > write.csv(iris,"iris.csv",row.names=F) > #reading it back using read.df > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > #changing column names > names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species") > #selecting required columna > head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width)) > - > 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on > org.apache.spark.sql.api.r.SQLUtils failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: > org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length > given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$cl > What should I do to get it work? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17214) How to deal with dots (.) present in column names in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442870#comment-15442870 ] Felix Cheung commented on SPARK-17214: -- I think the underlining issue is that we should either handle column names with `.` correctly (preferred) or translate them uniformly as in other cases (eg. `as.DataFrame`) As of now a DataFrame from csv source can have `.` in column names and it is unoperable until renamed: {code} > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > iris_sdf SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, Petal.Width:double, Species:string] {code} > How to deal with dots (.) present in column names in SparkR > --- > > Key: SPARK-17214 > URL: https://issues.apache.org/jira/browse/SPARK-17214 > Project: Spark > Issue Type: Bug >Reporter: Mohit Bansal > > I am trying to load a local csv file into SparkR, which contains dots in > column names. After reading the file I tried to change the names and replaced > "." with "_". Still I am not able to do any operation on the created SDF. > Here is the reproducible code: > --- > #writing iris dataset to local > write.csv(iris,"iris.csv",row.names=F) > #reading it back using read.df > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > #changing column names > names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species") > #selecting required columna > head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width)) > - > 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on > org.apache.spark.sql.api.r.SQLUtils failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: > org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length > given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$cl > What should I do to get it to work?
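[Editorial note] The `Unable to resolve Sepal.Length` error above stems from how an unquoted dotted name is parsed: `Sepal.Length` is read as field `Length` of a column `Sepal` (struct access), not as a single column literally named `Sepal.Length`. In Spark SQL, backticks force the literal interpretation. A plain-Python toy sketch of that ambiguity (the function `resolve` is hypothetical and illustrative only, not Spark's actual Catalyst resolver):

```python
def resolve(name, columns):
    """Toy name resolver: an unquoted dotted name is treated as
    struct-field access, so a column literally named 'Sepal.Length'
    is only found when the name is backtick-quoted."""
    if name.startswith("`") and name.endswith("`"):
        literal = name[1:-1]       # backticks force a literal lookup
        return literal if literal in columns else None
    head = name.split(".")[0]      # unquoted: first segment must be a column
    return head if head in columns else None

cols = ["Sepal.Length", "Sepal.Width", "Species"]
resolve("Sepal.Length", cols)     # None: looks for a column named 'Sepal'
resolve("`Sepal.Length`", cols)   # 'Sepal.Length': matched literally
```

This is why Felix's renamed (`Sepal_Length`) columns resolve fine: with no dot, there is nothing to misinterpret.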
[jira] [Commented] (SPARK-17214) How to deal with dots (.) present in column names in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442865#comment-15442865 ] Felix Cheung commented on SPARK-17214: -- [~bansalism] what version of Spark + SparkR are you testing with? I ran your example and it worked: {code} > #writing iris dataset to local > write.csv(iris,"iris.csv",row.names=F) > #reading it back using read.df > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > #changing column names > names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species") > iris_sdf SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string] > head(iris_sdf) Sepal_Length Sepal_Width Petal_Length Petal_Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa > a <- select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width) > head(a) Sepal_Length Sepal_Width 1 5.1 3.5 2 4.9 3.0 3 4.7 3.2 4 4.6 3.1 5 5.0 3.6 6 5.4 3.9 > {code} > How to deal with dots (.) present in column names in SparkR > --- > > Key: SPARK-17214 > URL: https://issues.apache.org/jira/browse/SPARK-17214 > Project: Spark > Issue Type: Bug >Reporter: Mohit Bansal > > I am trying to load a local csv file, whose column names contain dots, into > SparkR. After reading the file I tried to change the names and replaced > "." with "_", but I am still not able to do any operation on the created SDF. 
> Here is the reproducible code: > --- > #writing iris dataset to local > write.csv(iris,"iris.csv",row.names=F) > #reading it back using read.df > iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true") > #changing column names > names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species") > #selecting required columns > head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width)) > - > 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on > org.apache.spark.sql.api.r.SQLUtils failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length > given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$cl > What should I do to get it to work?
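[Editorial note] The workaround Felix's successful run relies on is renaming the columns before any operation touches them (`names(iris_sdf) <- ...` in SparkR; PySpark users typically use `df.toDF(*new_names)` or `withColumnRenamed`). The rename itself can be sketched in plain Python (the helper name `sanitize_names` is hypothetical, chosen here for illustration):

```python
def sanitize_names(names, sep="_"):
    """Replace '.' in column names so resolvers that treat dots as
    struct-field access see plain, unambiguous identifiers."""
    return [n.replace(".", sep) for n in names]

sanitize_names(["Sepal.Length", "Sepal.Width", "Species"])
# → ['Sepal_Length', 'Sepal_Width', 'Species']
```

The alternative, where renaming is not an option, is to backtick-quote the dotted name so Spark SQL resolves it literally.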