[jira] [Created] (SPARK-17286) The fetched data of a shuffle read is stored in a ManagedBuffer; is its underlying data kept in off-heap memory or in a file?

2016-08-28 Thread song fengfei (JIRA)
song fengfei created SPARK-17286:


 Summary: The fetched data of a shuffle read is stored in a ManagedBuffer; is 
its underlying data kept in off-heap memory or in a file?
 Key: SPARK-17286
 URL: https://issues.apache.org/jira/browse/SPARK-17286
 Project: Spark
  Issue Type: Question
  Components: Block Manager, Input/Output, Shuffle
Reporter: song fengfei


The fetched data of a shuffle read is stored in a ManagedBuffer, so is its 
underlying data kept in off-heap memory or in a file? If it is kept in off-heap 
memory, is it also managed by Spark's MemoryManager? And if a map output is too 
big, wouldn't placing it directly in memory easily lead to an OOM? I did not 
understand this part and am hoping to get an answer here, thanks!
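For reference, Spark's ManagedBuffer has both file-backed and memory-backed 
implementations. The sketch below is only a conceptual model of that distinction 
(it does not use Spark's internal API); it illustrates why a very large map output 
held entirely in memory is the worrying case.

{code}
// Conceptual sketch only -- not Spark's internal API. It models the two kinds of
// backing the question asks about: bytes held in memory vs. a file segment that
// is read on demand.
import java.io.{File, RandomAccessFile}
import java.nio.ByteBuffer

sealed trait FetchedBlock { def size: Long }

// Backed by an in-memory (possibly direct/off-heap) buffer: cheap to consume,
// but holding a huge map output this way is what could lead to OOM.
final case class InMemoryBlock(buf: ByteBuffer) extends FetchedBlock {
  def size: Long = buf.remaining()
}

// Backed by a file segment: only (file, offset, length) is kept in memory;
// the bytes stay on disk until someone actually reads them.
final case class FileSegmentBlock(file: File, offset: Long, length: Long) extends FetchedBlock {
  def size: Long = length
  def read(): ByteBuffer = {
    val raf = new RandomAccessFile(file, "r")
    try {
      val bytes = new Array[Byte](length.toInt)
      raf.seek(offset)
      raf.readFully(bytes)
      ByteBuffer.wrap(bytes)
    } finally raf.close()
  }
}
{code}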






[jira] [Commented] (SPARK-17198) ORC fixed char literal filter does not work

2016-08-28 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444781#comment-15444781
 ] 

Dongjoon Hyun commented on SPARK-17198:
---

Yes. Spark 2.0 improves SQL features greatly. I think that Spark 2.0.1 will add 
more stability fixes.


> ORC fixed char literal filter does not work
> ---
>
> Key: SPARK-17198
> URL: https://issues.apache.org/jira/browse/SPARK-17198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: tuming
>
> I get a wrong result when I run the following query in Spark SQL:
> select * from orc_table where char_col ='5LZS';
> Table orc_table is an ORC-format table.
> Column char_col is defined as char(6). 
> The Hive record reader returns a char(6) string to Spark, and Spark has no 
> fixed-length char type: all fixed char attributes are converted to String by 
> default. Meanwhile, the constant literal is parsed to a string Literal, so the 
> equality comparison will never return true. For instance: '5LZS' == '5LZS  ' 
> (see the sketch below).
> But I get the correct result in Hive with the same data and SQL string, because 
> Hive appends spaces to such constant literals. Please refer to:
> https://issues.apache.org/jira/browse/HIVE-11312
> I found there is no such patch for Spark.
>  
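A minimal sketch of the mismatch described above, with a hypothetical value and no 
Spark involved: the char(6) value comes back space-padded, the string literal does 
not, and padding the literal (as HIVE-11312 does on the Hive side) makes the 
equality hold.

{code}
// Minimal sketch (hypothetical value, no Spark involved) of the comparison mismatch.
val fromOrc = "5LZS  "   // what the Hive record reader hands back for a char(6) value
val literal = "5LZS"     // what the predicate's string literal parses to

println(fromOrc == literal)                        // false: '5LZS  ' != '5LZS'
val padded = literal + " " * (6 - literal.length)  // pad the literal, as HIVE-11312 does
println(fromOrc == padded)                         // true
{code}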






[jira] [Commented] (SPARK-17198) ORC fixed char literal filter does not work

2016-08-28 Thread tuming (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444755#comment-15444755
 ] 

tuming commented on SPARK-17198:


Yes, it works fine on Spark 2.0. I am using Spark 1.5.1. 

I can get the column types with the "desc" command. They are different in Spark 
2.0 and Spark 1.5.1.

Spark 2.0:
spark-sql> desc orc_test;
col1    string      NULL
col2    string      NULL
spark-sql> 

Spark 1.5.1:
spark-sql> desc orc_test;
col1    string      NULL
col2    char(10)    NULL

I have looked into the source code and found that Spark 1.5.1 invokes the Hive 
native command to execute the CREATE TABLE SQL (in non-CTAS cases).
I have no idea whether this is a change in Hive or in Spark; the SQL parser in 
Spark 2.0 is much different from the one in 1.5.1.





> ORC fixed char literal filter does not work
> ---
>
> Key: SPARK-17198
> URL: https://issues.apache.org/jira/browse/SPARK-17198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: tuming
>
> I get a wrong result when I run the following query in Spark SQL:
> select * from orc_table where char_col ='5LZS';
> Table orc_table is an ORC-format table.
> Column char_col is defined as char(6). 
> The Hive record reader returns a char(6) string to Spark, and Spark has no 
> fixed-length char type: all fixed char attributes are converted to String by 
> default. Meanwhile, the constant literal is parsed to a string Literal, so the 
> equality comparison will never return true. For instance: '5LZS' == '5LZS  '.
> But I get the correct result in Hive with the same data and SQL string, because 
> Hive appends spaces to such constant literals. Please refer to:
> https://issues.apache.org/jira/browse/HIVE-11312
> I found there is no such patch for Spark.
>  






[jira] [Commented] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"

2016-08-28 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444694#comment-15444694
 ] 

Jeff Zhang commented on SPARK-17261:


[~dongjoon] spark-shell works well for me. It seems your case is due to 
something else. 

> Using HiveContext after re-creating SparkContext in Spark 2.0 throws 
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> -
>
> Key: SPARK-17261
> URL: https://issues.apache.org/jira/browse/SPARK-17261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Amazon AWS EMR 5.0
>Reporter: Rahul Jain
> Fix For: 2.0.0
>
>
> After stopping a SparkSession, if we recreate it and use HiveContext in it, it 
> will throw an error.
> Steps to reproduce:
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> spark.stop()
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> The above error occurs only in PySpark, not in spark-shell.






[jira] [Assigned] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17261:


Assignee: Apache Spark

> Using HiveContext after re-creating SparkContext in Spark 2.0 throws 
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> -
>
> Key: SPARK-17261
> URL: https://issues.apache.org/jira/browse/SPARK-17261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Amazon AWS EMR 5.0
>Reporter: Rahul Jain
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> After stopping a SparkSession, if we recreate it and use HiveContext in it, it 
> will throw an error.
> Steps to reproduce:
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> spark.stop()
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> The above error occurs only in PySpark, not in spark-shell.






[jira] [Assigned] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17261:


Assignee: (was: Apache Spark)

> Using HiveContext after re-creating SparkContext in Spark 2.0 throws 
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> -
>
> Key: SPARK-17261
> URL: https://issues.apache.org/jira/browse/SPARK-17261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Amazon AWS EMR 5.0
>Reporter: Rahul Jain
> Fix For: 2.0.0
>
>
> After stopping a SparkSession, if we recreate it and use HiveContext in it, it 
> will throw an error.
> Steps to reproduce:
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> spark.stop()
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> The above error occurs only in PySpark, not in spark-shell.






[jira] [Commented] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"

2016-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444690#comment-15444690
 ] 

Apache Spark commented on SPARK-17261:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14857

> Using HiveContext after re-creating SparkContext in Spark 2.0 throws 
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> -
>
> Key: SPARK-17261
> URL: https://issues.apache.org/jira/browse/SPARK-17261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Amazon AWS EMR 5.0
>Reporter: Rahul Jain
> Fix For: 2.0.0
>
>
> After stopping a SparkSession, if we recreate it and use HiveContext in it, it 
> will throw an error.
> Steps to reproduce:
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> spark.stop()
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> The above error occurs only in PySpark, not in spark-shell.






[jira] [Commented] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-28 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444614#comment-15444614
 ] 

Liwei Lin commented on SPARK-17061:
---

Oh cool! Thank you for the well-formed reproducer!

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Jamie Hutton
>Assignee: Liwei Lin
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> We have hit a consistent bug when we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark returns the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case which shows the issue (it will 
> run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange outcome where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern (a compact sketch of it appears at the 
> end of this message). My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables each would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of Datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
>   
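For readers following the prose steps above, here is a compact, self-contained 
sketch of the same groupByKey-join-map pattern, with hypothetical Parent/Child 
case classes and tiny inline data (paste into spark-shell; with this few columns 
the result comes out correctly):

{code}
// Compact sketch (hypothetical case classes, tiny data) of the pattern described
// in the report: group children by parent key, join back onto the parents, then
// map the joined rows down to (Parent, Seq[Child]).
import org.apache.spark.sql.SparkSession

case class Parent(parentId: Int, name: String)
case class Child(parentId: Int, value: String)

val spark = SparkSession.builder.master("local[*]").appName("join-map-sketch").getOrCreate()
import spark.implicits._

val parents  = Seq(Parent(1, "p1"), Parent(2, "p2")).toDS()
val children = Seq(Child(1, "a"), Child(1, "b"), Child(2, "c")).toDS()

// groupByKey up to the parent: Dataset[(parentId, Seq[Child])]
val grouped = children.groupByKey(_.parentId).mapGroups((id, cs) => (id, cs.toSeq))

// join the grouped children onto the parents, then map to (Parent, Seq[Child])
val joined = parents.joinWith(grouped, parents("parentId") === grouped("_1"))
val result = joined.map { case (p, (_, cs)) => (p, cs) }

result.show(false)
{code}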

[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-28 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444579#comment-15444579
 ] 

Sun Rui commented on SPARK-13525:
-

Could you help do some debugging by modifying the R code? 

The steps are:
1. Modify /R/lib/SparkR/worker/daemon.R and add some 
checkpoints like 'cat("xxx", file="/tmp/sparkr-debug")' around some key points:
{code}
p <- parallel:::mcfork()
...
source(script)
...
{code}

2. Zip the /R/lib/SparkR directory to override 
the existing zip file: /R/lib/sparkr.zip
3. Run your Spark application, then check the local file "/tmp/sparkr-debug" on 
all your YARN worker nodes (you can limit the number of executors for debugging).

Hopefully this helps find where the code breaks.

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head, summary, and filter methods are not overridden by Spark, hence I 
> need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Created] (SPARK-17285) ZeroOutPaddingBytes Causing Fatal JVM Error

2016-08-28 Thread Dean Chen (JIRA)
Dean Chen created SPARK-17285:
-

 Summary: ZeroOutPaddingBytes Causing Fatal JVM Error
 Key: SPARK-17285
 URL: https://issues.apache.org/jira/browse/SPARK-17285
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Dean Chen


Log below. Only happens with certain datasets.

{code}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f16d55157ac, pid=10860, tid=139687214692096
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 
1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 )
# Problematic frame:
# J 4968 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 
bytes) @ 0x7f16d55157ac [0x7f16d55157a0+0xc]
#
# Core dump written. Default location: 
/spark/scratch/app-20160828232538-0049/3/core or core.10860
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---  T H R E A D  ---

Current thread (0x7f0bd41c1000):  JavaThread "Executor task launch 
worker-8" daemon [_thread_in_Java, id=10957, 
stack(0x7f0b76bfb000,0x7f0b76cfc000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 
0x7f0ca8cf6518

Registers:
RAX=0x7f0b1badd608, RBX=0x7f0b1badd608, RCX=0x, 
RDX=0xc28f5ea0
RSP=0x7f0b76cf8ba0, RBP=0x7f0b76cf8c60, RSI=0x7f0ce6400678, 
RDI=0x7f16d5007fd4
R8 =0x7f0b76cf8af0, R9 =0x0268, R10=0x7f16ebcf8560, 
R11=0x7f16d55157a0
R12=0x7f1396e553b0, R13=0x7f0b76cf8bf8, R14=0x7f0b76cf8c78, 
R15=0x7f0bd41c1000
RIP=0x7f16d55157ac, EFLAGS=0x00010206, CSGSFS=0xd6390033, 
ERR=0x0006
  TRAPNO=0x000e

Top of Stack: (sp=0x7f0b76cf8ba0)
0x7f0b76cf8ba0:   0018 0278
0x7f0b76cf8bb0:   7f0b76cf8bb0 7f0b1b7d62e8
0x7f0b76cf8bc0:   7f0b76cf8c00 7f0b1b7d6560
0x7f0b76cf8bd0:    7f0b1b7d6308
0x7f0b76cf8be0:   7f0b76cf8c60 7f16d5007fd4
0x7f0b76cf8bf0:   7f16d5007fd4 
0x7f0b76cf8c00:   0358 c28f5ea0
0x7f0b76cf8c10:   0278 7f0ce6400678
0x7f0b76cf8c20:   7f0b76cf8c20 7f0b1b7d89cd
0x7f0b76cf8c30:   7f0b76cf8c78 7f0b1b7d9ff0
0x7f0b76cf8c40:    7f0b1b7d89f8
0x7f0b76cf8c50:   7f0b76cf8bf8 7f0b76cf8c70
0x7f0b76cf8c60:   7f0b76cf8cc0 7f16d5007fd4
0x7f0b76cf8c70:   c28f5c29 7f0ce6400a10
0x7f0b76cf8c80:   7f0b76cf8c80 7f0b1b7d98de
0x7f0b76cf8c90:   7f0b76cf8cf0 7f0b1b7d9ff0
0x7f0b76cf8ca0:    7f0b1b7d9958
0x7f0b76cf8cb0:   7f0b76cf8c70 7f0b76cf8ce0
0x7f0b76cf8cc0:   7f0b76cf8d38 7f16d5007fd4
0x7f0b76cf8cd0:   c28f5c30 c28f5c29
0x7f0b76cf8ce0:   7f1396e55a48 0038
0x7f0b76cf8cf0:   7f0ce6400a10 7f0b76cf8cf8
0x7f0b76cf8d00:   7f0b1b483605 7f0b76cf9988
0x7f0b76cf8d10:   7f0b1b483b18 
0x7f0b76cf8d20:   7f0b1b4837e0 7f0b76cf8ce0
0x7f0b76cf8d30:   7f0b76cf9988 7f0b76cf99d0
0x7f0b76cf8d40:   7f16d5007fd4 
0x7f0b76cf8d50:    
0x7f0b76cf8d60:   bff0 
0x7f0b76cf8d70:   0001 
0x7f0b76cf8d80:    
0x7f0b76cf8d90:   bff0  

Instructions: (pc=0x7f16d55157ac)
0x7f16d551578c:   0a 80 11 64 01 f8 12 fe 06 90 0c 64 50 c0 39 c0
0x7f16d551579c:   4d c0 12 c0 89 84 24 00 c0 fe ff 55 48 83 ec 40
0x7f16d55157ac:   48 89 0c 16 48 83 c4 40 5d 85 05 45 29 62 17 c3
0x7f16d55157bc:   90 90 49 8b 87 90 02 00 00 49 ba 00 00 00 00 00 

Register to memory mapping:

RAX={method} {0x7f0b1badd608} 'putLong' '(Ljava/lang/Object;JJ)V' in 
'org/apache/spark/unsafe/Platform'
RBX={method} {0x7f0b1badd608} 'putLong' '(Ljava/lang/Object;JJ)V' in 
'org/apache/spark/unsafe/Platform'
RCX=0x is an unknown value
RDX=0xc28f5ea0 is an unknown value
RSP=0x7f0b76cf8ba0 is pointing into the stack for thread: 0x7f0bd41c1000
RBP=0x7f0b76cf8c60 is pointing into the stack for thread: 0x7f0bd41c1000
RSI=0x7f0ce6400678 is an oop
[B 
 - klass: {type array byte}
 - length: 856
RDI=0x7f16d5007fd4 is at code_begin+2292 in an Interpreter codelet
invoke return entry points  [0x7f16d50076e0, 0x7f16d50080c0]  2528 bytes
R8 =0x7f0b76cf8af0 is pointing into the stack for thread: 0x7f0bd41c1000
R9 =0x0268 is an unknown value
R10=0x7f16ebcf8560:  in 

[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444279#comment-15444279
 ] 

Apache Spark commented on SPARK-17241:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14856

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.
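For context, a small sketch of the Scala-side estimator this refers to: 
GeneralizedLinearRegression already exposes regParam, and the ask here is to 
surface the same knob through SparkR's spark.glm. The dataset and values below 
are made up for illustration.

{code}
// Sketch with made-up data: the Scala-side L2 knob (regParam) that SparkR's
// spark.glm would need to expose for ridge regression.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("glm-sketch").getOrCreate()
import spark.implicits._

val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(1.0, 0.5))
).toDF("label", "features")

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setRegParam(0.3)   // L2 penalty -> ridge regression

val model = glr.fit(training)
println(model.coefficients)
{code}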






[jira] [Assigned] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17241:


Assignee: (was: Apache Spark)

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.






[jira] [Assigned] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17241:


Assignee: Apache Spark

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>Assignee: Apache Spark
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.






[jira] [Assigned] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17284:


Assignee: (was: Apache Spark)

> Remove statistics-related table properties from SHOW CREATE TABLE
> -
>
> Key: SPARK-17284
> URL: https://issues.apache.org/jira/browse/SPARK-17284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> CREATE TABLE t1 (
>   c1 INT COMMENT 'bla',
>   c2 STRING
> )
> LOCATION '$dir'
> TBLPROPERTIES (
>   'prop1' = 'value1',
>   'prop2' = 'value2'
> )
> {noformat}
> The output of {{SHOW CREATE TABLE t1}} is 
> {noformat}
> CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION 
> 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75'
> TBLPROPERTIES (
>   'rawDataSize' = '-1',
>   'numFiles' = '0',
>   'transient_lastDdlTime' = '1472424052',
>   'totalSize' = '0',
>   'prop1' = 'value1',
>   'prop2' = 'value2',
>   'COLUMN_STATS_ACCURATE' = 'false',
>   'numRows' = '-1'
> )
> {noformat}
> The statistics-related table properties should be skipped by {{SHOW CREATE 
> TABLE}}, since they could be incorrect in the newly created table. See the Hive 
> JIRA: https://issues.apache.org/jira/browse/HIVE-13792






[jira] [Commented] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE

2016-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15444259#comment-15444259
 ] 

Apache Spark commented on SPARK-17284:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14855

> Remove statistics-related table properties from SHOW CREATE TABLE
> -
>
> Key: SPARK-17284
> URL: https://issues.apache.org/jira/browse/SPARK-17284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> CREATE TABLE t1 (
>   c1 INT COMMENT 'bla',
>   c2 STRING
> )
> LOCATION '$dir'
> TBLPROPERTIES (
>   'prop1' = 'value1',
>   'prop2' = 'value2'
> )
> {noformat}
> The output of {{SHOW CREATE TABLE t1}} is 
> {noformat}
> CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION 
> 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75'
> TBLPROPERTIES (
>   'rawDataSize' = '-1',
>   'numFiles' = '0',
>   'transient_lastDdlTime' = '1472424052',
>   'totalSize' = '0',
>   'prop1' = 'value1',
>   'prop2' = 'value2',
>   'COLUMN_STATS_ACCURATE' = 'false',
>   'numRows' = '-1'
> )
> {noformat}
> The statistics-related table properties should be skipped by {{SHOW CREATE 
> TABLE}}, since they could be incorrect in the newly created table. See the Hive 
> JIRA: https://issues.apache.org/jira/browse/HIVE-13792






[jira] [Assigned] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17284:


Assignee: Apache Spark

> Remove statistics-related table properties from SHOW CREATE TABLE
> -
>
> Key: SPARK-17284
> URL: https://issues.apache.org/jira/browse/SPARK-17284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {noformat}
> CREATE TABLE t1 (
>   c1 INT COMMENT 'bla',
>   c2 STRING
> )
> LOCATION '$dir'
> TBLPROPERTIES (
>   'prop1' = 'value1',
>   'prop2' = 'value2'
> )
> {noformat}
> The output of {{SHOW CREATE TABLE t1}} is 
> {noformat}
> CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION 
> 'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75'
> TBLPROPERTIES (
>   'rawDataSize' = '-1',
>   'numFiles' = '0',
>   'transient_lastDdlTime' = '1472424052',
>   'totalSize' = '0',
>   'prop1' = 'value1',
>   'prop2' = 'value2',
>   'COLUMN_STATS_ACCURATE' = 'false',
>   'numRows' = '-1'
> )
> {noformat}
> The statistics-related table properties should be skipped by {{SHOW CREATE 
> TABLE}}, since they could be incorrect in the newly created table. See the Hive 
> JIRA: https://issues.apache.org/jira/browse/HIVE-13792






[jira] [Created] (SPARK-17284) Remove statistics-related table properties from SHOW CREATE TABLE

2016-08-28 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17284:
---

 Summary: Remove statistics-related table properties from SHOW 
CREATE TABLE
 Key: SPARK-17284
 URL: https://issues.apache.org/jira/browse/SPARK-17284
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


{noformat}
CREATE TABLE t1 (
  c1 INT COMMENT 'bla',
  c2 STRING
)
LOCATION '$dir'
TBLPROPERTIES (
  'prop1' = 'value1',
  'prop2' = 'value2'
)
{noformat}

The output of {{SHOW CREATE TABLE t1}} is 

{noformat}
CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 
'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75'
TBLPROPERTIES (
  'rawDataSize' = '-1',
  'numFiles' = '0',
  'transient_lastDdlTime' = '1472424052',
  'totalSize' = '0',
  'prop1' = 'value1',
  'prop2' = 'value2',
  'COLUMN_STATS_ACCURATE' = 'false',
  'numRows' = '-1'
)
{noformat}

The statistics-related table properties should be skipped by {{SHOW CREATE 
TABLE}}, since they could be incorrect in the newly created table. See the Hive 
JIRA: https://issues.apache.org/jira/browse/HIVE-13792
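A minimal sketch of the filtering idea. The key names are taken from the example 
output above, not from Spark's source, and the property map is hard-coded for 
illustration.

{code}
// Minimal sketch: drop statistics-related keys before rendering TBLPROPERTIES.
// The key list is taken from the example output above, not from Spark's source.
val statsKeys = Set("numFiles", "numRows", "rawDataSize", "totalSize",
  "COLUMN_STATS_ACCURATE", "transient_lastDdlTime")

val tblProps = Map(
  "rawDataSize" -> "-1", "numFiles" -> "0", "transient_lastDdlTime" -> "1472424052",
  "totalSize" -> "0", "prop1" -> "value1", "prop2" -> "value2",
  "COLUMN_STATS_ACCURATE" -> "false", "numRows" -> "-1")

val shown = tblProps.filter { case (k, _) => !statsKeys.contains(k) }
println(shown)   // only prop1 and prop2 survive
{code}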







[jira] [Commented] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received

2016-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15443975#comment-15443975
 ] 

Apache Spark commented on SPARK-17283:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14854

> Cancel job in RDD.take() as soon as enough output is received
> --
>
> Key: SPARK-17283
> URL: https://issues.apache.org/jira/browse/SPARK-17283
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The current implementation of RDD.take() waits until all partitions of each 
> job have been computed before checking whether enough rows have been 
> received. If take() were to perform this check on-the-fly as individual 
> partitions were completed then it could stop early, offering large speedups 
> for certain interactive queries.






[jira] [Assigned] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17283:


Assignee: Apache Spark  (was: Josh Rosen)

> Cancel job in RDD.take() as soon as enough output is received
> --
>
> Key: SPARK-17283
> URL: https://issues.apache.org/jira/browse/SPARK-17283
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The current implementation of RDD.take() waits until all partitions of each 
> job have been computed before checking whether enough rows have been 
> received. If take() were to perform this check on-the-fly as individual 
> partitions were completed then it could stop early, offering large speedups 
> for certain interactive queries.






[jira] [Assigned] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received

2016-08-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17283:


Assignee: Josh Rosen  (was: Apache Spark)

> Cancel job in RDD.take() as soon as enough output is received
> --
>
> Key: SPARK-17283
> URL: https://issues.apache.org/jira/browse/SPARK-17283
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The current implementation of RDD.take() waits until all partitions of each 
> job have been computed before checking whether enough rows have been 
> received. If take() were to perform this check on-the-fly as individual 
> partitions were completed then it could stop early, offering large speedups 
> for certain interactive queries.






[jira] [Created] (SPARK-17283) Cancel job in RDD.take() as soon as enough output is received

2016-08-28 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17283:
--

 Summary: Cancel job in RDD.take() as soon as enough output is 
received
 Key: SPARK-17283
 URL: https://issues.apache.org/jira/browse/SPARK-17283
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


The current implementation of RDD.take() waits until all partitions of each job 
have been computed before checking whether enough rows have been received. If 
take() were to perform this check on-the-fly as individual partitions were 
completed then it could stop early, offering large speedups for certain 
interactive queries.
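A rough sketch (not the actual Spark patch) of the on-the-fly idea, using the 
public submitJob API: each finished partition's rows go to a result handler, and 
the caller cancels the job as soon as enough rows have arrived.

{code}
// Rough sketch (not the actual patch): collect rows per finished partition via a
// result handler and cancel the job early once enough rows have been received.
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

val sc = SparkContext.getOrCreate(new SparkConf().setMaster("local[*]").setAppName("take-sketch"))
val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
val wanted = 10
val collected = new ArrayBuffer[Int]()

val job = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.take(wanted).toArray,   // per-partition work
  0 until rdd.partitions.length,                    // submit all partitions
  (index: Int, rows: Array[Int]) =>                 // invoked as each partition completes
    collected.synchronized { collected ++= rows },
  ()                                                // resultFunc (unused here)
)

// Check on the fly and cancel as soon as we have enough rows.
while (!job.isCompleted && collected.synchronized(collected.size) < wanted) Thread.sleep(10)
if (!job.isCompleted) job.cancel()
println(collected.synchronized(collected.take(wanted).toList))
{code}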






[jira] [Resolved] (SPARK-17271) Planner adds unnecessary Sort even if child ordering is semantically same as required ordering

2016-08-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17271.
---
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.1.0

> Planner adds unnecessary Sort even if child ordering is semantically same as 
> required ordering
> ---
>
> Key: SPARK-17271
> URL: https://issues.apache.org/jira/browse/SPARK-17271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.1.0
>
>
> Found a case where the planner adds an unneeded SORT operation due to a bug in 
> the way the comparison for `SortOrder` is done at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
> `SortOrder` needs to be compared semantically because the `Expression` within 
> two `SortOrder`s can be "semantically equal" without being literally equal 
> objects.
> e.g. in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`:
> Expression in required SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId,
> qualifier = Some("a")
>   )
> {code}
> Expression in child SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId)
> {code}
> Notice that the output column has a qualifier but the child attribute does not; 
> the underlying expression is the same, and hence in this case we can say that 
> the child satisfies the required sort order.
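To make the "semantically equal but not object-equal" point concrete, a small 
sketch that exercises Catalyst directly. This is internal, version-dependent API; 
the constructor shapes follow the 2.0-era code quoted above.

{code}
// Sketch only: Catalyst internals, version-dependent (2.0-era constructor shapes).
import org.apache.spark.sql.catalyst.expressions.{Ascending, AttributeReference, ExprId, SortOrder}
import org.apache.spark.sql.types.LongType

val id = ExprId(1)
val required = AttributeReference("col1", LongType, nullable = false)(exprId = id, qualifier = Some("a"))
val child    = AttributeReference("col1", LongType, nullable = false)(exprId = id)

println(required == child)               // false: the qualifier makes them unequal objects
println(required.semanticEquals(child))  // true: same underlying expression
// Comparing the SortOrders with plain equality therefore reports a mismatch,
// which is what makes the planner add the extra (unneeded) Sort.
println(SortOrder(required, Ascending) == SortOrder(child, Ascending))  // false
{code}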






[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-28 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15443779#comment-15443779
 ] 

koert kuipers commented on SPARK-17041:
---

Note that the docs say:
"It is highly discouraged to turn on case sensitive mode."

which seems to suggest one shouldn't have files like this. That is not very 
realistic. I feel like this is a case where a bad idea (SQL's lack of case 
sensitivity) is being pushed upon us. 

If only we could ignore the bad SQL ideas a bit more and be inspired by R's 
data.frame or pandas.

> Columns in schema are no longer case sensitive when reading csv file
> 
>
> Key: SPARK-17041
> URL: https://issues.apache.org/jira/browse/SPARK-17041
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> It used to be (in Spark 1.6.2) that I could read a CSV file that had columns 
> with names that differed only by case. For example, one column may be called 
> "output" and another "Output". Now (with Spark 2.0.0) if I try to read 
> such a file, I get an error like this:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, 
> could be: Output#1263, Output#1295.;
> {code}
> The schema (dfSchema below) that I pass to the csv read looks like this:
> {code}
> StructType( StructField(Output,StringType,true), ... 
> StructField(output,StringType,true), ...)
> {code}
> The code that does the read is this
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false") // Do not treat the first line as a header
>   .option("inferSchema", "false") // Do not infer types; use the provided schema
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}
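For what it's worth, a short sketch (hypothetical file path, minimal schema) of 
the setting the quoted doc note refers to: turning on case-sensitive resolution 
so that "Output" and "output" stay distinct.

{code}
// Sketch of the case-sensitivity setting (hypothetical file path and columns).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").appName("case-sketch").getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")   // off by default

val dfSchema = StructType(Seq(
  StructField("Output", StringType, nullable = true),
  StructField("output", StringType, nullable = true)))

val df = spark.read
  .format("csv")
  .option("header", "false")
  .schema(dfSchema)
  .csv("data.csv")             // hypothetical input path

df.select("Output").show()     // resolves unambiguously with case sensitivity on
{code}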






[jira] [Updated] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET

2016-08-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17282:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-16026

> Implement ALTER TABLE UPDATE STATISTICS SET
> ---
>
> Key: SPARK-17282
> URL: https://issues.apache.org/jira/browse/SPARK-17282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Users can change the statistics by the DDL statement:
> {noformat}
> ALTER TABLE UPDATE STATISTICS SET
> {noformat}






[jira] [Created] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET

2016-08-28 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17282:
---

 Summary: Implement ALTER TABLE UPDATE STATISTICS SET
 Key: SPARK-17282
 URL: https://issues.apache.org/jira/browse/SPARK-17282
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li


Users can change the statistics by the DDL statement:
{noformat}
ALTER TABLE UPDATE STATISTICS SET
{noformat}






[jira] [Resolved] (SPARK-12619) Combine small files in a hadoop directory into single split

2016-08-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12619.
---
Resolution: Duplicate

> Combine small files in a hadoop directory into single split 
> 
>
> Key: SPARK-12619
> URL: https://issues.apache.org/jira/browse/SPARK-12619
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Navis
>Priority: Trivial
>
> When a directory contains too many (small) files, the whole Spark cluster will 
> be exhausted scheduling the tasks created for each file. A custom input format 
> can handle that, but if you're using the Hive metastore, that is hardly an 
> option.






[jira] [Resolved] (SPARK-17214) How to deal with dots (.) present in column names in SparkR

2016-08-28 Thread Mohit Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Bansal resolved SPARK-17214.
--
Resolution: Later

> How to deal with dots (.) present in column names in SparkR
> ---
>
> Key: SPARK-17214
> URL: https://issues.apache.org/jira/browse/SPARK-17214
> Project: Spark
>  Issue Type: Bug
>Reporter: Mohit Bansal
>
> I am trying to load a local CSV file, which contains dots in its column names, 
> into SparkR. After reading the file I tried to change the names and replaced 
> "." with "_". Still, I am not able to do any operation on the created SDF. 
> Here is the reproducible code:
> ---
> #writing iris dataset to local
> write.csv(iris,"iris.csv",row.names=F)
> #reading it back using read.df
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> #changing column names
> names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species")
> #selecting required columns
> head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width))
> -
> 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length 
> given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species];
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$cl
> What should I do to get it to work?






[jira] [Commented] (SPARK-16647) sparksql1.6.2 on yarn with hive metastore1.0.0 throws "alter_table_with_cascade" exception

2016-08-28 Thread fengchaoge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15443140#comment-15443140
 ] 

fengchaoge commented on SPARK-16647:


Have you resolved this problem? I have the same problem; my Hive version is 
0.13.1.

> sparksql1.6.2 on yarn with hive metastore1.0.0 throws 
> "alter_table_with_cascade" exception
> -
>
> Key: SPARK-16647
> URL: https://issues.apache.org/jira/browse/SPARK-16647
> Project: Spark
>  Issue Type: Bug
>Reporter: zhangshuxin
>
> My Spark version is 1.6.2 (also 1.5.2, 1.5.0) and my Hive version is 1.0.0.
> When I execute some SQL like 'create table tbl1 as select * from tbl2' or 
> 'insert overwrite table tbl1 select * from tbl2', I get the following 
> exception:
> 16/07/20 10:14:13 WARN metastore.RetryingMetaStoreClient: MetaStoreClient 
> lost connection. Attempting to reconnect.
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'alter_table_with_cascade'
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_cascade(ThriftHiveMetastore.java:1374)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.alter_table_with_cascade(ThriftHiveMetastore.java:1358)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table(HiveMetaStoreClient.java:340)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table(SessionHiveMetaStoreClient.java:251)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy27.alter_table(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:496)
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:441)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> 

[jira] [Closed] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-28 Thread huangyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangyu closed SPARK-15044.
---

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (or as it did in 
> earlier versions), rather than throw an exception.
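For anyone who hits this before a fix lands, here is a minimal SparkR sketch of one possible interim workaround, assuming a SparkR 1.x-style HiveContext and the table/partition names from the steps above; this is only an illustration, not the fix requested in this ticket. The idea is to drop the stale partition from the metastore so the deleted directory is no longer scanned.

{code}
# Hedged sketch: assumes SparkR 1.x (sparkR.init / sparkRHive.init) and the
# partitioned "test" table from the reproduction steps above.
library(SparkR)

sc <- sparkR.init()
hiveContext <- sparkRHive.init(sc)

# Drop the partition whose directory was removed by hand, so scans no longer
# expect ./test/p=1 to exist on the filesystem.
sql(hiveContext, "ALTER TABLE test DROP IF EXISTS PARTITION (p='1')")

# Queries over the remaining partitions should then run without the
# InvalidInputException shown above.
head(sql(hiveContext, "SELECT n FROM test"))
{code}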



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17214) How to deal with dots (.) present in column names in SparkR

2016-08-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442870#comment-15442870
 ] 

Felix Cheung edited comment on SPARK-17214 at 8/28/16 6:14 AM:
---

I think the underlying issue is that we should either handle column names with 
`.` correctly (preferred) or translate them uniformly as in other cases (e.g. 
`as.DataFrame`).

As of now a DataFrame from a csv source can have `.` in column names and it is 
inoperable until renamed:
{code}
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> iris_sdf
SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, 
Petal.Width:double, Species:string]
> head(select(iris_sdf,iris_sdf$Sepal.Length))
16/08/28 06:11:16 ERROR RBackendHandler: col on 46 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, 
Species);
{code}
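As a side note on the `as.DataFrame` comparison above, here is a minimal sketch, under the assumption (not verified against this exact version) that converting a local data.frame still rewrites `.` in column names to `_` with a warning; that is the uniform translation the csv path currently does not apply.

{code}
# Hedged sketch: assumes SparkR 2.0-style API and that as.DataFrame() rewrites
# "." in local data.frame column names to "_" (emitting a warning).
library(SparkR)

local_iris_sdf <- as.DataFrame(iris)
# Expected schema names along the lines of:
#   Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, Species
printSchema(local_iris_sdf)
{code}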


was (Author: felixcheung):
I think the underlying issue is that we should either handle column names with 
`.` correctly (preferred) or translate them uniformly as in other cases (e.g. 
`as.DataFrame`).

As of now a DataFrame from a csv source can have `.` in column names and it is 
inoperable until renamed:
{code}
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> iris_sdf
SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, 
Petal.Width:double, Species:string]
{code}

> How to deal with dots (.) present in column names in SparkR
> ---
>
> Key: SPARK-17214
> URL: https://issues.apache.org/jira/browse/SPARK-17214
> Project: Spark
>  Issue Type: Bug
>Reporter: Mohit Bansal
>
> I am trying to load a local csv file, which contains dots in its column names, 
> into SparkR. After reading the file I tried to change the names, replacing "." 
> with "_". Still, I am not able to do any operation on the created SDF. 
> Here is the reproducible code:
> ---
> #writing iris dataset to local
> write.csv(iris,"iris.csv",row.names=F)
> #reading it back using read.df
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> #changing column names
> names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species")
> #selecting required columns
> head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width))
> -
> 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length 
> given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species];
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$cl
> What should I do to get it to work?
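A hedged SparkR sketch of two possible workarounds for the question above, assuming the iris.csv written earlier and Spark 2.0-style SparkR; neither is confirmed as the intended fix for this ticket.

{code}
# Hedged sketch: assumes the iris.csv produced in the steps above and
# Spark 2.0-style SparkR (read.df without an explicit sqlContext).
library(SparkR)

iris_sdf <- read.df("iris.csv", "csv", header = "true", inferSchema = "true")

# Option 1: bulk-rename the columns by position (as also shown in the comments),
# then use the underscore names with the $ operator.
colnames(iris_sdf) <- c("Sepal_Length", "Sepal_Width", "Petal_Length",
                        "Petal_Width", "Species")
head(select(iris_sdf, iris_sdf$Sepal_Length, iris_sdf$Sepal_Width))

# Option 2 (assumption): keep the original names and quote them with backticks
# in SQL expressions instead of using the $ operator.
iris_sdf_raw <- read.df("iris.csv", "csv", header = "true", inferSchema = "true")
head(selectExpr(iris_sdf_raw, "`Sepal.Length`", "`Sepal.Width`"))
{code}

Option 1 mirrors the names()<- rename used in the comments; Option 2 relies on SQL identifier quoting and is an assumption about how backticked names are parsed, not something confirmed in this ticket.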



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17214) How to deal with dots (.) present in column names in SparkR

2016-08-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442870#comment-15442870
 ] 

Felix Cheung edited comment on SPARK-17214 at 8/28/16 6:15 AM:
---

I think the underlying issue is that we should either handle column names with 
`.` correctly (preferred) or translate them uniformly as in other cases (e.g. 
`as.DataFrame`).

As of now a DataFrame from a csv source can have `.` in column names and it is 
inoperable until renamed (which is a known issue):
{code}
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> iris_sdf
SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, 
Petal.Width:double, Species:string]
> head(select(iris_sdf,iris_sdf$Sepal.Length))
16/08/28 06:11:16 ERROR RBackendHandler: col on 46 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, 
Species);
{code}


was (Author: felixcheung):
I think the underlying issue is that we should either handle column names with 
`.` correctly (preferred) or translate them uniformly as in other cases (e.g. 
`as.DataFrame`).

As of now a DataFrame from a csv source can have `.` in column names and it is 
inoperable until renamed:
{code}
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> iris_sdf
SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, 
Petal.Width:double, Species:string]
> head(select(iris_sdf,iris_sdf$Sepal.Length))
16/08/28 06:11:16 ERROR RBackendHandler: col on 46 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Cannot resolve column name 
"Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, 
Species);
{code}

> How to deal with dots (.) present in column names in SparkR
> ---
>
> Key: SPARK-17214
> URL: https://issues.apache.org/jira/browse/SPARK-17214
> Project: Spark
>  Issue Type: Bug
>Reporter: Mohit Bansal
>
> I am trying to load a local csv file, which contains dots in its column names, 
> into SparkR. After reading the file I tried to change the names, replacing "." 
> with "_". Still, I am not able to do any operation on the created SDF. 
> Here is the reproducible code:
> ---
> #writing iris dataset to local
> write.csv(iris,"iris.csv",row.names=F)
> #reading it back using read.df
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> #changing column names
> names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species")
> #selecting required columns
> head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width))
> -
> 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length 
> given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species];
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$cl
> What should I do to get it to work?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17214) How to deal with dots (.) present in column names in SparkR

2016-08-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442870#comment-15442870
 ] 

Felix Cheung commented on SPARK-17214:
--

I think the underlying issue is that we should either handle column names with 
`.` correctly (preferred) or translate them uniformly as in other cases (e.g. 
`as.DataFrame`).

As of now a DataFrame from a csv source can have `.` in column names and it is 
inoperable until renamed:
{code}
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> iris_sdf
SparkDataFrame[Sepal.Length:double, Sepal.Width:double, Petal.Length:double, 
Petal.Width:double, Species:string]
{code}

> How to deal with dots (.) present in column names in SparkR
> ---
>
> Key: SPARK-17214
> URL: https://issues.apache.org/jira/browse/SPARK-17214
> Project: Spark
>  Issue Type: Bug
>Reporter: Mohit Bansal
>
> I am trying to load a local csv file, which contains dots in its column names, 
> into SparkR. After reading the file I tried to change the names, replacing "." 
> with "_". Still, I am not able to do any operation on the created SDF. 
> Here is the reproducible code:
> ---
> #writing iris dataset to local
> write.csv(iris,"iris.csv",row.names=F)
> #reading it back using read.df
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> #changing column names
> names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species")
> #selecting required columns
> head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width))
> -
> 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length 
> given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species];
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$cl
> What should I do to get it to work?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17214) How to deal with dots (.) present in column names in SparkR

2016-08-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15442865#comment-15442865
 ] 

Felix Cheung commented on SPARK-17214:
--

[~bansalism] What version of Spark + SparkR are you testing with?
I ran your example and it worked:

{code}
> #writing iris dataset to local
> write.csv(iris,"iris.csv",row.names=F)
> #reading it back using read.df
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> #changing column names
> names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species")
> iris_sdf
SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
Petal_Width:double, Species:string]
> head(iris_sdf)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> a <- select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width)
> head(a)
  Sepal_Length Sepal_Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9
>
{code}

> How to deal with dots (.) present in column names in SparkR
> ---
>
> Key: SPARK-17214
> URL: https://issues.apache.org/jira/browse/SPARK-17214
> Project: Spark
>  Issue Type: Bug
>Reporter: Mohit Bansal
>
> I am trying to load a local csv file, which contains dots in its column names, 
> into SparkR. After reading the file I tried to change the names, replacing "." 
> with "_". Still, I am not able to do any operation on the created SDF. 
> Here is the reproducible code:
> ---
> #writing iris dataset to local
> write.csv(iris,"iris.csv",row.names=F)
> #reading it back using read.df
> iris_sdf<-read.df("iris.csv","csv",header="true",inferSchema="true")
> #changing column names
> names(iris_sdf)<-c("Sepal_Length","Sepal_Width","Petal_Length","Petal_Width","Species")
> #selecting required columns
> head(select(iris_sdf,iris_sdf$Sepal_Length,iris_sdf$Sepal_Width))
> -
> 16/08/24 13:51:24 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Unable to resolve Sepal.Length 
> given [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species];
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$cl
> What should I do to get it to work?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org