[jira] [Commented] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error

2016-03-19 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198280#comment-15198280
 ] 

Xiu (Joe) Guo commented on SPARK-12981:
---

Yes [~fabboe], my PR will fix your scenario too.

> Dataframe distinct() followed by a filter(udf) in pyspark throws a casting 
> error
> 
>
> Key: SPARK-12981
> URL: https://issues.apache.org/jira/browse/SPARK-12981
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8)
>Reporter: Tom Arnfeld
>Priority: Critical
>
> We noticed a regression when testing out an upgrade of Spark 1.6 for our 
> systems, where pyspark throws a casting exception when using `filter(udf)` 
> after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5.
> Here's a little notebook that demonstrates the exception clearly... 
> https://gist.github.com/tarnfeld/ab9b298ae67f697894cd
> For the sake of completeness here, the following code will throw an exception:
> {code}
> data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
> {code}
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
> org.apache.spark.sql.catalyst.plans.logical.Aggregate
> {code}
> Whereas not using a UDF does not throw any errors...
> {code}
> data.select(col("a")).distinct().filter("a = 1").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13366) Support Cartesian join for Datasets

2016-02-17 Thread Xiu (Joe) Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiu (Joe) Guo updated SPARK-13366:
--
Description: 
Saw a comment from [~marmbrus] regarding Cartesian join for Datasets:

"You will get a cartesian if you do a join/joinWith using lit(true) as the 
condition.  We could consider adding an API for doing that more concisely."

  was:
Saw a comment from [~marmbrus] about this:

"You will get a cartesian if you do a join/joinWith using lit(true) as the 
condition.  We could consider adding an API for doing that more concisely."


> Support Cartesian join for Datasets
> ---
>
> Key: SPARK-13366
> URL: https://issues.apache.org/jira/browse/SPARK-13366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiu (Joe) Guo
>Priority: Minor
>
> Saw a comment from [~marmbrus] regarding Cartesian join for Datasets:
> "You will get a cartesian if you do a join/joinWith using lit(true) as the 
> condition.  We could consider adding an API for doing that more concisely."
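For reference, a minimal sketch of the lit(true) workaround described above, using two small illustrative Datasets in spark-shell (names and data are assumptions for the example):

{code}
import org.apache.spark.sql.functions.lit
import sqlContext.implicits._   // spark-shell implicits, needed for toDS()

// Two small example Datasets.
val left  = Seq(1, 2, 3).toDS()
val right = Seq("a", "b").toDS()

// joinWith with a literal-true condition yields the full Cartesian product:
// (1,a), (1,b), (2,a), (2,b), (3,a), (3,b)
val cartesian = left.joinWith(right, lit(true))
cartesian.show()
{code}

A dedicated API would let callers express the same Cartesian join without supplying the dummy condition.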



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13366) Support Cartesian join for Datasets

2016-02-17 Thread Xiu (Joe) Guo (JIRA)
Xiu (Joe) Guo created SPARK-13366:
-

 Summary: Support Cartesian join for Datasets
 Key: SPARK-13366
 URL: https://issues.apache.org/jira/browse/SPARK-13366
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiu (Joe) Guo
Priority: Minor


Saw a comment from Michael about this:

"You will get a cartesian if you do a join/joinWith using lit(true) as the 
condition.  We could consider adding an API for doing that more concisely."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-16 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149387#comment-15149387
 ] 

Xiu (Joe) Guo commented on SPARK-13283:
---

Yes, it is a different problem from 
[SPARK-13297|https://issues.apache.org/jira/browse/SPARK-13297]. We should 
escape the column name based on JdbcDialect.
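To illustrate (a sketch only, using a hypothetical helper rather than Spark's actual code), the generated DDL would quote each column name in the identifier-quoting style of the target dialect:

{code}
// Hypothetical helper: quote a column name for the target database before it
// is embedded in the generated CREATE TABLE statement, so reserved words such
// as "from" no longer break the SQL.
def quoteIdentifier(url: String, name: String): String =
  if (url.startsWith("jdbc:mysql"))
    "`" + name.replace("`", "``") + "`"        // MySQL-style backticks
  else
    "\"" + name.replace("\"", "\"\"") + "\""   // ANSI double quotes

// quoteIdentifier("jdbc:mysql://host/db", "from") returns `from`, so the column
// definition becomes `from` DECIMAL(20,0) instead of the invalid from DECIMAL(20,0)
{code}

In Spark, that quoting decision would naturally live on the JdbcDialect resolved from the connection URL.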

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I'm saving it to a MySQL database I'm getting this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table.
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF

2016-02-12 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145173#comment-15145173
 ] 

Xiu (Joe) Guo commented on SPARK-13301:
---

Hi Simone,

How long are the strings in col1 for each row? Can you run:

{code}
myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show(3, False)
{code}

> PySpark Dataframe return wrong results with custom UDF
> --
>
> Key: SPARK-13301
> URL: https://issues.apache.org/jira/browse/SPARK-13301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: PySpark in yarn-client mode - CDH 5.5.1
>Reporter: Simone
>Priority: Critical
>
> Using a User Defined Function in PySpark inside the withColumn() method of a 
> DataFrame gives wrong results.
> Here is an example:
> {code}
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> {code}
> {code}
> |col1|   col2|col3|
> |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...|   Cristina|4f008903600a0133e...|
> {code}
> The results are wrong and seem to be random: some records are OK (for example 
> the third), some others are not (for example the first 2).
> The problem does not seem to occur with Spark built-in functions:
> {code}
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> {code}
> Without the withColumn() method, the results seem to be always correct:
> {code}
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> {code}
> This can be considered only in part a workaround, because you have to list 
> all the columns of your DataFrame each time.
> Also, the problem does not seem to occur in Scala/Java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13297) [SQL] Backticks cannot be escaped in column names

2016-02-12 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145566#comment-15145566
 ] 

Xiu (Joe) Guo commented on SPARK-13297:
---

Looks like in the current [master 
branch|https://github.com/apache/spark/tree/42d656814f756599a2bc426f0e1f32bd4cc4470f],
 this problem is fixed.

{code}
scala> val columnName = "col`s"
columnName: String = col`s

scala> val rows = List(Row("foo"), Row("bar"))
rows: List[org.apache.spark.sql.Row] = List([foo], [bar])

scala> val schema = StructType(Seq(StructField(columnName, StringType)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(col`s,StringType,true))

scala> val rdd = sc.parallelize(rows)
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at <console>:28

scala> val df = sqlContext.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [col`s: string]

scala> val selectingColumnName = "`" + columnName.replace("`", "``") + "`"
selectingColumnName: String = `col``s`

scala> selectingColumnName
res0: String = `col``s`

scala> val selectedDf = df.selectExpr(selectingColumnName)
selectedDf: org.apache.spark.sql.DataFrame = [col`s: string]

scala> selectedDf.show
+-----+
|col`s|
+-----+
|  foo|
|  bar|
+-----+
{code}

> [SQL] Backticks cannot be escaped in column names
> -
>
> Key: SPARK-13297
> URL: https://issues.apache.org/jira/browse/SPARK-13297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Minor
>
> We want to use backticks to escape spaces & minus signs in column names.
> Are we unable to escape backticks when a column name is surrounded by 
> backticks?
> It is not documented in: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html
> In MySQL there is a way: double the backticks, but this trick doesn't work in 
> Spark-SQL.
> Am I correct or just missing something? Is there a way to escape backticks 
> inside a column name when it is surrounded by backticks?
> Code to reproduce the problem:
> https://github.com/grzegorz-chilkiewicz/SparkSqlEscapeBacktick



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9414) HiveContext:saveAsTable creates wrong partition for existing hive table(append mode)

2016-02-03 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131054#comment-15131054
 ] 

Xiu (Joe) Guo commented on SPARK-9414:
--

With the current master 
[b938301|https://github.com/apache/spark/commit/b93830126cc59a26e2cfb5d7b3c17f9cfbf85988],
 I could not reproduce this issue by doing:

From Hive 1.2.1 CLI:
{code}
create table test4DimBySpark (mydate int, hh int, x int, y int, height float, u 
float, v float, w float, ph float, phb float, p float, pb float, qva float, por 
float, qgraup float, qnice float, qnrain float, tke_pbl float, el_pbl float) 
partitioned by (zone int, z int, year int, month int);
{code}

In spark-shell, I used the first block of Scala code from the description to insert 
data.

I see correct partition directories in /user/hive/warehouse and Hive can read 
the data back fine.

Can you check with the newer versions of the code? It's probably fixed.
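
For completeness, the partition spec can also be checked from the same spark-shell session (a sketch, assuming the HiveContext and table name from the description):

{code}
// After running the insert from the description, list the partitions that
// were actually created for the table.
sqlContext.sql("show partitions test4DimBySpark").show(100, false)
// Expected, matching the raw data: zone=2/z=42/year=2009/month=3
// (rather than the wrong zone=13195/z=0/year=0/month=0 from the report).
{code}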

> HiveContext:saveAsTable creates wrong partition for existing hive 
> table(append mode)
> 
>
> Key: SPARK-9414
> URL: https://issues.apache.org/jira/browse/SPARK-9414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hadoop 2.6, Spark 1.4.0, Hive 0.14.0.
>Reporter: Chetan Dalal
>Priority: Critical
>
> Raising this bug because I found this issue was already reported on the Apache mail 
> archive and I am facing a similar issue.
> ---original--
> I am using spark 1.4 and HiveContext to append data into a partitioned
> hive table. I found that the data inserted into the table is correct, but the
> partition(folder) created is totally wrong.
> {code}
>  val schemaString = "zone z year month date hh x y height u v w ph phb 
> p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"
> val schema =
>   StructType(
> schemaString.split(" ").map(fieldName =>
>   if (fieldName.equals("zone") || fieldName.equals("z") ||
> fieldName.equals("year") || fieldName.equals("month") ||
>   fieldName.equals("date") || fieldName.equals("hh") ||
> fieldName.equals("x") || fieldName.equals("y"))
> StructField(fieldName, IntegerType, true)
>   else
> StructField(fieldName, FloatType, true)
> ))
> val pairVarRDD =
> sc.parallelize(Seq((Row(2,42,2009,3,1,0,218,365,9989.497.floatValue(),29.627113.floatValue(),19.071793.floatValue(),0.11982734.floatValue(),3174.6812.floatValue(),
> 97735.2.floatValue(),16.389032.floatValue(),-96.62891.floatValue(),25135.365.floatValue(),2.6476808E-5.floatValue(),0.0.floatValue(),13195.351.floatValue(),
> 0.0.floatValue(),0.1.floatValue(),0.0.floatValue()))
> ))
> val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
> partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
> .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("test4DimBySpark")
> {code}
> -
> The table contains 23 columns (more than the Tuple maximum length), so I
> use Row Object to store raw data, not Tuple.
> Here is some message from spark when it saved data>>
> {code}
> 
> 15/06/16 10:39:22 INFO metadata.Hive: Renaming
> src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0/part-1;dest:
> hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1;Status:true
> 
> 15/06/16 10:39:22 INFO metadata.Hive: New loading path =
> hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-1/zone=13195/z=0/year=0/month=0
> with partSpec {zone=13195, z=0, year=0, month=0}
> 
> From the raw data (pairVarRDD) zone = 2, z = 42, year = 2009, month =
> 3. But spark created a partition {zone=13195, z=0, year=0, month=0}. (x)
> 
> When I queried from hive>>
> 
> hive> select * from test4dimBySpark;
> OK
> 242200931.00.0218.0365.09989.497
> 29.62711319.0717930.11982734-3174.681297735.2 16.389032
> -96.6289125135.3652.6476808E-50.0 13195000
> hive> select zone, z, year, month from test4dimBySpark;
> OK
> 13195000
> hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
> Found 2 items
> -rw-r--r--   3 patcharee hdfs   1411 2015-06-16 10:39
> /apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-1
> 
> The data stored in the table is correct zone = 2, z = 42, year = 2009,
> month = 3, but the partition created was wrong
> "zone=13195/z=0/year=0/month=0" 

[jira] [Commented] (SPARK-12262) describe extended doesn't return table on detail info tabled stored as PARQUET format

2016-01-14 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15099007#comment-15099007
 ] 

Xiu (Joe) Guo commented on SPARK-12262:
---

You might want to check out this JIRA:

https://issues.apache.org/jira/browse/SPARK-6413

> describe extended doesn't return table on detail info tabled stored as 
> PARQUET format
> -
>
> Key: SPARK-12262
> URL: https://issues.apache.org/jira/browse/SPARK-12262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: pin_zhang
>
> 1. start hive server with start-thriftserver.sh
> 2. create table table1 (id  int) ;
> create table table2(id  int) STORED AS PARQUET;
> 3. describe extended table1 ;
> return detailed info
> 4. describe extended table2 ;
> result has no detailed info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12521) DataFrame Partitions in java does not work

2015-12-25 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071703#comment-15071703
 ] 

Xiu (Joe) Guo commented on SPARK-12521:
---

Thanks [~hvanhovell] for clarifying this.

Maybe it is a good idea to make the doc clearer here and explicitly mention that the 
bounds are not supposed to be filters?

[https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/sql/DataFrameReader.html#jdbc(java.lang.String,%20java.lang.String,%20java.lang.String,%20long,%20long,%20int,%20java.util.Properties)]

> DataFrame Partitions in java does not work
> --
>
> Key: SPARK-12521
> URL: https://issues.apache.org/jira/browse/SPARK-12521
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.5.2
>Reporter: Sergey Podolsky
>
> Hello,
> Partitioning does not work in the Java interface of the DataFrame:
> {code}
> SQLContext sqlContext = new SQLContext(sc);
> Map<String, String> options = new HashMap<>();
> options.put("driver", ORACLE_DRIVER);
> options.put("url", ORACLE_CONNECTION_URL);
> options.put("dbtable",
> "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt");
> options.put("lowerBound", "2704225000");
> options.put("upperBound", "2704226000");
> options.put("partitionColumn", "ID");
> options.put("numPartitions", "10");
> DataFrame jdbcDF = sqlContext.load("jdbc", options);
> List<Row> jobsRows = jdbcDF.collectAsList();
> System.out.println(jobsRows.size());
> {code}
> gives  while expected 1000. Is it because of the big decimal boundaries, or do 
> partitions not work at all in Java?
> Thanks.
> Sergey



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12521) DataFrame Partitions in java does not work

2015-12-24 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071347#comment-15071347
 ] 

Xiu (Joe) Guo commented on SPARK-12521:
---

In 1.5.2 {code}sqlContext.load(){code} is deprecated, but I can still reproduce the 
issue with {code}sqlContext.read.jdbc(){code}.
I don't think it is the size of your numbers: I can reproduce it with small 
integers given as lowerBound/upperBound in my setup. Can you try adding 
"L" at the end of your numbers to verify that it still gives wrong results?

I think the problem is that the lowerBound and upperBound are not honored here; 
Spark just retrieves every row instead of the 1001 rows bounded in your case.
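
To make the semantics concrete, here is a small Scala sketch (driver, URL, and table are placeholders standing in for the report's setup) of what the bounds actually control:

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("driver", "oracle.jdbc.OracleDriver")   // placeholder driver

// lowerBound/upperBound only describe how the partitionColumn range is split
// into numPartitions stride-based WHERE clauses; they are not row filters.
val jdbcDF = sqlContext.read.jdbc(
  "jdbc:oracle:thin:@//host:1521/service",   // placeholder URL
  "(SELECT * FROM JOBS) tt",                 // simplified dbtable subquery
  "ID",                                      // partitionColumn
  2704225000L,                               // lowerBound
  2704226000L,                               // upperBound
  10,                                        // numPartitions
  props)

// Rows with ID outside [lowerBound, upperBound] are still returned; they simply
// land in the first and last partitions, so the count covers the whole query.
println(jdbcDF.count())
{code}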

> DataFrame Partitions in java does not work
> --
>
> Key: SPARK-12521
> URL: https://issues.apache.org/jira/browse/SPARK-12521
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.5.2
>Reporter: Sergey Podolsky
>
> Hello,
> Partitioning does not work in the Java interface of the DataFrame:
> {code}
> SQLContext sqlContext = new SQLContext(sc);
> Map<String, String> options = new HashMap<>();
> options.put("driver", ORACLE_DRIVER);
> options.put("url", ORACLE_CONNECTION_URL);
> options.put("dbtable",
> "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt");
> options.put("lowerBound", "2704225000");
> options.put("upperBound", "2704226000");
> options.put("partitionColumn", "ID");
> options.put("numPartitions", "10");
> DataFrame jdbcDF = sqlContext.load("jdbc", options);
> List<Row> jobsRows = jdbcDF.collectAsList();
> System.out.println(jobsRows.size());
> {code}
> gives  while expected 1000. Is it because of the big decimal boundaries, or do 
> partitions not work at all in Java?
> Thanks.
> Sergey



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12262) describe extended doesn't return table on detail info tabled stored as PARQUET format

2015-12-23 Thread Xiu (Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070503#comment-15070503
 ] 

Xiu (Joe) Guo commented on SPARK-12262:
---

The property {code}spark.sql.hive.convertMetastoreParquet{code} makes Spark access 
Parquet tables with its built-in Parquet support instead of going through the Hive 
metastore route. I am wondering whether the `describe extended` behavior here is intended.

A workaround (or intended usage) would be:
{code}
set spark.sql.hive.convertMetastoreParquet=false;
{code}
before describing a parquet table.
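
For example, from spark-shell with a HiveContext (a sketch of the same workaround; table2 is the Parquet table from the report):

{code}
// Turn off the built-in Parquet conversion so that DESCRIBE EXTENDED goes
// through the Hive metastore path and returns the detailed table information.
sqlContext.sql("set spark.sql.hive.convertMetastoreParquet=false")
sqlContext.sql("describe extended table2").show(100, false)
{code}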

> describe extended doesn't return table on detail info tabled stored as 
> PARQUET format
> -
>
> Key: SPARK-12262
> URL: https://issues.apache.org/jira/browse/SPARK-12262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: pin_zhang
>
> 1. start hive server with start-thriftserver.sh
> 2. create table table1 (id  int) ;
> create table table2(id  int) STORED AS PARQUET;
> 3. describe extended table1 ;
> return detailed info
> 4. describe extended table2 ;
> result has no detailed info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-28 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030615#comment-15030615
 ] 

Xiu(Joe) Guo commented on SPARK-12030:
--

I tried your scenario with some TPCDS tables last night, joined on integer 
columns, but could not reproduce the incorrect results.

Does your table have very large integer values which might overflow?

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2, id2, 0, size1, 200).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here magic begins - I counted distinct id1 from joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9701) allow not automatically using HiveContext with spark-shell when hive support built in

2015-11-28 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030645#comment-15030645
 ] 

Xiu(Joe) Guo commented on SPARK-9701:
-

[~yhuai][~lian cheng] Would you mind reviewing my PR for SPARK-11562 and giving 
me some feedback?

Thanks!

> allow not automatically using HiveContext with spark-shell when hive support 
> built in
> -
>
> Key: SPARK-9701
> URL: https://issues.apache.org/jira/browse/SPARK-9701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Thomas Graves
>
> I build the spark jar with hive support as most of our grids have Hive.  We 
> were bringing up a new YARN cluster that didn't have hive installed on it yet 
> which results in the spark-shell failing to launch:
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:374)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:116)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:168)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> It would be nice to have a config or something  to tell it not to instantiate 
> a HiveContext



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9701) allow not automatically using HiveContext with spark-shell when hive support built in

2015-11-28 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030645#comment-15030645
 ] 

Xiu(Joe) Guo edited comment on SPARK-9701 at 11/28/15 7:44 PM:
---

[~yhuai], [~lian cheng] Would you mind reviewing my PR for SPARK-11562 and giving 
me some feedback?

Thanks!


was (Author: xguo27):
[~yhuai][~lian cheng] Would you mind reviewing my PR for SPARK-11562 and give 
me some feedback?

Thanks!

> allow not automatically using HiveContext with spark-shell when hive support 
> built in
> -
>
> Key: SPARK-9701
> URL: https://issues.apache.org/jira/browse/SPARK-9701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Thomas Graves
>
> I build the spark jar with hive support as most of our grids have Hive.  We 
> were bringing up a new YARN cluster that didn't have hive installed on it yet 
> which results in the spark-shell failing to launch:
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:374)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:116)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:168)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> It would be nice to have a config or something  to tell it not to instantiate 
> a HiveContext



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9701) allow not automatically using HiveContext with spark-shell when hive support built in

2015-11-28 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030641#comment-15030641
 ] 

Xiu(Joe) Guo commented on SPARK-9701:
-

I think this is the same issue as SPARK-11562.

> allow not automatically using HiveContext with spark-shell when hive support 
> built in
> -
>
> Key: SPARK-9701
> URL: https://issues.apache.org/jira/browse/SPARK-9701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Thomas Graves
>
> I build the spark jar with hive support as most of our grids have Hive.  We 
> were bringing up a new YARN cluster that didn't have hive installed on it yet 
> which results in the spark-shell failing to launch:
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:374)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:116)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:168)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> It would be nice to have a config or something  to tell it not to instantiate 
> a HiveContext



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

2015-11-26 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029226#comment-15029226
 ] 

Xiu(Joe) Guo commented on SPARK-6644:
-

With the current master branch code line (1.6.0-snapshot), this issue cannot be 
reproduced anymore.

{panel}
scala> sqlContext.sql("DROP TABLE IF EXISTS table_with_partition ")
res6: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS table_with_partition (key 
INT, value STRING) PARTITIONED BY (ds STRING)")
res7: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION 
(ds = '1') SELECT key, value FROM testData")
res8: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("select * from table_with_partition")
res9: org.apache.spark.sql.DataFrame = [key: int, value: string, ds: string]

scala> sqlContext.sql("select * from table_with_partition").show
+---+-----+---+
|key|value| ds|
+---+-----+---+
|  1|    1|  1|
|  2|    2|  1|
+---+-----+---+

scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 
STRING)")
res11: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng 
DOUBLE)") 
res12: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION 
(ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
res13: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("SELECT * FROM table_with_partition").show
+---+-----+----+-------+---+
|key|value|key1|destlng| ds|
+---+-----+----+-------+---+
|  1|    1|test|   1.11|  1|
|  2|    2|test|   1.11|  1|
+---+-----+----+-------+---+
{panel}

> After adding new columns to a partitioned table and inserting data to an old 
> partition, data of newly added columns are all NULL
> 
>
> Key: SPARK-6644
> URL: https://issues.apache.org/jira/browse/SPARK-6644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: dongxu
>
> In Hive, the schema of a partition may differ from the table schema. For 
> example, we may add new columns to the table after importing existing 
> partitions. When using {{spark-sql}} to query the data in a partition whose 
> schema is different from the table schema, problems may arise. Part of them 
> have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
> However, after adding new column(s) to the table, when inserting data into 
> old partitions, values of newly added columns are all {{NULL}}.
> The following snippet can be used to reproduce this issue:
> {code}
> case class TestData(key: Int, value: String)
> val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => 
> TestData(i, i.toString))).toDF()
> testData.registerTempTable("testData")
> sql("DROP TABLE IF EXISTS table_with_partition ")
> sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) 
> PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
> key, value FROM testData")
> // Add new columns to the table
> sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
> sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)") 
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT 
> key, value, 'test', 1.11 FROM testData")
> sql("SELECT * FROM table_with_partition WHERE ds = 
> '1'").collect().foreach(println)
> {code}
> Actual result:
> {noformat}
> [1,1,null,null,1]
> [2,2,null,null,1]
> {noformat}
> Expected result:
> {noformat}
> [1,1,test,1.11,1]
> [2,2,test,1.11,1]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11631) DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no corresponding "Starting"

2015-11-10 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999101#comment-14999101
 ] 

Xiu(Joe) Guo commented on SPARK-11631:
--

I am looking at it, will submit a PR shortly.

> DAGScheduler prints "Stopping DAGScheduler" at INFO to the logs with no 
> corresponding "Starting"
> 
>
> Key: SPARK-11631
> URL: https://issues.apache.org/jira/browse/SPARK-11631
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
> Environment: Spark sources as of today - revision {{5039a49}}
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> At stop, DAGScheduler prints out {{INFO DAGScheduler: Stopping 
> DAGScheduler}}, but there's no corresponding Starting INFO message. It can be 
> surprising.
> I think Spark should have a change and pick one:
> 1. {{INFO DAGScheduler: Stopping DAGScheduler}} should be DEBUG at the most 
> (or even TRACE)
> 2. {{INFO DAGScheduler: Stopping DAGScheduler}} should have corresponding 
> {{INFO DAGScheduler: Starting DAGScheduler}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11628) spark-sql do not support for column datatype of CHAR

2015-11-10 Thread Xiu(Joe) Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999742#comment-14999742
 ] 

Xiu(Joe) Guo commented on SPARK-11628:
--

Hi Shunyu:

I think you are right about the parser part, but on top of the parser, quite a 
few other places also need new code to handle the 'CHAR' type.

I was looking at the entire stack, trying not to miss anything. Please look at 
my proposed change and see whether it looks good to you.

Thanks!

> spark-sql do not support for column datatype of CHAR
> 
>
> Key: SPARK-11628
> URL: https://issues.apache.org/jira/browse/SPARK-11628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: zhangshunyu
>  Labels: features
>
> In spark-sql, when we create a table using the following command:
>"create table tablename(col char(5));"
> Hive supports creating the table, but when we desc the table:
>"desc tablename"
> spark will report the error:
>“org.apache.spark.sql.types.DataTypeException: Unsupported dataType: 
> char(5). If you have a struct and a field name of it has any special 
> characters, please use backticks (`) to quote that field name, e.g. `x+y`. 
> Please note that backtick itself is not supported in a field name.”



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org