[jira] [Created] (SPARK-6803) Support SparkR Streaming

2015-04-09 Thread Hao (JIRA)
Hao created SPARK-6803:
--

 Summary: Support SparkR Streaming
 Key: SPARK-6803
 URL: https://issues.apache.org/jira/browse/SPARK-6803
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, Streaming
Reporter: Hao
 Fix For: 1.4.0


Adds an R API for Spark Streaming.

An experimental version is presented in repo [1], which follows the PySpark 
streaming design. This PR can be further broken down into subtask issues.

[1] https://github.com/hlin09/spark/tree/SparkR-streaming/ 
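
For context, a minimal sketch of the PySpark Streaming pattern that the proposed 
R API would mirror (illustrative only; the hostname/port source is hypothetical, 
and the actual R surface lives in the repo linked above):

{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 1)                    # 1-second micro-batches

# DStream transformations mirror the RDD API; the SparkR version would follow
# the same shape (textFileStream, flatMap, reduceByKey, ...).
lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
{code}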






[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming

2015-04-09 Thread Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao updated SPARK-6803:
---
Summary: [SparkR] Support SparkR Streaming  (was: Support SparkR Streaming)

> [SparkR] Support SparkR Streaming
> -
>
> Key: SPARK-6803
> URL: https://issues.apache.org/jira/browse/SPARK-6803
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, Streaming
>Reporter: Hao
> Fix For: 1.4.0
>
>
> Adds an R API for Spark Streaming.
> An experimental version is presented in repo [1], which follows the PySpark 
> streaming design. This PR can be further broken down into subtask issues.
> [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ 






[jira] [Created] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Hao (JIRA)
Hao created SPARK-6881:
--

 Summary: Change the checkpoint directory name from checkpoints to 
checkpoint
 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Priority: Trivial


The name "checkpoint" (rather than "checkpoints") is the one included in .gitignore, so the checkpoint directory should be renamed to match it.






[jira] [Created] (SPARK-6991) Adds support for zipPartitions for SparkR

2015-04-17 Thread Hao (JIRA)
Hao created SPARK-6991:
--

 Summary: Adds support for zipPartitions for SparkR
 Key: SPARK-6991
 URL: https://issues.apache.org/jira/browse/SPARK-6991
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Hao
Priority: Minor


Adds support for zipPartitions in SparkR:
zipPartitions(..., func)

This gives users more flexibility to zip multiple RDDs with partition-level 
processing (a potential performance improvement).
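
PySpark does not expose zipPartitions on RDDs; as a rough, illustrative sketch of 
the partition-level behaviour being requested (assuming identically partitioned 
inputs), one can combine the existing zip and mapPartitions:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="ZipPartitionsSketch")

# Two RDDs with the same number of partitions and elements per partition,
# which is what zip() requires.
left = sc.parallelize(range(0, 8), 2)
right = sc.parallelize(range(100, 108), 2)

def per_partition(pairs):
    # Operates on an entire partition of zipped elements at once.
    pairs = list(pairs)
    yield (len(pairs), sum(a + b for a, b in pairs))

print(left.zip(right).mapPartitions(per_partition).collect())
{code}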






[jira] [Commented] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R

2015-06-20 Thread Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594942#comment-14594942
 ] 

Hao commented on SPARK-8506:


This is a good solution. I just ran into the same problem when doing the 
tutorial. 

> SparkR does not provide an easy way to depend on Spark Packages when 
> performing init from inside of R
> -
>
> Key: SPARK-8506
> URL: https://issues.apache.org/jira/browse/SPARK-8506
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: holdenk
>Priority: Minor
>
> While packages can be specified when using the sparkR or sparkSubmit scripts, 
> the programming guide tells people to create their spark context using the R 
> shell + init. The init does have a parameter for jars but no parameter for 
> packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary 
> information. I think a good solution would just be adding another field to 
> the init function to allow people to specify packages in the same way as jars.






[jira] [Created] (SPARK-46354) Add restriction parameters to dynamic partitions for writing datasource

2023-12-10 Thread hao (Jira)
hao created SPARK-46354:
---

 Summary: Add restriction parameters to dynamic partitions for 
writing datasource
 Key: SPARK-46354
 URL: https://issues.apache.org/jira/browse/SPARK-46354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: hao


Add a restriction parameter on dynamic partitions when writing a datasource.

`InsertIntoHiveTable` limits the number of dynamic partitions written in a single 
job, while `InsertIntoHadoopFsRelationCommand` does not; the latter should be 
aligned with the former.
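
A hedged sketch of the asymmetry being described (table and source names are 
hypothetical; hive.exec.max.dynamic.partitions is the Hive-side limit referred 
to for the InsertIntoHiveTable path):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Per this report, only the Hive write path enforces a cap on dynamic partitions.
spark.sql("SET hive.exec.max.dynamic.partitions=100")

# Hive-format table (InsertIntoHiveTable): fails if the query produces
# more than 100 dynamic partitions.
spark.sql("INSERT OVERWRITE TABLE hive_tbl PARTITION (dt) SELECT value, dt FROM src")

# Datasource table, e.g. parquet (InsertIntoHadoopFsRelationCommand):
# no comparable limit is applied today.
spark.sql("INSERT OVERWRITE TABLE parquet_tbl PARTITION (dt) SELECT value, dt FROM src")
{code}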






[jira] [Updated] (SPARK-46354) Add restriction parameters to dynamic partitions for writing datasource

2023-12-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-46354:

Priority: Trivial  (was: Major)

> Add restriction parameters to dynamic partitions for writing datasource
> ---
>
> Key: SPARK-46354
> URL: https://issues.apache.org/jira/browse/SPARK-46354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: hao
>Priority: Trivial
>
> Add a restriction parameter on dynamic partitions when writing a datasource. 
> `InsertIntoHiveTable` limits the number of dynamic partitions written in a 
> single job, while `InsertIntoHadoopFsRelationCommand` does not; the latter 
> should be aligned with the former.






[jira] [Created] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-04 Thread hao (Jira)
hao created SPARK-33982:
---

 Summary: Sparksql does not support when the inserted table is a 
read table
 Key: SPARK-33982
 URL: https://issues.apache.org/jira/browse/SPARK-33982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: hao


When the table being inserted into is also a table being read from, Spark SQL 
throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite a path 
that is also being read from.
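
A minimal sketch of the reported failure (table name is hypothetical; the exact 
exception text may differ by version):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(10).write.saveAsTable("t")

# Reading from and overwriting the same table in one statement is rejected with
# an AnalysisException like "Cannot overwrite a path that is also being read from".
spark.sql("INSERT OVERWRITE TABLE t SELECT * FROM t")
{code}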






[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-04 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258052#comment-17258052
 ] 

hao commented on SPARK-33982:
-

I think Spark SQL should support this.

> Sparksql does not support when the inserted table is a read table
> -
>
> Key: SPARK-33982
> URL: https://issues.apache.org/jira/browse/SPARK-33982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When the table being inserted into is also a table being read from, Spark SQL 
> throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite a 
> path that is also being read from.






[jira] [Comment Edited] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-04 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258052#comment-17258052
 ] 

hao edited comment on SPARK-33982 at 1/4/21, 11:19 AM:
---

I think Spark SQL should support INSERT OVERWRITE into a table that is being read from.


was (Author: hao.duan):
I think Spark SQL should support this.

> Sparksql does not support when the inserted table is a read table
> -
>
> Key: SPARK-33982
> URL: https://issues.apache.org/jira/browse/SPARK-33982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When the table being inserted into is also a table being read from, Spark SQL 
> throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite a 
> path that is also being read from.






[jira] [Created] (SPARK-34006) [spark.sql.hive.convertMetastoreOrc]This parameter can solve orc format table insert overwrite read table, it should be stated in the document

2021-01-04 Thread hao (Jira)
hao created SPARK-34006:
---

 Summary: [spark.sql.hive.convertMetastoreOrc]This parameter can 
solve orc format table insert overwrite read table, it should be stated in the 
document
 Key: SPARK-34006
 URL: https://issues.apache.org/jira/browse/SPARK-34006
 Project: Spark
  Issue Type: Bug
  Components: docs
Affects Versions: 3.0.1
Reporter: hao


This parameter can work around the "insert overwrite a table that is also being 
read" problem for ORC-format tables; this should be stated in the documentation.
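
A hedged sketch of the workaround being described (table name is hypothetical; 
with convertMetastoreOrc disabled, the ORC table goes through the Hive SerDe 
write path, which per this report avoids the overwrite-while-reading check):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Fall back to the Hive SerDe read/write path for ORC tables.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")

# With the conversion disabled, this self-overwrite is reported to succeed.
spark.sql("INSERT OVERWRITE TABLE orc_tbl SELECT * FROM orc_tbl")
{code}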






[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-05 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259391#comment-17259391
 ] 

hao commented on SPARK-33982:
-

[~hyukjin.kwon] Nice, you can read Chinese! Yes, this is the same issue >.<

> Sparksql does not support when the inserted table is a read table
> -
>
> Key: SPARK-33982
> URL: https://issues.apache.org/jira/browse/SPARK-33982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When the table being inserted into is also a table being read from, Spark SQL 
> throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite a 
> path that is also being read from.






[jira] [Created] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread hao (Jira)
hao created SPARK-34406:
---

 Summary: When we submit spark core tasks frequently, the submitted 
nodes will have a lot of resource pressure
 Key: SPARK-34406
 URL: https://issues.apache.org/jira/browse/SPARK-34406
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: hao


When we submit Spark Core tasks frequently, the submitting node comes under heavy 
resource pressure, because Spark creates a process rather than a thread for each 
submitted task, which consumes a lot of resources. When the submission QPS is 
very high, submissions fail due to insufficient resources. I would like to ask 
how to reduce the amount of resources consumed by Spark Core submission.






[jira] [Commented] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281560#comment-17281560
 ] 

hao commented on SPARK-34406:
-

Yes. I'm using YARN cluster mode. What I mean is that when the Spark client 
submits a Spark Core application to the remote YARN cluster, the submission runs 
as a separate process, and that consumes a lot of resources.

> When we submit spark core tasks frequently, the submitted nodes will have a 
> lot of resource pressure
> 
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When we submit Spark Core tasks frequently, the submitting node comes under 
> heavy resource pressure, because Spark creates a process rather than a thread 
> for each submitted task, which consumes a lot of resources. When the 
> submission QPS is very high, submissions fail due to insufficient resources. 
> I would like to ask how to reduce the amount of resources consumed by Spark 
> Core submission.






[jira] [Created] (SPARK-37274) These parameters should be of type long, not int

2021-11-10 Thread hao (Jira)
hao created SPARK-37274:
---

 Summary: These parameters should be of type long, not int
 Key: SPARK-37274
 URL: https://issues.apache.org/jira/browse/SPARK-37274
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: hao


These parameters, [spark.sql.orc.columnarReaderBatchSize], 
[spark.sql.inMemoryColumnarStorage.batchSize], and 
[spark.sql.parquet.columnarReaderBatchSize], should be of type long, not of type 
int. When the user sets a value greater than the maximum value of type int, an 
error is thrown.
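
A minimal sketch of the reported behaviour (the exact error type and message may 
vary; the point is that the value is validated as an int):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 2**31 is one past Int.MaxValue; because the batch-size options are declared as
# int configs, setting a larger value is rejected rather than treated as a long.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", str(2 ** 31))
{code}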






[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should

2021-11-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-37274:

Description: When the value of this parameter is greater than the maximum 
value of the int type, an out-of-bounds error is thrown. The documentation for 
this parameter should warn the user about this risk.  (was: 
These parameters [spark.sql.orc.columnarReaderBatchSize], 
[spark.sql.inMemoryColumnarStorage.batchSize], 
[spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type 
int. When the user sets a value greater than the maximum value of type int, an 
error is thrown)

> When the value of this parameter is greater than the maximum value of int 
> type, the value will be thrown out of bounds. The document description of 
> this parameter should remind the user of this risk point
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> When the value of this parameter is greater than the maximum value of the int 
> type, an out-of-bounds error is thrown. The documentation for this parameter 
> should warn the user about this risk.






[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should

2021-11-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-37274:

Summary: When the value of this parameter is greater than the maximum value 
of int type, the value will be thrown out of bounds. The document description 
of this parameter should remind the user of this risk point  (was: These 
parameters should be of type long, not int)

> When the value of this parameter is greater than the maximum value of int 
> type, the value will be thrown out of bounds. The document description of 
> this parameter should remind the user of this risk point
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> These parameters [spark.sql.orc.columnarReaderBatchSize], 
> [spark.sql.inMemoryColumnarStorage.batchSize], 
> [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of 
> type int. When the user sets a value greater than the maximum value of type 
> int, an error is thrown.






[jira] [Created] (SPARK-42714) Sparksql temporary file conflict

2023-03-07 Thread hao (Jira)
hao created SPARK-42714:
---

 Summary: Sparksql temporary file conflict
 Key: SPARK-42714
 URL: https://issues.apache.org/jira/browse/SPARK-42714
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2
Reporter: hao


When Spark SQL performs an INSERT OVERWRITE, the name of the intermediate 
temporary directory is not unique. As a result, when multiple applications write 
different partitions of the same partitioned table, the applications can delete 
each other's temporary files, which causes task failures.
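
An illustrative sketch of the scenario (table and source names are hypothetical): 
the two statements below are issued by two separate applications against the same 
partitioned table, and both stage output under the table's _temporary/0 directory, 
as shown by the stack trace later in this thread.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Application A
spark.sql("INSERT OVERWRITE TABLE tgt PARTITION (type=1) SELECT value FROM src_a")

# Application B, running concurrently against the same table path; whichever job
# commits or cleans up first can remove staging files the other still needs.
spark.sql("INSERT OVERWRITE TABLE tgt PARTITION (type=2) SELECT value FROM src_b")
{code}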








[jira] [Commented] (SPARK-42714) Sparksql temporary file conflict

2023-03-07 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697760#comment-17697760
 ] 

hao commented on SPARK-42714:
-

This problem causes the task to fail because its temporary files have already 
been deleted. The detailed error is as follows:

User class threw exception: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
at 
org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3698)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
at com.ly.process.SparkSQL.main(SparkSQL.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
Caused by: java.io.FileNotFoundException: File 
/ns-tcly/com//_temporary/0/task_202303070204281920649928402071557_0031_m_001866/type=2
 does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1058)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1118)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1115)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1125)
at 
org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:270)
at 
org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listStatus(ChRootedFileSystem.java:255)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.listStatus(ViewFileSystem.java:411)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:484)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:486)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:403)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:375)
at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220)
... 25 more

> Sparksql temporary file conflict
> 
>
> Key: SPARK-42714
> URL: https://issues.apache.org/jira/browse/SPARK-42714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: hao
>Priority: Major
>
> When sparksql inserts overwrite, the name of the temporary file in the middle 
> is not uniq

[jira] [Created] (SPARK-44483) When using Spark to read the hive table, the number of file partitions cannot be set using Spark's configuration settings

2023-07-19 Thread hao (Jira)
hao created SPARK-44483:
---

 Summary: When using Spark to read the hive table, the number of 
file partitions cannot be set using Spark's configuration settings
 Key: SPARK-44483
 URL: https://issues.apache.org/jira/browse/SPARK-44483
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: hao


When using Spark to read a Hive table, the number of file partitions cannot be 
controlled through Spark's configuration settings.
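
An illustrative sketch of what the report is about (table name is hypothetical): 
the split-size knob one would normally reach for applies to Spark's native file 
sources, and per this report it does not take effect when the table is read 
through the Hive path.

{code}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         # Controls partition sizing for Spark's native file sources only.
         .config("spark.sql.files.maxPartitionBytes", "64m")
         .getOrCreate())

df = spark.table("some_hive_table")
print(df.rdd.getNumPartitions())  # per this report, not influenced by the setting above
{code}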






[jira] [Created] (SPARK-16214) calculate pi is not correct

2016-06-25 Thread Yang Hao (JIRA)
Yang Hao created SPARK-16214:


 Summary: calculate pi is not correct
 Key: SPARK-16214
 URL: https://issues.apache.org/jira/browse/SPARK-16214
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.6.2
Reporter: Yang Hao


As the denominator is n, the iteration time should also be n

pull request is : https://github.com/apache/spark/pull/13910
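
A minimal PySpark analogue of the point being made (illustrative only; the actual 
change targets the Scala SparkPi example in the linked pull request): the number 
of sampled points should match the n used as the denominator.

{code}
import random
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PiSketch")
slices = 2
n = 100000 * slices

def inside(_):
    x, y = random.random() * 2 - 1, random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

# Sample exactly n points so the count matches the denominator below.
count = sc.parallelize(range(1, n + 1), slices).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
{code}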






[jira] [Updated] (SPARK-16214) optimize SparkPi

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Summary: optimize SparkPi  (was: calculate pi is not correct)

> optimize SparkPi
> 
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910






[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Summary: make SparkPi iteration time correct  (was: optimize SparkPi)

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910






[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: SPARK-16214.patch

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910






[jira] [Resolved] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao resolved SPARK-16214.
--
Resolution: Resolved

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910






[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: SPARK-16214.patch

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910






[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: (was: SPARK-16214.patch)

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910






[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the denominator is n, the iteration time should also be n


  was:
As the denominator is n, the iteration time should also be n

pull request is : https://github.com/apache/spark/pull/13910


> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n






[jira] [Updated] (SPARK-16214) make SparkPi iteration number correct

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the denominator is n, the iteration number should also be n

A pull request is https://github.com/apache/spark/pull/13910


  was:
As the denominator is n, the iteration number should also be n



> make SparkPi iteration number correct
> -
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration number should also be n
> A pull request is https://github.com/apache/spark/pull/13910






[jira] [Updated] (SPARK-16214) make SparkPi iteration number correct

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the denominator is n, the iteration number should also be n



  was:
As the denominator is n, the iteration number should also be n

A pull request is https://github.com/apache/spark/pull/13910



> make SparkPi iteration number correct
> -
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration number should also be n






[jira] [Updated] (SPARK-16214) make SparkPi iteration number correct

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: (was: SPARK-16214.patch)

> make SparkPi iteration number correct
> -
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the denominator is n, the iteration number should also be n






[jira] [Updated] (SPARK-16214) fix the denominator of SparkPi

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Summary:  fix the denominator of SparkPi  (was: make SparkPi iteration 
number correct)

>  fix the denominator of SparkPi
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the denominator is n, the iteration number should also be n






[jira] [Updated] (SPARK-16214) fix the denominator of SparkPi

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the iteration number is n - 1, then the denominator would also be n, 



  was:
As the denominator is n, the iteration number should also be n




>  fix the denominator of SparkPi
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the iteration number is n - 1, then the denominator would also be n, 






[jira] [Updated] (SPARK-16214) fix the denominator of SparkPi

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the iteration number is n - 1, the denominator would also be n, 



  was:
As the iteration number is n - 1, then the denominator would also be n, 




>  fix the denominator of SparkPi
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the iteration number is n - 1, the denominator would also be n, 






[jira] [Commented] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir

2016-05-25 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300072#comment-15300072
 ] 

Cheng Hao commented on SPARK-15034:
---

[~yhuai], but it probably does not respect the `hive-site.xml`, and lots of 
users will be impacted by this configuration change; is that acceptable?

> Use the value of spark.sql.warehouse.dir as the warehouse location instead of 
> using hive.metastore.warehouse.dir
> 
>
> Key: SPARK-15034
> URL: https://issues.apache.org/jira/browse/SPARK-15034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf to set 
> warehouse location. We will not use hive.metastore.warehouse.dir.
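
For reference, a minimal sketch of how the warehouse location is set from Spark 
2.0 on (the path is illustrative):

{code}
from pyspark.sql import SparkSession

# spark.sql.warehouse.dir replaces hive.metastore.warehouse.dir as the
# warehouse location setting in Spark 2.0+.
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())
{code}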






[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-07 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318654#comment-15318654
 ] 

Cheng Hao commented on SPARK-15730:
---

[~jameszhouyi], can you please verify this fix?

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 
> 'FILEFORMAT',

[jira] [Created] (SPARK-15859) Optimize the Partition Pruning with Disjunction

2016-06-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-15859:
-

 Summary: Optimize the Partition Pruning with Disjunction
 Key: SPARK-15859
 URL: https://issues.apache.org/jira/browse/SPARK-15859
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical


Currently we cannot optimize partition pruning for disjunctive predicates, for example:

{{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}
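
An illustrative sketch (table name is hypothetical, with part1 as a partition 
column): the disjunction implies a partition-only condition, part1 = 2 OR 
part1 = 5, which could be pushed down to prune partitions.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT *
    FROM t
    WHERE (part1 = 2 AND col1 = 'abc')
       OR (part1 = 5 AND col1 = 'cde')
""").explain()
{code}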






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2016-05-09 Thread Xin Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276109#comment-15276109
 ] 

Xin Hao commented on SPARK-4452:


Since this is an old issue that has affected Spark since 1.1.0, can the patch be 
merged into Spark 1.6.x? Thanks.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> When an Aggregator is used with ExternalSorter in a task, spark will create 
> many small files and could cause too many files open error during merging.
> Currently, ShuffleMemoryManager does not work well when there are 2 spillable 
> objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used 
> by Aggregator) in this case. Here is an example: Due to the usage of mapside 
> aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may 
> ask as much memory as it can, which is totalMem/numberOfThreads. Then later 
> on when ExternalSorter is created in the same thread, the 
> ShuffleMemoryManager could refuse to allocate more memory to it, since the 
> memory is already given to the previous requested 
> object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling 
> small files(due to the lack of memory)
> I'm currently working on a PR to address these two issues. It will include 
> following changes:
> 1. The ShuffleMemoryManager should not only track the memory usage for each 
> thread, but also the object who holds the memory
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant could be evicted/spilled. Previously the spillable 
> objects trigger spilling by themselves. So one may not trigger spilling even 
> if another object in the same thread needs more memory. After this change The 
> ShuffleMemoryManager could trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
> ExternalAppendOnlyMap returns an destructive iterator and can not be spilled 
> after the iterator is returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made 
> change 3 and have a prototype of change 1 and 2 to evict spillable from 
> memory manager, still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change is highly appreciated !






[jira] [Comment Edited] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2016-05-09 Thread Xin Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276109#comment-15276109
 ] 

Xin Hao edited comment on SPARK-4452 at 5/9/16 8:36 AM:


Since this is an old issue that has affected Spark since 1.1.0, can the patch be 
merged into Spark 1.6.x? This would be very helpful for Spark 1.6.x users. Thanks.


was (Author: xhao1):
Since this is an old issue that has affected Spark since 1.1.0, can the patch be 
merged into Spark 1.6.x? Thanks.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> When an Aggregator is used with ExternalSorter in a task, spark will create 
> many small files and could cause too many files open error during merging.
> Currently, ShuffleMemoryManager does not work well when there are 2 spillable 
> objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used 
> by Aggregator) in this case. Here is an example: Due to the usage of mapside 
> aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may 
> ask as much memory as it can, which is totalMem/numberOfThreads. Then later 
> on when ExternalSorter is created in the same thread, the 
> ShuffleMemoryManager could refuse to allocate more memory to it, since the 
> memory is already given to the previous requested 
> object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling 
> small files(due to the lack of memory)
> I'm currently working on a PR to address these two issues. It will include 
> following changes:
> 1. The ShuffleMemoryManager should not only track the memory usage for each 
> thread, but also the object who holds the memory
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant could be evicted/spilled. Previously the spillable 
> objects trigger spilling by themselves. So one may not trigger spilling even 
> if another object in the same thread needs more memory. After this change The 
> ShuffleMemoryManager could trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
> ExternalAppendOnlyMap returns an destructive iterator and can not be spilled 
> after the iterator is returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made 
> change 3 and have a prototype of change 1 and 2 to evict spillable from 
> memory manager, still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change is highly appreciated !






[jira] [Created] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)
GUAN Hao created SPARK-16869:


 Summary: Wrong projection when join on columns with the same name 
which are derived from the same dataframe
 Key: SPARK-16869
 URL: https://issues.apache.org/jira/browse/SPARK-16869
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: GUAN Hao


I have two DataFrames, both containing a column named *i*, which are derived 
from the same DataFrame (via a join).

{code}
b
+---+---+---+---+
|  j|  p|  i|  k|
+---+---+---+---+
|  3|  2|  3|  3|
|  2|  1|  2|  2|
+---+---+---+---+

c
+---+---+---+---+
|  j|  k|  q|  i|
+---+---+---+---+
|  1|  1|  0|  1|
|  2|  2|  1|  2|
+---+---+---+---+
{code}

The expected result of an OUTER join of the two DataFrames above is:

{code}
i = colaesce(b.i, c.i)
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|   1|   1|  1|  1|null|   0|
|   3|null|   3|  3|  3|   2|null|
++++---+---+++
{code}

However, what I got is:

{code}
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|null|null|  1|  1|null|   0|
|   3|   3|   3|  3|  3|   2|null|
++++---+---+++
{code}

{code}
== Physical Plan ==
*Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, q#23L]
+- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter

{code}

As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
{{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.

Complete code to re-produce:

{code}
from pyspark import SparkContext, SQLContext

from pyspark.sql import Row, functions

sc = SparkContext()
sqlContext = SQLContext(sc)

data_a = sc.parallelize([
Row(i=1, j=1, k=1),
Row(i=2, j=2, k=2),
Row(i=3, j=3, k=3),
])
table_a = sqlContext.createDataFrame(data_a)
table_a.show()

data_b = sc.parallelize([
Row(j=2, p=1),
Row(j=3, p=2),
])
table_b = sqlContext.createDataFrame(data_b)
table_b.show()

data_c = sc.parallelize([
Row(j=1, k=1, q=0),
Row(j=2, k=2, q=1),
])
table_c = sqlContext.createDataFrame(data_c)
table_c.show()

b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)

c = table_c.join(table_a, (table_c.j == table_a.j)
  & (table_c.k == table_a.k)) \
.drop(table_a.j) \
.drop(table_a.k)


b.show()
c.show()
{code}







[jira] [Commented] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405669#comment-15405669
 ] 

GUAN Hao commented on SPARK-16869:
--

Oh, {{ i = colaesce(b.i, c.i) }} is actually just a comment.

Please try the code at the bottom of the original post.

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (via a join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The result of OUTER join of two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> {code}






[jira] [Comment Edited] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405669#comment-15405669
 ] 

GUAN Hao edited comment on SPARK-16869 at 8/3/16 10:14 AM:
---

Oh, {{i = colaesce(b.i, c.i)}} is actually just a comment.

Please try the code at the bottom of the original post.


was (Author: raptium):
Oh, {{ i = colaesce(b.i, c.i) }} is actually just a comment.

Please try the code at the bottom of the original post.

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The expected result of the OUTER join of the two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405669#comment-15405669
 ] 

GUAN Hao edited comment on SPARK-16869 at 8/3/16 10:16 AM:
---

{{i = colaesce(b.i, c.i)}} is actually just a comment.


Oh sorry, some lines were lost after pasting.
{code}
result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}


was (Author: raptium):
Oh, {{i = colaesce(b.i, c.i)}} is actually just a comment.

Please try the code at the bottom of the original post.

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The expected result of the OUTER join of the two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GUAN Hao updated SPARK-16869:
-
Description: 
I have two DataFrames, both containing a column named *i*, which are derived from 
the same DataFrame (join).

{code}
b
+---+---+---+---+
|  j|  p|  i|  k|
+---+---+---+---+
|  3|  2|  3|  3|
|  2|  1|  2|  2|
+---+---+---+---+

c
+---+---+---+---+
|  j|  k|  q|  i|
+---+---+---+---+
|  1|  1|  0|  1|
|  2|  2|  1|  2|
+---+---+---+---+
{code}

The expected result of the OUTER join of the two DataFrames above is:

{code}
i = colaesce(b.i, c.i)
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|   1|   1|  1|  1|null|   0|
|   3|null|   3|  3|  3|   2|null|
++++---+---+++
{code}

However, what I got is:

{code}
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|null|null|  1|  1|null|   0|
|   3|   3|   3|  3|  3|   2|null|
++++---+---+++
{code}

{code}
== Physical Plan ==
*Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, q#23L]
+- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter

{code}

As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
{{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.

Complete code to re-produce:

{code}
from pyspark import SparkContext, SQLContext

from pyspark.sql import Row, functions

sc = SparkContext()
sqlContext = SQLContext(sc)

data_a = sc.parallelize([
Row(i=1, j=1, k=1),
Row(i=2, j=2, k=2),
Row(i=3, j=3, k=3),
])
table_a = sqlContext.createDataFrame(data_a)
table_a.show()

data_b = sc.parallelize([
Row(j=2, p=1),
Row(j=3, p=2),
])
table_b = sqlContext.createDataFrame(data_b)
table_b.show()

data_c = sc.parallelize([
Row(j=1, k=1, q=0),
Row(j=2, k=2, q=1),
])
table_c = sqlContext.createDataFrame(data_c)
table_c.show()

b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)

c = table_c.join(table_a, (table_c.j == table_a.j)
  & (table_c.k == table_a.k)) \
.drop(table_a.j) \
.drop(table_a.k)


b.show()
c.show()

result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}


  was:
I have two DataFrames, both containing a column named *i*, which are derived from 
the same DataFrame (join).

{code}
b
+---+---+---+---+
|  j|  p|  i|  k|
+---+---+---+---+
|  3|  2|  3|  3|
|  2|  1|  2|  2|
+---+---+---+---+

c
+---+---+---+---+
|  j|  k|  q|  i|
+---+---+---+---+
|  1|  1|  0|  1|
|  2|  2|  1|  2|
+---+---+---+---+
{code}

The expected result of the OUTER join of the two DataFrames above is:

{code}
i = colaesce(b.i, c.i)
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|   1|   1|  1|  1|null|   0|
|   3|null|   3|  3|  3|   2|null|
++++---+---+++
{code}

However, what I got is:

{code}
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|null|null|  1|  1|null|   0|
|   3|   3|   3|  3|  3|   2|null|
++++---+---+++
{code}

{code}
== Physical Plan ==
*Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, q#23L]
+- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter

{code}

As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
{{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.

Complete code to re-produce:

{code}
from pyspark import SparkContext, SQLContext

from pyspark.sql import Row, functions

sc = SparkContext()
sqlContext = SQLContext(sc)

data_a = sc.parallelize([
Row(i=1, j=1, k=1),
Row(i=2, j=2, k=2),
Row(i=3, j=3, k=3),
])
table_a = sqlContext.createDataFrame(data_a)
table_a.show()

data_b = sc.parallelize([
Row(j=2, p=1),
Row(j=3, p=2),
])
table_b = sqlContext.createDataFrame(data_b)
table_b.show()

data_c = sc.parallelize([
Row(j=1, k=1, q=0),
Row(j=2, k=2, q=1),
])
table_c = sqlContext.createDataFrame(data_c)
table_c.show()

b = table_b.join(table_a, table_b.j == 

[jira] [Comment Edited] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405669#comment-15405669
 ] 

GUAN Hao edited comment on SPARK-16869 at 8/3/16 10:17 AM:
---

{{i = colaesce(b.i, c.i)}} is actually just a comment.


Oh sorry, some lines were lost after pasting. I've updated the original post.


was (Author: raptium):
{{i = colaesce(b.i, c.i)}} is actually just a comment.


Oh sorry, some lines were lost after pasting.
{code}
result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The expected result of the OUTER join of the two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> result = b.join(c, (b.i == c.i)
>   & (b.j == c.j)
>   & (b.k == c.k), 'outer') \
> .select(
> b.i.alias('b_i'),
> c.i.alias('c_i'),
> functions.coalesce(b.i, c.i).alias('i'),
> functions.coalesce(b.j, c.j).alias('j'),
> functions.coalesce(b.k, c.k).alias('k'),
> b.p,
> c.q,
> )
> result.explain()
> result.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405686#comment-15405686
 ] 

GUAN Hao commented on SPARK-16869:
--

I've updated the outer join part (which was missing) in the code snippet in 
the original post.

{code}
result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}
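
Until the resolution bug itself is fixed, one possible workaround is to give each 
side of the final join its own column names before joining, so the projection can 
no longer resolve both sides to the same attribute. This is only a hedged sketch, 
not a verified fix: it assumes the {{b}}, {{c}} DataFrames and the {{functions}} 
import from the reproduction code above, and the renamed columns ({{b_i}}, 
{{c_i}}, ...) are illustrative.

{code}
# Workaround sketch: rename the shared columns on each side so that b and c no
# longer expose identically named columns derived from the same parent.
b2 = b.withColumnRenamed('i', 'b_i') \
      .withColumnRenamed('j', 'b_j') \
      .withColumnRenamed('k', 'b_k')
c2 = c.withColumnRenamed('i', 'c_i') \
      .withColumnRenamed('j', 'c_j') \
      .withColumnRenamed('k', 'c_k')

result = b2.join(c2, (b2.b_i == c2.c_i)
                   & (b2.b_j == c2.c_j)
                   & (b2.b_k == c2.c_k), 'outer') \
    .select(
        b2.b_i,
        c2.c_i,
        functions.coalesce(b2.b_i, c2.c_i).alias('i'),
        functions.coalesce(b2.b_j, c2.c_j).alias('j'),
        functions.coalesce(b2.b_k, c2.c_k).alias('k'),
        b2.p,
        c2.q,
    )

result.explain()
result.show()
{code}

The idea is that {{withColumnRenamed}} introduces a fresh projection on each side, 
so the two {{i}} columns get distinct attributes and the final select cannot 
conflate them; whether this actually dodges the resolution path shown in the plan 
above still needs to be checked on 2.0.0.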

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The expected result of the OUTER join of the two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> result = b.join(c, (b.i == c.i)
>   & (b.j == c.j)
>   & (b.k == c.k), 'outer') \
> .select(
> b.i.alias('b_i'),
> c.i.alias('c_i'),
> functions.coalesce(b.i, c.i).alias('i'),
> functions.coalesce(b.j, c.j).alias('j'),
> functions.coalesce(b.k, c.k).alias('k'),
> b.p,
> c.q,
> )
> result.explain()
> result.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-31 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15451810#comment-15451810
 ] 

Cheng Hao commented on SPARK-17299:
---

Yes, that's my bad, I thought it should have the same behavior as 
`String.trim()`. We should fix this bug. [~jbeard], can you please fix it?
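
For anyone who wants to see the difference between the documented behavior 
(spaces only) and the implementation linked in the report, a minimal check could 
look like the snippet below; it is purely illustrative (the DataFrame and column 
names are made up) and is not part of any fix.

{code}
# Pad a value with tabs instead of spaces; per the docs only spaces should be
# stripped, but per the linked implementation anything <= 0x20 is stripped too.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("\tpadded with tabs\t",)], ["s"])

df.select(F.trim(df.s).alias("trimmed")).show(truncate=False)
{code}

If the leading/trailing tabs disappear, the observed behavior matches the 
implementation rather than the documentation.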

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-31 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15451914#comment-15451914
 ] 

Cheng Hao commented on SPARK-17299:
---

Or should this come after SPARK-14878?

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13326) Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-03-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195022#comment-15195022
 ] 

Cheng Hao commented on SPARK-13326:
---

Cannot reproduce it anymore; can you try it again?

> Dataset in spark 2.0.0-SNAPSHOT missing columns
> ---
>
> Key: SPARK-13326
> URL: https://issues.apache.org/jira/browse/SPARK-13326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> i noticed some things stopped working on datasets in spark 2.0.0-SNAPSHOT, 
> and with a confusing error message (cannot resolved some column with input 
> columns []).
> for example in 1.6.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> res0: org.apache.spark.sql.Dataset[Option[Int]] = [value: int]
> {noformat}
> and same commands in 2.0.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input 
> columns: [];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:283)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:162)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:172)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:176)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:176)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:181)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark

[jira] [Commented] (SPARK-13894) SQLContext.range should return Dataset[Long]

2016-03-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195274#comment-15195274
 ] 

Cheng Hao commented on SPARK-13894:
---

The existing function "SQLContext.range()" returns a schema whose single column is 
named "id"; a lot of unit test code would need to be updated if we changed the 
column name to "value". How about keeping the name "id" unchanged?
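
For reference, the current column name is easy to check; a minimal PySpark 
illustration is below (the entry point matters less than the resulting schema).

{code}
# range() already exposes a single long column named "id".
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

sqlContext.range(3).printSchema()
# Expected output:
# root
#  |-- id: long (nullable = false)
{code}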

> SQLContext.range should return Dataset[Long]
> 
>
> Key: SPARK-13894
> URL: https://issues.apache.org/jira/browse/SPARK-13894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> Rather than returning DataFrame, it should return a Dataset[Long]. The 
> documentation should still make it clear that the underlying schema consists 
> of a single long column named "value".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-06 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10466:
-

 Summary: UnsafeRow exception in Sort-Based Shuffle with data spill 
 Key: SPARK-10466
 URL: https://issues.apache.org/jira/browse/SPARK-10466
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker


In sort-based shuffle, if we have data spill, it will cause assert exception, 
the follow code can reproduce that
{code}
withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
  withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
withTempTable("mytemp") {
  sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
i)).toDF("key", "value").registerTempTable("mytemp")
  sql("select key, value as v1 from mytemp where key > 
1").registerTempTable("l")
  sql("select key, value as v2 from mytemp where key > 
3").registerTempTable("r")

  val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
  df3.count()
}
  }
}
{code}
{code}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
{code}

[jira] [Commented] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734395#comment-14734395
 ] 

Cheng Hao commented on SPARK-10484:
---

In the cartesian product implementation there are two levels of nested loops. 
Exchanging the order of the join tables reduces the number of outer-loop 
iterations (the smaller table drives the outer loop) but increases the number of 
inner-loop iterations over the bigger table, so this is really a manual 
optimization of the SQL query.

I created a PR that optimizes the cartesian join by using a broadcast join.
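
Until that lands, a hedged sketch of the manual workaround in PySpark is shown 
below. It assumes {{sqlContext}} is a HiveContext that can see both tables, that 
the small table really fits in memory, and that the release in use has 
{{pyspark.sql.functions.broadcast()}} (it is not available in 1.5.0); it is a 
sketch of the idea, not something tested against this workload.

{code}
# Hint Spark to broadcast the small stores table so the non-equi join can be
# planned as a broadcast nested-loop join instead of a plain CartesianProduct.
from pyspark.sql import functions as F

pr = sqlContext.table("product_reviews")
stores = sqlContext.table("temp_stores_with_regression")

cond = F.lower(pr.pr_review_content).contains(F.lower(stores.s_store_name))

result = pr.join(F.broadcast(stores), cond).select(
    F.concat_ws("_", stores.s_store_sk.cast("string"),
                stores.s_store_name).alias("store_ID"),
    pr.pr_review_date,
    pr.pr_review_content,
)

result.explain()
{code}

Whether the planner actually picks a broadcast nested-loop join here depends on 
the Spark version, so checking the plan with explain() is part of the workaround.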

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-08 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736016#comment-14736016
 ] 

Cheng Hao commented on SPARK-10466:
---

Sorry, [~davies], I found that the Spark conf doesn't take effect when reusing an 
existing SparkContext instance, hence it passed the unit test. Actually it will 
fail if you run only that test.

Anyway, I've updated the unit test code in the PR so that it creates a new 
SparkContext instance with the specified confs.
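
The general pattern being described is that such confs have to be supplied when 
the context is constructed rather than applied to an already-running one. A 
minimal PySpark illustration of that pattern is below (the actual test in the PR 
is Scala, and the app name here is made up).

{code}
# Create a fresh SparkContext with the confs the test depends on, instead of
# reusing a shared context where they would not take effect.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spill-repro")  # illustrative name
        .set("spark.shuffle.sort.bypassMergeThreshold", "2")
        .set("spark.sql.autoBroadcastJoinThreshold", "0"))

sc = SparkContext(conf=conf)
{code}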

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.ap

[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744966#comment-14744966
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] This is an unrelated issue; you'd better subscribe to the Spark 
mailing list and ask questions there in English. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>  

[jira] [Issue Comment Deleted] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-10466:
--
Comment: was deleted

(was: [~naliazheli] This is an unrelated issue; you'd better subscribe to the 
Spark mailing list and ask questions there in English. 
See http://spark.apache.org/community.html)

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spa

[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744967#comment-14744967
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] This is an unrelated issue; you'd better subscribe to the Spark 
mailing list and ask questions there in English. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>  

[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744969#comment-14744969
 ] 

Cheng Hao commented on SPARK-10474:
---

The root cause of the exception is that the executor doesn't have enough memory 
for external sorting (UnsafeXXXSorter).
The memory available for the sorting is MAX_JVM_HEAP * 
spark.shuffle.memoryFraction * spark.shuffle.safetyFraction.

So a workaround is to give the JVM more memory, or to increase the Spark conf 
keys "spark.shuffle.memoryFraction" (0.2 by default) and 
"spark.shuffle.safetyFraction" (0.8 by default).


> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.

[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745008#comment-14745008
 ] 

Cheng Hao commented on SPARK-10474:
---

But given the current implementation, we had better not throw an exception when 
the requested (off-heap) memory cannot be acquired; maybe we should use fixed 
memory allocations for both the data page and the pointer array instead. What do 
you think?

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---

[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745467#comment-14745467
 ] 

Cheng Hao commented on SPARK-4226:
--

[~marmbrus] [~yhuai] After investigating a little bit, I think using an anti-join 
is much more efficient than rewriting NOT IN / NOT EXISTS as a left outer join 
followed by null filtering. The anti-join can return a negative answer as soon as 
it finds the first match in the second relation, whereas the left outer join has 
to go through every matching pair and then do the filtering.

Besides, for the NOT EXISTS clause, the implementation seems more complicated 
without the anti-join. For example:
{code}
mysql> select * from d1;
+--+--+
| a| b|
+--+--+
|2 |2 |
|8 |   10 |
+--+--+
2 rows in set (0.00 sec)

mysql> select * from d2;
+--+--+
| a| b|
+--+--+
|1 |1 |
|8 | NULL |
|0 |0 |
+--+--+
3 rows in set (0.00 sec)

mysql> select * from d1 where not exists (select b from d2 where d1.a=d2.a);
+--+--+
| a| b|
+--+--+
|2 |2 |
+--+--+
1 row in set (0.00 sec)

// If we rewrite the above query as a left outer join, the filter condition 
// cannot simply be the subquery's project list.
mysql> select d1.a, d1.b from d1 left join d2 on d1.a=d2.a where d2.b is null;
+--+--+
| a| b|
+--+--+
|8 |   10 |
|2 |2 |
+--+--+
2 rows in set (0.00 sec)
// gives a different result from NOT EXISTS.
{code}
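
For completeness, the left-outer rewrite that does match NOT EXISTS has to 
null-filter on the join key coming from the subquery side rather than on its 
project list. A sketch against the same d1/d2 tables (my own illustration, not 
part of the original session):

{code:sql}
-- Sketch: NOT EXISTS rewritten as a left outer join must filter on the
-- right-hand join key, not on the subquery's select list.
SELECT d1.a, d1.b
FROM d1 LEFT JOIN d2 ON d1.a = d2.a
WHERE d2.a IS NULL;
-- returns only (2, 2), matching the NOT EXISTS result above
{code}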

If you feel that makes sense, I can reopen my PR and do the rebasing.

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).

[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746642#comment-14746642
 ] 

Cheng Hao commented on SPARK-4226:
--

Thank you [~brooks], you're right! I meant that it makes the implementation more 
complicated, e.g. we have to resolve and split the conjunction of the condition; 
that's also what I was trying to avoid in my PR by using the anti-join.

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10606) Cube/Rollup/GrpSet doesn't create the correct plan when group by is on something other than an AttributeReference

2015-09-16 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791499#comment-14791499
 ] 

Cheng Hao commented on SPARK-10606:
---

[~rhbutani] Which version are you using? Actually I've fixed this bug in 
SPARK-8972, and the fix should be included in 1.5. Can you try it with 1.5?

> Cube/Rollup/GrpSet doesn't create the correct plan when group by is on 
> something other than an AttributeReference
> -
>
> Key: SPARK-10606
> URL: https://issues.apache.org/jira/browse/SPARK-10606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>Priority: Critical
>
> Consider the following table: t(a : String, b : String) and the query
> {code}
> select a, concat(b, '1'), count(*)
> from t
> group by a, concat(b, '1') with cube
> {code}
> The projections in the Expand operator are not setup correctly. The expand 
> logic in Analyzer:expand is comparing grouping expressions against 
> child.output. So {{concat(b, '1')}} is never mapped to a null Literal.  
> A simple fix is to add a Rule to introduce a Projection below the 
> Cube/Rollup/GrpSet operator that additionally projects the   
> groupingExpressions that are missing in the child.
> Marking this as Critical, because you get wrong results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802912#comment-14802912
 ] 

Cheng Hao commented on SPARK-10474:
---

The root reason for this failure is 
`TungstenAggregationIterator.switchToSortBasedAggregation`: the hash aggregation 
has already eaten up the memory, so when we switch to sort-based aggregation we 
cannot allocate any more memory, not even for the spilling.

I posted a workaround PR for discussion.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by

[jira] [Comment Edited] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802912#comment-14802912
 ] 

Cheng Hao edited comment on SPARK-10474 at 9/17/15 1:48 PM:


The root reason for this failure is the trigger condition for switching from 
hash-based to sort-based aggregation in `TungstenAggregationIterator`: the 
current logic switches to sort-based aggregation once no more memory can be 
allocated, but since no memory is left, the data spill then also fails in 
UnsafeExternalSorter.initializeForWriting.

I posted a workaround PR for discussion.


was (Author: chenghao):
The root reason for this failure is 
`TungstenAggregationIterator.switchToSortBasedAggregation`: the hash aggregation 
has already eaten up the memory, so when we switch to sort-based aggregation we 
cannot allocate any more memory, not even for the spilling.

I posted a workaround PR for discussion.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 

[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904778#comment-14904778
 ] 

Cheng Hao commented on SPARK-10733:
---

[~jameszhouyi] Can you please apply the patch 
https://github.com/chenghao-intel/spark/commit/91af33397100802d6ba577a3f423bb47d5a761ea
 and try your workload? And be sure to set the log level to `INFO`.

[~andrewor14] [~yhuai] One possibility is that the Sort-Merge-Join eats up all of 
the memory: Sort-Merge-Join will not free its memory until we finish iterating 
over all of the join results, yet the partial aggregation actually consumes the 
join result as an iterator, which means there may be no memory left at all for 
the aggregation.
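
To make the scenario concrete, the query shape that can hit this path is a 
sort-merge join whose output feeds a partial aggregation, roughly as sketched 
below (illustrative only; table and column names are borrowed from the related 
SPARK-10474 report):

{code:sql}
-- Illustrative shape only: a sort-merge join feeding a partial aggregation.
-- While the aggregation consumes the join output as an iterator, the join's
-- sort buffers are still held, so the aggregation may find no free memory.
SELECT ss.ss_customer_sk, count(*)
FROM store_sales ss
JOIN item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY ss.ss_customer_sk;
{code}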

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-24 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10829:
-

 Summary: Scan DataSource with predicate expression combine 
partition key and attributes doesn't work
 Key: SPARK-10829
 URL: https://issues.apache.org/jira/browse/SPARK-10829
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker


To reproduce it, run the following code:
{code}
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempPath { dir =>
val path = s"${dir.getCanonicalPath}/part=1"
(1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)

// If the "part = 1" filter gets pushed down, this query will throw an 
exception since
// "part" is not a valid column in the actual Parquet file
checkAnswer(
  sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
  (2 to 3).map(i => Row(i, i.toString, 1)))
  }
}
{code}
We expect the result to be:
{code}
2, 1
3, 1
{code}
But we got:
{code}
1, 1
2, 1
3, 1
{code}
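
Until this is fixed, one possible mitigation (an assumption on my side, not a 
verified fix) is to disable Parquet filter push-down so the partition-key 
predicate is evaluated by Spark instead of being pushed into the Parquet reader:

{code}
// Sketch: turn off Parquet filter push-down for the affected query.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
{code}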



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10831) Spark SQL Configuration missing in the doc

2015-09-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10831:
-

 Summary: Spark SQL Configuration missing in the doc
 Key: SPARK-10831
 URL: https://issues.apache.org/jira/browse/SPARK-10831
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Reporter: Cheng Hao


E.g.
spark.sql.codegen
spark.sql.planner.sortMergeJoin
spark.sql.dialect
spark.sql.caseSensitive
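
For reference, these keys can be set at runtime like any other SQL conf; a 
minimal sketch (the values are illustrative only):

{code}
// Illustrative only: setting the undocumented keys via SQLContext.
sqlContext.setConf("spark.sql.codegen", "true")
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
sqlContext.setConf("spark.sql.dialect", "hiveql")
sqlContext.setConf("spark.sql.caseSensitive", "false")
{code}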



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10992) Partial Aggregation Support for Hive UDAF

2015-10-07 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-10992:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

> Partial Aggregation Support for Hive UDAF
> -
>
> Key: SPARK-10992
> URL: https://issues.apache.org/jira/browse/SPARK-10992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10992) Partial Aggregation Support for Hive UDAF

2015-10-07 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10992:
-

 Summary: Partial Aggregation Support for Hive UDAF
 Key: SPARK-10992
 URL: https://issues.apache.org/jira/browse/SPARK-10992
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11041:
-

 Summary: Add (NOT) IN / EXISTS support for predicates
 Key: SPARK-11041
 URL: https://issues.apache.org/jira/browse/SPARK-11041
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao closed SPARK-11041.
-
Resolution: Duplicate

> Add (NOT) IN / EXISTS support for predicates
> 
>
> Key: SPARK-11041
> URL: https://issues.apache.org/jira/browse/SPARK-11041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11076) Decimal Support for Ceil/Floor

2015-10-12 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11076:
-

 Summary: Decimal Support for Ceil/Floor
 Key: SPARK-11076
 URL: https://issues.apache.org/jira/browse/SPARK-11076
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


Currently, Ceil & Floor don't support decimal, but Hive does.
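
A minimal query illustrating the gap (a hypothetical example, not taken from a 
test suite):

{code:sql}
-- Expected to work, as it does in Hive: ceil/floor over DECIMAL input.
SELECT ceil(CAST(1.23 AS DECIMAL(10, 2))),
       floor(CAST(-1.23 AS DECIMAL(10, 2)));
{code}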



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-10-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958524#comment-14958524
 ] 

Cheng Hao commented on SPARK-4226:
--

[~nadenf] Actually I am working on it right now, and the first PR is ready. It 
would be greatly appreciated if you could try 
https://github.com/apache/spark/pull/9055 in your local testing; let me know if 
you find any problem or bug.

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9735) Auto infer partition schema of HadoopFsRelation should should respected the user specified one

2015-10-20 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-9735:
-
Description: 
This code is copied from hadoopFsRelationSuite.scala:

{code}
partitionedTestDF = (for {
i <- 1 to 3
p2 <- Seq("foo", "bar")
  } yield (i, s"val_$i", 1, p2)).toDF("a", "b", "p1", "p2")

withTempPath { file =>
  val input = partitionedTestDF.select('a, 'b, 
'p1.cast(StringType).as('ps), 'p2)

  input
.write
.format(dataSourceName)
.mode(SaveMode.Overwrite)
.partitionBy("ps", "p2")
.saveAsTable("t")

  input
.write
.format(dataSourceName)
.mode(SaveMode.Append)
.partitionBy("ps", "p2")
.saveAsTable("t")

  val realData = input.collect()
  withTempTable("t") {
checkAnswer(sqlContext.table("t"), realData ++ realData)
  }
}

java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 
in stage 3.0 (TID 206)
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-27 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977600#comment-14977600
 ] 

Cheng Hao commented on SPARK-11330:
---

Hi [~saif.a.ellafi], I've tried the code below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Everything seems to work fine; I am not sure if I missed something. Can you try 
to reproduce the issue based on my code?

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist

[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-27 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977600#comment-14977600
 ] 

Cheng Hao edited comment on SPARK-11330 at 10/28/15 2:28 AM:
-

Hi [~saif.a.ellafi], I've tried the code below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  // generate more data.
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  // save as a parquet file on local disk
  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  // reproduce
  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Everything seems to work fine; I am not sure if I missed something. Can you try 
to reproduce the issue based on my code?


was (Author: chenghao):
Hi [~saif.a.ellafi], I've tried the code below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Everything seems to work fine; I am not sure if I missed something. Can you try 
to reproduce the issue based on my code?

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>

[jira] [Created] (SPARK-11364) HadoopFsRelation doesn't reload the hadoop configuration for each execution

2015-10-27 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11364:
-

 Summary: HadoopFsRelation doesn't reload the hadoop configuration 
for each execution
 Key: SPARK-11364
 URL: https://issues.apache.org/jira/browse/SPARK-11364
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


https://www.mail-archive.com/user@spark.apache.org/msg39706.html

We don't propagate the Hadoop configuration to the Data Source; we always load 
the default Hadoop configuration instead.
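
A sketch of the scenario (the configuration key below is a hypothetical example): 
settings applied to the job's Hadoop configuration should be visible to the 
relation on each execution, but the default configuration is loaded instead.

{code}
// Hypothetical illustration: the caller tweaks the job's Hadoop configuration...
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "134217728")
// ...but HadoopFsRelation keeps the default Hadoop configuration it loaded
// up front, so the change is not picked up on the next execution.
val df = sqlContext.read.parquet("/path/to/data")
{code}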



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979646#comment-14979646
 ] 

Cheng Hao commented on SPARK-11330:
---

[~saif.a.ellafi] I've checked this with 1.5.0 and confirmed it can be 
reproduced; however, it does not exist in the latest master branch. I am still 
digging into when and how it was fixed.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979699#comment-14979699
 ] 

Cheng Hao commented on SPARK-11330:
---

OK, it seems this was solved in https://issues.apache.org/jira/browse/SPARK-11330; it 
should no longer be a problem in 1.5.2 or 1.6.0.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979699#comment-14979699
 ] 

Cheng Hao edited comment on SPARK-11330 at 10/29/15 2:48 AM:
-

OK, it seems this was solved in https://issues.apache.org/jira/browse/SPARK-10859; it 
should no longer be a problem in 1.5.2 or 1.6.0.


was (Author: chenghao):
OK, it seems this was solved in https://issues.apache.org/jira/browse/SPARK-11330; it 
should no longer be a problem in 1.5.2 or 1.6.0.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10371) Optimize sequential projections

2015-10-29 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981650#comment-14981650
 ] 

Cheng Hao commented on SPARK-10371:
---

Eliminating the common sub-expressions within the projection?

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce DataFrames like the following 
> columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), 
> and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c 
> and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11512:
-

 Summary: Bucket Join
 Key: SPARK-11512
 URL: https://issues.apache.org/jira/browse/SPARK-11512
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Support sort merge join on two datasets on the file system that have already 
been partitioned the same way, with the same number of partitions, and sorted 
within each partition, so that we do not need to sort them again when joining 
on the sorted/partitioned keys (see the sketch below).

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)
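
For illustration only, a minimal Scala sketch of the idea using the bucketing 
support that later Spark releases expose through the DataFrameWriter API (table 
and column names here are hypothetical): both sides are pre-partitioned and 
pre-sorted on the join key, so the sort merge join can skip the extra sort.

{code}
// Hypothetical input data; any two datasets joined on a common key would do.
val orders = spark.range(0, 1000)
  .selectExpr("id % 100 AS customer_id", "id AS order_id")
val customers = spark.range(0, 100)
  .selectExpr("id AS customer_id", "cast(id AS string) AS name")

// Persist both sides bucketed and sorted on the join key
// (requires a Spark release that provides DataFrameWriter.bucketBy/sortBy).
orders.write.bucketBy(16, "customer_id").sortBy("customer_id")
  .saveAsTable("orders_bucketed")
customers.write.bucketBy(16, "customer_id").sortBy("customer_id")
  .saveAsTable("customers_bucketed")

// With matching bucket counts and sort keys on both sides, the join below
// can be planned as a sort merge join without re-sorting each side.
val joined = spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")
joined.explain()
{code}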




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990867#comment-14990867
 ] 

Cheng Hao commented on SPARK-11512:
---

Oh, yes, but SPARK-5292 is only about supporting the Hive bucket; in a more 
generic way, we need to add bucket support to the Data Source API. Anyway, I 
will add a link to that jira issue.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Support sort merge join on two datasets on the file system that have already 
> been partitioned the same way, with the same number of partitions, and sorted 
> within each partition, so that we do not need to sort them again when joining 
> on the sorted/partitioned keys.
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990868#comment-14990868
 ] 

Cheng Hao commented on SPARK-11512:
---

We need to add "bucket" support to the DataSource API.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Support sort merge join on two datasets on the file system that have already 
> been partitioned the same way, with the same number of partitions, and sorted 
> within each partition, so that we do not need to sort them again when joining 
> on the sorted/partitioned keys.
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001422#comment-15001422
 ] 

Cheng Hao commented on SPARK-10865:
---

We follow Hive's behavior here; I also tested it in MySQL, and it works the 
same way.

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per ceil/ceiling definition,it should get BIGINT return value
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But in current Spark implementation, it got wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> In hive implementation, it got return value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001423#comment-15001423
 ] 

Cheng Hao commented on SPARK-10865:
---

1.5.2 is released; I am not sure whether this fix is part of it or not.

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per ceil/ceiling definition,it should get BIGINT return value
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But in current Spark implementation, it got wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> In hive implementation, it got return value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-12064:
-

 Summary: Make the SqlParser as trait for better integrated with 
extensions
 Key: SPARK-12064
 URL: https://issues.apache.org/jira/browse/SPARK-12064
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


`SqlParser` is currently an object, which makes it hard to reuse in extensions. 
A better implementation would make `SqlParser` a trait, keep all of its 
implementation unchanged, and then add another object called `SqlParser` that 
inherits from the trait.
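
A minimal, self-contained sketch of the proposed shape (every name besides 
`SqlParser` is hypothetical, and the parse result is simplified to a plain 
String rather than a LogicalPlan):

{code}
// Keep the parsing logic in a trait so extensions can mix it in and
// override pieces of it.
trait SqlParserBase {
  protected def keywords: Set[String] = Set("SELECT", "FROM", "WHERE")

  // The real implementation would build a LogicalPlan; simplified here.
  def parse(sql: String): String = s"parsed[${sql.trim}]"
}

// The existing entry point keeps its name, so current callers are unchanged.
object SqlParser extends SqlParserBase

// An external extension can now reuse the implementation and customize it.
object MyDialectParser extends SqlParserBase {
  override protected def keywords: Set[String] = super.keywords + "QUALIFY"
}
{code}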



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao resolved SPARK-12064.
---
Resolution: Won't Fix

DBX plans to remove the SqlParser in 2.0.

> Make the SqlParser as trait for better integrated with extensions
> -
>
> Key: SPARK-12064
> URL: https://issues.apache.org/jira/browse/SPARK-12064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>
> `SqlParser` is currently an object, which makes it hard to reuse in 
> extensions. A better implementation would make `SqlParser` a trait, keep all 
> of its implementation unchanged, and then add another object called 
> `SqlParser` that inherits from the trait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8360) Streaming DataFrames

2015-12-01 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035335#comment-15035335
 ] 

Cheng Hao commented on SPARK-8360:
--

Add some thoughts on StreamingSQL. 
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8360) Streaming DataFrames

2015-12-01 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035335#comment-15035335
 ] 

Cheng Hao edited comment on SPARK-8360 at 12/2/15 6:19 AM:
---

Add some thoughts on StreamingSQL. 
https://docs.google.com/document/u/1/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/pub

Request edit access if you need it.
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 


was (Author: chenghao):
Add some thoughts on StreamingSQL. 
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8360) Streaming DataFrames

2015-12-02 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-8360:
-
Attachment: StreamingDataFrameProposal.pdf

This is a proposal for streaming DataFrames that we had been working on; 
hopefully it is helpful for the new design.

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
> Attachments: StreamingDataFrameProposal.pdf
>
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8360) Streaming DataFrames

2015-12-02 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035335#comment-15035335
 ] 

Cheng Hao edited comment on SPARK-8360 at 12/2/15 12:14 PM:


Removed the Google Docs link, as I cannot make it accessible to everyone when 
using the corp account. In the meantime, I attached a PDF doc; hopefully it is 
helpful.


was (Author: chenghao):
Add some thoughts on StreamingSQL. 
https://docs.google.com/document/u/1/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/pub

Request edit access if you need it.
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
> Attachments: StreamingDataFrameProposal.pdf
>
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072634#comment-15072634
 ] 

Cheng Hao commented on SPARK-12196:
---

Thank you, wei wu, for supporting this feature!

However, we are trying to avoid changing the existing configuration format, as 
it might impact user applications; besides, in Yarn/Mesos this configuration 
key would not work anymore.

An updated PR will be submitted soon; welcome to join the discussion in the PR.

> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and all 
> the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-12610:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-4226

> Add Anti Join Operators
> ---
>
> Key: SPARK-12610
> URL: https://issues.apache.org/jira/browse/SPARK-12610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> We need to implement the anti join operators to support the NOT predicates in 
> subqueries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-12610:
-

 Summary: Add Anti Join Operators
 Key: SPARK-12610
 URL: https://issues.apache.org/jira/browse/SPARK-12610
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao


We need to implement the anti join operators to support the NOT predicates in 
subqueries.
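
For context only, a small Scala sketch of the DataFrame-level equivalent 
(hypothetical data, and the "left_anti" join type as exposed in later Spark 
releases): a NOT IN / NOT EXISTS subquery corresponds to an anti join that 
keeps only the rows with no match on the other side.

{code}
// Hypothetical data: customers and the orders some of them placed.
val customers = spark.range(0, 10).toDF("id")
val orders = spark.range(0, 5).toDF("customer_id")

// Equivalent of:
//   SELECT * FROM customers c
//   WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
val noOrders = customers.join(
  orders, customers("id") === orders("customer_id"), "left_anti")

noOrders.show()  // ids 5..9, i.e. customers with no matching order
{code}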



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13133) When the option --master of spark-submit script is inconsistent with SparkConf.setMaster in Spark appliction code, the behavior of Spark application is dif

2016-02-02 Thread Ji Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Hao updated SPARK-13133:
---
Comment: was deleted

(was: Sean Owen, I think you should consider this issue; the error log could be 
made clearer!)

> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark appliction code, the behavior of Spark 
> application is difficult to understand
> ---
>
> Key: SPARK-13133
> URL: https://issues.apache.org/jira/browse/SPARK-13133
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Li Ye
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark application code, the behavior is difficult to 
> understand. For example, if the option --master of spark-submit script is 
> yarn-cluster while there is SparkConf.setMaster("local") in Spark application 
> code, the application exit abnormally after about 2 minutes. In driver's log 
> there is an error whose content is "SparkContext did not initialize after 
> waiting for 10 ms. Please check earlier log output for errors. Failing 
> the application".
> When SparkContext is launched, it should be checked whether the option 
> --master of spark-submit script and SparkConf.setMaster in Spark application 
> code are different. If they are different, there should be a clear hint in 
> the driver's log for the developer to troubleshoot.
> I found the same question with me in stackoverflow:
> http://stackoverflow.com/questions/30670933/submit-spark-job-on-yarn-cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13133) When the option --master of spark-submit script is inconsistent with SparkConf.setMaster in Spark appliction code, the behavior of Spark application is difficult to un

2016-02-02 Thread Ji Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127911#comment-15127911
 ] 

Ji Hao commented on SPARK-13133:


Sean Owen, I think you should consider this issue; the error log could be made 
clearer!

> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark appliction code, the behavior of Spark 
> application is difficult to understand
> ---
>
> Key: SPARK-13133
> URL: https://issues.apache.org/jira/browse/SPARK-13133
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Li Ye
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark application code, the behavior is difficult to 
> understand. For example, if the option --master of spark-submit script is 
> yarn-cluster while there is SparkConf.setMaster("local") in Spark application 
> code, the application exit abnormally after about 2 minutes. In driver's log 
> there is an error whose content is "SparkContext did not initialize after 
> waiting for 10 ms. Please check earlier log output for errors. Failing 
> the application".
> When SparkContext is launched, it should be checked whether the option 
> --master of spark-submit script and SparkConf.setMaster in Spark application 
> code are different. If they are different, there should be a clear hint in 
> the driver's log for the developer to troubleshoot.
> I found the same question with me in stackoverflow:
> http://stackoverflow.com/questions/30670933/submit-spark-job-on-yarn-cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13133) When the option --master of spark-submit script is inconsistent with SparkConf.setMaster in Spark appliction code, the behavior of Spark application is difficult to un

2016-02-02 Thread Ji Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127912#comment-15127912
 ] 

Ji Hao commented on SPARK-13133:


Sean Owen, I think you should consider this issue; the error log could be made 
clearer!

> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark appliction code, the behavior of Spark 
> application is difficult to understand
> ---
>
> Key: SPARK-13133
> URL: https://issues.apache.org/jira/browse/SPARK-13133
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Li Ye
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark application code, the behavior is difficult to 
> understand. For example, if the option --master of spark-submit script is 
> yarn-cluster while there is SparkConf.setMaster("local") in Spark application 
> code, the application exit abnormally after about 2 minutes. In driver's log 
> there is an error whose content is "SparkContext did not initialize after 
> waiting for 10 ms. Please check earlier log output for errors. Failing 
> the application".
> When SparkContext is launched, it should be checked whether the option 
> --master of spark-submit script and SparkConf.setMaster in Spark application 
> code are different. If they are different, there should be a clear hint in 
> the driver's log for the developer to troubleshoot.
> I found the same question with me in stackoverflow:
> http://stackoverflow.com/questions/30670933/submit-spark-job-on-yarn-cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5941) Unit Test loads the table `src` twice for leftsemijoin.q

2015-04-05 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-5941:
-
Summary: Unit Test loads the table `src` twice for leftsemijoin.q  (was: 
`def table` is not using the unresolved logical plan `UnresolvedRelation`)

> Unit Test loads the table `src` twice for leftsemijoin.q
> 
>
> Key: SPARK-5941
> URL: https://issues.apache.org/jira/browse/SPARK-5941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5941) Unit Test loads the table `src` twice for leftsemijoin.q

2015-04-05 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-5941:
-
Comment: was deleted

(was: Eagerly resolving the table probably causes side effects in some 
scenarios, so let's keep the same behavior (deferred resolving) as the other 
DF APIs.

In the meantime, I noticed that during the leftsemijoin unit test the table 
sales is loaded twice, hence we get duplicated records (double the records), 
which causes the unit test to fail after updating the DataFrameImpl code to 
use UnresolvedRelation instead of looking up the relation eagerly.

The root cause is that leftsemijoin.q contains a data loading command for the 
table sales, but in TestHive the table sales has also been registered as a 
TestTable; an UnresolvedRelation will trigger the table data loading when it 
is registered as a TestTable, whereas a ResolvedRelation will not.)

> Unit Test loads the table `src` twice for leftsemijoin.q
> 
>
> Key: SPARK-5941
> URL: https://issues.apache.org/jira/browse/SPARK-5941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> In leftsemijoin.q there is already a data loading command for the table 
> sales, but TestHive also creates the table sales, which causes duplicated 
> records to be inserted into sales.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5941) Unit Test loads the table `src` twice for leftsemijoin.q

2015-04-05 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-5941:
-
Description: In leftsemijoin.q there is already a data loading command for the 
table sales, but TestHive also creates the table sales, which causes duplicated 
records to be inserted into sales.

> Unit Test loads the table `src` twice for leftsemijoin.q
> 
>
> Key: SPARK-5941
> URL: https://issues.apache.org/jira/browse/SPARK-5941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> In leftsemijoin.q there is already a data loading command for the table 
> sales, but TestHive also creates the table sales, which causes duplicated 
> records to be inserted into sales.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


