[jira] [Commented] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2017-06-27 Thread Sudeshna Bora (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065990#comment-16065990
 ] 

Sudeshna Bora commented on SPARK-18502:
---

What is the expected time for resolution of this bug?
Currently, my dataset has columns that contain backticks as a special character.
Instantiating such a Column (new Column()) fails when using the
dataset.NumericColumns() and dataset.select() APIs.
Is there a known workaround for this?
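
A minimal sketch of one possible workaround, assuming the goal is simply to avoid the embedded backtick until the parser handles it: rename the offending column before going through the name-parsing APIs. The DataFrame {{df}} and the column names below are illustrative.

{code:scala}
// withColumnRenamed matches the existing name by direct comparison, so the
// embedded backtick should not hit the quoted-attribute parser.
val cleaned = df.withColumnRenamed("Invoice`Date", "InvoiceDate")
cleaned.select("InvoiceDate").show()

// Alternatively, replace all column names at once without any name parsing.
val renamed = df.toDF(df.columns.map(_.replace("`", "")): _*)
{code}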

> Spark does not handle columns that contain backquote (`)
> 
>
> Key: SPARK-18502
> URL: https://issues.apache.org/jira/browse/SPARK-18502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Barry Becker
>Priority: Minor
>
> I know that if a column contains dots or hyphens we can put 
> backquotes/backticks around it, but what if the column contains a backtick 
> (`)? Can the backtick be escaped by some means?
> Here is an example of the sort of error I see
> {code}
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90)
>  org.apache.spark.sql.Column.(Column.scala:113) 
> org.apache.spark.sql.Column$.apply(Column.scala:36) 
> org.apache.spark.sql.functions$.min(functions.scala:407) 
> com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158)
>  
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21232) New built-in SQL function - Data_Type

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065980#comment-16065980
 ] 

Apache Spark commented on SPARK-21232:
--

User 'mmolimar' has created a pull request for this issue:
https://github.com/apache/spark/pull/18447

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
> Fix For: 2.2.0
>
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21232) New built-in SQL function - Data_Type

2017-06-27 Thread Mario Molina (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065982#comment-16065982
 ] 

Mario Molina commented on SPARK-21232:
--

I just created a PR for this.

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
> Fix For: 2.2.0
>
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21232) New built-in SQL function - Data_Type

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21232:


Assignee: (was: Apache Spark)

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
> Fix For: 2.2.0
>
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21232) New built-in SQL function - Data_Type

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21232:


Assignee: Apache Spark

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.2.0
>
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21236) Make the threshold of using HighlyCompressedStatus configurable.

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21236:


Assignee: (was: Apache Spark)

> Make the threshold of using HighlyCompressedStatus configurable.
> 
>
> Key: SPARK-21236
> URL: https://issues.apache.org/jira/browse/SPARK-21236
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: jin xing
>Priority: Minor
>
> Currently the threshold for using {{HighlyCompressedMapStatus}} is hardcoded 
> to 2000.
> We could make this configurable, so that users with enough memory on the 
> driver can set a larger threshold and keep the more accurate per-block sizes 
> recorded by {{CompressedMapStatus}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21236) Make the threshold of using HighlyCompressedStatus configurable.

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065973#comment-16065973
 ] 

Apache Spark commented on SPARK-21236:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/18446

> Make the threshold of using HighlyCompressedStatus configurable.
> 
>
> Key: SPARK-21236
> URL: https://issues.apache.org/jira/browse/SPARK-21236
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: jin xing
>Priority: Minor
>
> Currently the threshold for using {{HighlyCompressedMapStatus}} is hardcoded 
> to 2000.
> We could make this configurable, so that users with enough memory on the 
> driver can set a larger threshold and keep the more accurate per-block sizes 
> recorded by {{CompressedMapStatus}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21236) Make the threshold of using HighlyCompressedStatus configurable.

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21236:


Assignee: Apache Spark

> Make the threshold of using HighlyCompressedStatus configurable.
> 
>
> Key: SPARK-21236
> URL: https://issues.apache.org/jira/browse/SPARK-21236
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Minor
>
> Currently the threshold for using {{HighlyCompressedMapStatus}} is hardcoded 
> to 2000.
> We could make this configurable, so that users with enough memory on the 
> driver can set a larger threshold and keep the more accurate per-block sizes 
> recorded by {{CompressedMapStatus}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21232) New built-in SQL function - Data_Type

2017-06-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065972#comment-16065972
 ] 

Felix Cheung commented on SPARK-21232:
--

I don't see this in Scala - are you proposing we add this?
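
For reference, a column's type can already be inspected through the schema in Scala; a minimal sketch, where {{df}} and the column name "a" are illustrative:

{code:scala}
val dt = df.schema("a").dataType   // e.g. StringType
println(dt.simpleString)           // prints "string"
{code}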

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
> Fix For: 2.2.0
>
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21236) Make the threshold of using HighlyCompressedStatus configurable.

2017-06-27 Thread jin xing (JIRA)
jin xing created SPARK-21236:


 Summary: Make the threshold of using HighlyCompressedStatus 
configurable.
 Key: SPARK-21236
 URL: https://issues.apache.org/jira/browse/SPARK-21236
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.1.1
Reporter: jin xing
Priority: Minor


Currently the threshold for using {{HighlyCompressedMapStatus}} is hardcoded to 
2000.
We could make this configurable, so that users with enough memory on the driver 
can set a larger threshold and keep the more accurate per-block sizes recorded 
by {{CompressedMapStatus}}.
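
A minimal sketch of the kind of change being proposed: today the 2000 cutoff is hardcoded in the MapStatus factory, and the sketch reads it from configuration instead. The configuration key name below is hypothetical, not an existing Spark setting.

{code:scala}
// Sketch only, not the actual Spark code.
def apply(conf: SparkConf, loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  // Hypothetical key; defaults to the currently hardcoded value.
  val threshold = conf.getInt("spark.shuffle.minNumPartitionsToHighlyCompress", 2000)
  if (uncompressedSizes.length > threshold) {
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
{code}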



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21235) UTest should clear temp results when run case

2017-06-27 Thread wangjiaochun (JIRA)
wangjiaochun created SPARK-21235:


 Summary: UTest should clear temp results when run case 
 Key: SPARK-21235
 URL: https://issues.apache.org/jira/browse/SPARK-21235
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.1.1
Reporter: wangjiaochun
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21234) When the function returns Option[Iterator[_]] is None,then get on None will cause java.util.NoSuchElementException: None.get

2017-06-27 Thread wangjiaochun (JIRA)
wangjiaochun created SPARK-21234:


 Summary: When the function returns Option[Iterator[_]] is 
None,then get on None will cause java.util.NoSuchElementException: None.get
 Key: SPARK-21234
 URL: https://issues.apache.org/jira/browse/SPARK-21234
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 2.1.1
Reporter: wangjiaochun


{code}
class BlockManager {
  ...
  def getLocalValues(blockId: BlockId): Option[BlockResult] = {
    ...
    memoryStore.getValues(blockId).get
    ...
  }
  ...
}
{code}
The getValues call above can return None, throw an IllegalArgumentException, or 
return a normal value. If it returns None, calling .get on it throws 
java.util.NoSuchElementException: None.get, so I think this is a potential bug.
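
An illustration only (not the actual BlockManager code) of why calling .get on an Option that may be None is unsafe, and two common ways to handle the None case:

{code:scala}
val values: Option[Iterator[Any]] = None

// values.get would throw java.util.NoSuchElementException: None.get

values match {
  case Some(iter) => iter.foreach(println)       // use the block's values
  case None       => println("block not found")  // handle the missing block
}

// or avoid .get entirely by mapping with a default
val count = values.map(_.size).getOrElse(0)
{code}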



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21233) Support pluggable offset storage

2017-06-27 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21233:
-

 Summary: Support pluggable offset storage
 Key: SPARK-21233
 URL: https://issues.apache.org/jira/browse/SPARK-21233
 Project: Spark
  Issue Type: New Feature
  Components: DStreams
Affects Versions: 2.1.1, 2.0.2
Reporter: darion yaphet


Currently we use *ZooKeeper* to save the *Kafka commit offsets*. When a lot of 
streaming programs are running in the cluster, the load on the ZooKeeper cluster 
is very high. ZooKeeper may not be well suited to storing offsets periodically.

This issue proposes supporting a pluggable offset storage so that offsets do not 
have to be saved in ZooKeeper.
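
A hypothetical sketch of what a pluggable offset store could look like; the trait name and method signatures below are illustrative, not an existing Spark API.

{code:scala}
import org.apache.kafka.common.TopicPartition

// Implementations could be backed by ZooKeeper, Kafka itself, HBase, a
// relational database, etc., and selected through configuration.
trait OffsetStore {
  def commit(groupId: String, offsets: Map[TopicPartition, Long]): Unit
  def fetch(groupId: String, partitions: Seq[TopicPartition]): Map[TopicPartition, Long]
}
{code}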



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21232) New built-in SQL function - Data_Type

2017-06-27 Thread Mario Molina (JIRA)
Mario Molina created SPARK-21232:


 Summary: New built-in SQL function - Data_Type
 Key: SPARK-21232
 URL: https://issues.apache.org/jira/browse/SPARK-21232
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SparkR, SQL
Affects Versions: 2.1.1
Reporter: Mario Molina
Priority: Minor
 Fix For: 2.2.0


This function returns the data type of a given column.

{code:scala}
data_type("a")
// returns string
{code}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21232) New built-in SQL function - Data_Type

2017-06-27 Thread Mario Molina (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Molina updated SPARK-21232:
-
Description: 
This function returns the data type of a given column.

{code:java}
data_type("a")
// returns string
{code}



  was:
This function returns the data type of a given column.

{code:scala}
data_type("a")
// returns string
{code}




> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
> Fix For: 2.2.0
>
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21182) Structured streaming on Spark-shell on windows

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065929#comment-16065929
 ] 

Hyukjin Kwon commented on SPARK-21182:
--

Ah, could you maybe try to explicitly specify the fully qualified URI form via 
{{.option("checkpointLocation", "...")}}?

> Structured streaming on Spark-shell on windows
> --
>
> Key: SPARK-21182
> URL: https://issues.apache.org/jira/browse/SPARK-21182
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Windows 10
> spark-2.1.1-bin-hadoop2.7
>Reporter: Vijay
>Priority: Minor
>
> The structured streaming output operation is failing in the Windows shell.
> As per the error message, the path is being prefixed with a file separator as 
> on Linux, which causes the IllegalArgumentException.
> Following is the error message.
> scala> val query = wordCounts.writeStream  .outputMode("complete")  
> .format("console")  .start()
> java.lang.IllegalArgumentException: Pathname 
> {color:red}*/*{color}C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
>  from 
> C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
>  is not a valid DFS filename.
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:280)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:268)
>   ... 52 elided



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21182) Structured streaming on Spark-shell on windows

2017-06-27 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065921#comment-16065921
 ] 

Vijay commented on SPARK-21182:
---

I'm still facing the same issue.
Actually, I have configured Hadoop on Windows along with Spark.

Will this be an issue?

> Structured streaming on Spark-shell on windows
> --
>
> Key: SPARK-21182
> URL: https://issues.apache.org/jira/browse/SPARK-21182
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Windows 10
> spark-2.1.1-bin-hadoop2.7
>Reporter: Vijay
>Priority: Minor
>
> The structured streaming output operation is failing in the Windows shell.
> As per the error message, the path is being prefixed with a file separator as 
> on Linux, which causes the IllegalArgumentException.
> Following is the error message.
> scala> val query = wordCounts.writeStream  .outputMode("complete")  
> .format("console")  .start()
> java.lang.IllegalArgumentException: Pathname 
> {color:red}*/*{color}C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
>  from 
> C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
>  is not a valid DFS filename.
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:280)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:268)
>   ... 52 elided



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14486) For partition table, the dag occurs oom because of too many same rdds

2017-06-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14486.
--
Resolution: Invalid

I am resolving this. I also agree with the opinion above.

> For partition table, the dag occurs oom because of too many same rdds
> -
>
> Key: SPARK-14486
> URL: https://issues.apache.org/jira/browse/SPARK-14486
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> For a partitioned table, when the partition RDDs do some maps, the number of 
> RDDs grows multiplicatively. So the number of RDDs in the DAG becomes 
> thousands, and an OOM occurs.
> Can we make an improvement to reduce the number of RDDs in the DAG: show the 
> same RDDs just once, not once per partition?
> As the screenshot shows, the "HiveTableScan" cluster has thousands of the same 
> RDDs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21019) read orc when some of the columns are missing in some files

2017-06-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21019.
--
Resolution: Duplicate

I believe this should be a duplicate of SPARK-11412.

>  read orc when some of the columns are missing in some files
> 
>
> Key: SPARK-21019
> URL: https://issues.apache.org/jira/browse/SPARK-21019
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.1
>Reporter: Modi Tamam
>
> I'm using Spark 2.1.1.
> I'm experiencing an issue when reading a bunch of ORC files where some of the 
> fields are missing from some of the files (file-1 has fields 'a' and 'b', 
> file-2 has fields 'a' and 'c'). When I run the same flow with the JSON file 
> format, everything is just fine (you can see it in the code snippets, if you 
> run them...). My question is whether this is a bug or expected behavior.
> I've pushed a Maven project, ready to run; you can find it here: 
> https://github.com/MordechaiTamam/spark-orc-issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21053) Number overflow on agg function of Dataframe

2017-06-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21053.
--
Resolution: Cannot Reproduce

I tried to follow what's written in this JIRA but I could not reproduce this. I 
am resolving this.

> Number overflow on agg function of Dataframe
> 
>
> Key: SPARK-21053
> URL: https://issues.apache.org/jira/browse/SPARK-21053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Databricks Community version
>Reporter: DUC LIEM NGUYEN
>  Labels: features
>
> Using the average aggregation function on a large data set returns NaN 
> instead of the desired numerical value, although the values range between 0 
> and 1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21076) R dapply doesn't return array or raw columns when array have different length

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065902#comment-16065902
 ] 

Hyukjin Kwon commented on SPARK-21076:
--

I believe this produces a similar error to the one described above:

{code}
dapplyCollect(createDataFrame(list(list(1, list(1, 2)))), function(x) x)
{code}

> R dapply doesn't return array or raw columns when array have different length
> -
>
> Key: SPARK-21076
> URL: https://issues.apache.org/jira/browse/SPARK-21076
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Xu Yang
>
> Calling SparkR::dapplyCollect with R functions that return dataframes 
> produces an error. This comes up when returning columns of binary data, i.e. 
> serialized fitted models. It also happens when functions return columns 
> containing vectors. 
> [SPARK-16785|https://issues.apache.org/jira/browse/SPARK-16785]
> still has this issue when the input data is an array column whose vectors do 
> not all have the same length, like:
> {code}
> head(test1)
>key  value
> 1 4dda7d68a202e9e3  1595297780
> 2  4e08f349deb7392  641991337
> 3 4e105531747ee00b  374773009
> 4 4f1d5ef7fdb4620a  2570136926
> 5 4f63a71e6dde04cd  2117602722
> 6 4fa2f96b689624fc  3489692062, 1344510747, 1095592237, 
> 424510360, 3211239587
> sparkR.stop()
> sc <- sparkR.init()
> sqlContext <- sparkRSQL.init(sc)
> spark_df = createDataFrame(sqlContext, test1)
> # Fails
> dapplyCollect(spark_df, function(x) x)
> Caused by: org.apache.spark.SparkException: R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  : 
>   invalid list argument: all variables should have the same length
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   ... 1 more
> # Works fine
> spark_df <- selectExpr(spark_df, "key", "explode(value) value") 
> dapplyCollect(spark_df, function(x) x)
> key value
> 1  4dda7d68a202e9e3 1595297780
> 2   4e08f349deb7392  641991337
> 3  4e105531747ee00b  374773009
> 4  4f1d5ef7fdb4620a 2570136926
> 5  4f63a71e6dde04cd 2117602722
> 6  4fa2f96b689624fc 3489692062
> 7  4fa2f96b689624fc 1344510747
> 8  4fa2f96b689624fc 1095592237
> 9  4fa2f96b689624fc  424510360
> 10 4fa2f96b689624fc 3211239587
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21182) Structured streaming on Spark-shell on windows

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065876#comment-16065876
 ] 

Hyukjin Kwon commented on SPARK-21182:
--

Looks like I can't reproduce this on Windows at the current master.

With the reproducer below:

{code:title=Wordcount.scala|borderStyle=solid}
val lines = spark.readStream.format("socket").option("host", 
"localhost").option("port", ).load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count().sort($"count".desc)
val query = 
wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
{code}

{code:title=nc.py|borderStyle=solid}
import socket
import urllib
import time


if __name__ == "__main__":
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', ))
s.listen(1)
conn, _ = s.accept()
while True:
conn.sendall(raw_input() + "\n")
{code}

In cmd A:

{code}
C:\...\...>python nc.py
{code}

In cmd B:

{code}
C:\...\...>.\bin\spark-shell -i Wordcount.scala
{code}

In cmd A:
{code}
 
a
abab
abab
{code}

In cmd B:

{code}
-------------------------------------------
Batch: 0
-------------------------------------------
...
+-----+-----+
|value|count|
+-----+-----+
|     |    1|
+-----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
...
+-----+-----+
|value|count|
+-----+-----+
|    a|    1|
|     |    1|
+-----+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
...
+-----+-----+
|value|count|
+-----+-----+
| abab|    1|
|    a|    1|
|     |    1|
+-----+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
...
+-----+-----+
|value|count|
+-----+-----+
| abab|    2|
|    a|    1|
|     |    1|
+-----+-----+

{code}


> Structured streaming on Spark-shell on windows
> --
>
> Key: SPARK-21182
> URL: https://issues.apache.org/jira/browse/SPARK-21182
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Windows 10
> spark-2.1.1-bin-hadoop2.7
>Reporter: Vijay
>Priority: Minor
>
> The structured streaming output operation is failing in the Windows shell.
> As per the error message, the path is being prefixed with a file separator as 
> on Linux, which causes the IllegalArgumentException.
> Following is the error message.
> scala> val query = wordCounts.writeStream  .outputMode("complete")  
> .format("console")  .start()
> java.lang.IllegalArgumentException: Pathname 
> {color:red}*/*{color}C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
>  from 
> C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
>  is not a valid DFS filename.
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:280)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:268)
>   ... 52 elided



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19726) Faild to insert null timestamp value to mysql using spark jdbc

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19726:


Assignee: Apache Spark

> Faild to insert null timestamp value to mysql using spark jdbc
> --
>
> Key: SPARK-19726
> URL: https://issues.apache.org/jira/browse/SPARK-19726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: AnfengYuan
>Assignee: Apache Spark
>
> 1. create a table in mysql
> {code:borderStyle=solid}
> CREATE TABLE `timestamp_test` (
>   `id` bigint(23) DEFAULT NULL,
>   `time_stamp` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
> CURRENT_TIMESTAMP
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8
> {code}
> 2. insert one row using spark
> {code:borderStyle=solid}
> CREATE OR REPLACE TEMPORARY VIEW jdbcTable
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url 
> 'jdbc:mysql://xxx.xxx.xxx.xxx:3306/default?characterEncoding=utf8&useServerPrepStmts=false&rewriteBatchedStatements=true',
>   dbtable 'timestamp_test',
>   driver 'com.mysql.jdbc.Driver',
>   user 'root',
>   password 'root'
> );
> insert into jdbcTable values (1, null);
> {code}
> the insert statement failed with exceptions:
> {code:borderStyle=solid}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 599 in stage 1.0 failed 4 times, most recent failure: Lost task 599.3 in 
> stage 1.0 (TID 1202, A03-R07-I12-135.JD.LOCAL): 
> java.sql.BatchUpdateException: Data truncation: Incorrect datetime value: 
> '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
>   at com.mysql.jdbc.Util.getInstance(Util.java:387)
>   at 
> com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1154)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1582)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1248)
>   at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:959)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:227)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect 
> datetime value: '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3876)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2073)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2009)
>   at 
> com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5094)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1543)
>   ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

--

[jira] [Commented] (SPARK-19726) Faild to insert null timestamp value to mysql using spark jdbc

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065869#comment-16065869
 ] 

Apache Spark commented on SPARK-19726:
--

User 'shuangshuangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18445

> Faild to insert null timestamp value to mysql using spark jdbc
> --
>
> Key: SPARK-19726
> URL: https://issues.apache.org/jira/browse/SPARK-19726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: AnfengYuan
>
> 1. create a table in mysql
> {code:borderStyle=solid}
> CREATE TABLE `timestamp_test` (
>   `id` bigint(23) DEFAULT NULL,
>   `time_stamp` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
> CURRENT_TIMESTAMP
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8
> {code}
> 2. insert one row using spark
> {code:borderStyle=solid}
> CREATE OR REPLACE TEMPORARY VIEW jdbcTable
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url 
> 'jdbc:mysql://xxx.xxx.xxx.xxx:3306/default?characterEncoding=utf8&useServerPrepStmts=false&rewriteBatchedStatements=true',
>   dbtable 'timestamp_test',
>   driver 'com.mysql.jdbc.Driver',
>   user 'root',
>   password 'root'
> );
> insert into jdbcTable values (1, null);
> {code}
> the insert statement failed with exceptions:
> {code:borderStyle=solid}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 599 in stage 1.0 failed 4 times, most recent failure: Lost task 599.3 in 
> stage 1.0 (TID 1202, A03-R07-I12-135.JD.LOCAL): 
> java.sql.BatchUpdateException: Data truncation: Incorrect datetime value: 
> '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
>   at com.mysql.jdbc.Util.getInstance(Util.java:387)
>   at 
> com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1154)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1582)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1248)
>   at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:959)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:227)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect 
> datetime value: '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3876)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2073)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2009)
>   at 
> com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5094)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1543)
>   ... 15 more
> {code}



--
This message was sent by 

[jira] [Assigned] (SPARK-19726) Faild to insert null timestamp value to mysql using spark jdbc

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19726:


Assignee: (was: Apache Spark)

> Faild to insert null timestamp value to mysql using spark jdbc
> --
>
> Key: SPARK-19726
> URL: https://issues.apache.org/jira/browse/SPARK-19726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: AnfengYuan
>
> 1. create a table in mysql
> {code:borderStyle=solid}
> CREATE TABLE `timestamp_test` (
>   `id` bigint(23) DEFAULT NULL,
>   `time_stamp` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
> CURRENT_TIMESTAMP
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8
> {code}
> 2. insert one row using spark
> {code:borderStyle=solid}
> CREATE OR REPLACE TEMPORARY VIEW jdbcTable
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url 
> 'jdbc:mysql://xxx.xxx.xxx.xxx:3306/default?characterEncoding=utf8&useServerPrepStmts=false&rewriteBatchedStatements=true',
>   dbtable 'timestamp_test',
>   driver 'com.mysql.jdbc.Driver',
>   user 'root',
>   password 'root'
> );
> insert into jdbcTable values (1, null);
> {code}
> the insert statement failed with exceptions:
> {code:borderStyle=solid}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 599 in stage 1.0 failed 4 times, most recent failure: Lost task 599.3 in 
> stage 1.0 (TID 1202, A03-R07-I12-135.JD.LOCAL): 
> java.sql.BatchUpdateException: Data truncation: Incorrect datetime value: 
> '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
>   at com.mysql.jdbc.Util.getInstance(Util.java:387)
>   at 
> com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1154)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1582)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1248)
>   at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:959)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:227)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect 
> datetime value: '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3876)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2073)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2009)
>   at 
> com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5094)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1543)
>   ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mai

[jira] [Commented] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-27 Thread Gengliang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065791#comment-16065791
 ] 

Gengliang Wang commented on SPARK-21222:


[~srowen] thanks! I have corrected the statement.

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Move elimination of Distinct clause from analyzer to optimizer
> The Distinct clause is useless after a MAX/MIN clause. For example,
> "Select MAX(distinct a) FROM src"
> is equivalent to
> "Select MAX(a) FROM src"
> However, this optimization is implemented in the analyzer. It should be in 
> the optimizer.
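
A minimal sketch (not Spark's actual rule) of what such an optimizer rule could look like, assuming the Catalyst {{Rule}} and {{AggregateExpression}} APIs; the object name is hypothetical.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Max, Min}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object EliminateDistinctInMaxMin extends Rule[LogicalPlan] {
  // DISTINCT never changes the result of MAX or MIN, so the flag can be dropped.
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case ae: AggregateExpression
        if ae.isDistinct &&
          (ae.aggregateFunction.isInstanceOf[Max] || ae.aggregateFunction.isInstanceOf[Min]) =>
      ae.copy(isDistinct = false)
  }
}
{code}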



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-27 Thread Gengliang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-21222:
---
Description: 
Move elimination of Distinct clause from analyzer to optimizer

The Distinct clause is useless after a MAX/MIN clause. For example,
"Select MAX(distinct a) FROM src"
is equivalent to
"Select MAX(a) FROM src"
However, this optimization is implemented in the analyzer. It should be in the 
optimizer.

  was:
Distinct clause is after MAX/MIN clause 
"Select MAX(distinct a) FROM src from"
 is equivalent of 
"Select MAX(distinct a) FROM src from"

However, this optimization is implemented in analyzer. It should be in 
optimizer.


> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Move elimination of Distinct clause from analyzer to optimizer
> The Distinct clause is useless after a MAX/MIN clause. For example,
> "Select MAX(distinct a) FROM src"
> is equivalent to
> "Select MAX(a) FROM src"
> However, this optimization is implemented in the analyzer. It should be in 
> the optimizer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2017-06-27 Thread Han Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065790#comment-16065790
 ] 

Han Xu commented on SPARK-10915:


I'm currently traveling without access to my email.  To get in touch with me, 
please call +1 650 272 7131.


> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16542) bugs about types that result an array of null when creating dataframe using python

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065789#comment-16065789
 ] 

Apache Spark commented on SPARK-16542:
--

User 'zasdfgbnm' has created a pull request for this issue:
https://github.com/apache/spark/pull/18444

> bugs about types that result an array of null when creating dataframe using 
> python
> --
>
> Key: SPARK-16542
> URL: https://issues.apache.org/jira/browse/SPARK-16542
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Xiang Gao
>
> This is a bug about types that results in an array of nulls when creating a 
> DataFrame using Python.
> Python's array.array has richer types than Python itself, e.g. we can have 
> {{array('f',[1,2,3])}} and {{array('d',[1,2,3])}}. The code in spark-sql didn't 
> take this into consideration, which can cause you to get an 
> array of null values when you have {{array('f')}} in your rows.
> A simple code to reproduce this is:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext,Row,DataFrame
> from array import array
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
> rows = sc.parallelize([ row1 ])
> df = sqlContext.createDataFrame(rows)
> df.show()
> {code}
> which have output
> {code}
> +---------------+------------------+
> |    doublearray|        floatarray|
> +---------------+------------------+
> |[1.0, 2.0, 3.0]|[null, null, null]|
> +---------------+------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2017-06-27 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065787#comment-16065787
 ] 

Erik Erlandson commented on SPARK-10915:


This would be great for exposing {{TDigest}} aggregation to PySpark datasets 
(see https://github.com/isarn/isarn-sketches#t-digest).

Currently the newer {{Aggregator}} trait makes this easy to do for datasets in 
Scala. Writing the alternative {{UserDefinedAggregateFunction}} is possible, 
although I'd have to code my own serializer for a TDigest UDT instead of just 
using {{Encoder.kryo}}. But exposing a UDAF to Python is a hack at best (see 
https://stackoverflow.com/a/33257733/3669757).
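
For reference, a minimal Dataset {{Aggregator}} in Scala (a simple mean rather than a TDigest) to illustrate the pattern mentioned above; all names here are illustrative. The point of this issue is that nothing comparable can currently be defined from Python.

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class AvgBuffer(sum: Double, count: Long)

object MeanAgg extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, x: Double): AvgBuffer = AvgBuffer(b.sum + x, b.count + 1)
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer = AvgBuffer(a.sum + b.sum, a.count + b.count)
  def finish(b: AvgBuffer): Double = if (b.count == 0) Double.NaN else b.sum / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// usage, given ds: Dataset[Double]:
//   ds.select(MeanAgg.toColumn).show()
{code}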


> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065775#comment-16065775
 ] 

Hyukjin Kwon commented on SPARK-21227:
--

In Scala too:

{code}
val jsons = Seq(
  """{"city_name": "paris"}""",
  """{"cıty_name": "new-york"}""")

spark.read.json(jsons.toDS).show()
spark.read.json(jsons.toDS).select("city_name").show()
{code}

> Unicode in Json field causes AnalysisException when selecting from Dataframe
> 
>
> Key: SPARK-21227
> URL: https://issues.apache.org/jira/browse/SPARK-21227
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Seydou Dia
>
> Hi,
> please find below the steps to reproduce the issue I am facing.
> First I create a JSON document with 2 fields:
> * city_name
> * cıty_name
> The first one is valid ASCII, while the second contains a Unicode character 
> (ı, i without a dot).
> When I try to select from the dataframe I get an {noformat} 
> AnalysisException {noformat}.
> {code:python}
> $ pyspark
> Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
> SparkSession available as 'spark'.
> >>> sc=spark.sparkContext
> >>> js = ['{"city_name": "paris"}'
> ... , '{"city_name": "rome"}'
> ... , '{"city_name": "berlin"}'
> ... , '{"cıty_name": "new-york"}'
> ... , '{"cıty_name": "toronto"}'
> ... , '{"cıty_name": "chicago"}'
> ... , '{"cıty_name": "dubai"}']
> >>> myRDD = sc.parallelize(js)
> >>> myDF = spark.read.json(myRDD)
> >>> myDF.printSchema()
> >>>   
> root
>  |-- city_name: string (nullable = true)
>  |-- cıty_name: string (nullable = true)
> >>> myDF.select(myDF['city_name'])
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
>   at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
> __getitem__
> jc = self._jdf.apply(item)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;"
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Assigned] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21155:
---

Assignee: Eric Vandenberg

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Assignee: Eric Vandenberg
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of running 
> tasks. See the screen shot attachments for what this looks like.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21155.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18369
[https://github.com/apache/spark/pull/18369]

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of running 
> tasks. See the screen shot attachments for what this looks like.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065769#comment-16065769
 ] 

Hyukjin Kwon commented on SPARK-21227:
--

I can reproduce this in both Python 2.7 and 3.6.0:

{code}
sc = spark.sparkContext
js = ['{"city_name": "paris"}'
, '{"city_name": "rome"}'
, '{"city_name": "berlin"}'
, '{"cıty_name": "new-york"}'
, '{"cıty_name": "toronto"}'
, '{"cıty_name": "chicago"}'
, '{"cıty_name": "dubai"}']
myRDD = sc.parallelize(js)
myDF = spark.read.json(myRDD)
myDF.printSchema()
myDF.show()
myDF.select(myDF['city_name'])
{code}



> Unicode in Json field causes AnalysisException when selecting from Dataframe
> 
>
> Key: SPARK-21227
> URL: https://issues.apache.org/jira/browse/SPARK-21227
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Seydou Dia
>
> Hi,
> please find below the step to reproduce the issue I am facing.
> First I create a json with 2 fields:
> * city_name
> * cıty_name
> The first one is valid ascii, while the second contains a unicode (ı, i 
> without dot ).
> When I try to select from the dataframe I have an  {noformat} 
> AnalysisException {noformat}.
> {code:python}
> $ pyspark
> Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
> SparkSession available as 'spark'.
> >>> sc=spark.sparkContext
> >>> js = ['{"city_name": "paris"}'
> ... , '{"city_name": "rome"}'
> ... , '{"city_name": "berlin"}'
> ... , '{"cıty_name": "new-york"}'
> ... , '{"cıty_name": "toronto"}'
> ... , '{"cıty_name": "chicago"}'
> ... , '{"cıty_name": "dubai"}']
> >>> myRDD = sc.parallelize(js)
> >>> myDF = spark.read.json(myRDD)
> >>> myDF.printSchema()
> >>>   
> root
>  |-- city_name: string (nullable = true)
>  |-- cıty_name: string (nullable = true)
> >>> myDF.select(myDF['city_name'])
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
>   at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
> __getitem__
> jc = self._jdf.apply(item)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/usr/lib/spark/python/pyspark/sql

[jira] [Issue Comment Deleted] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21227:
-
Comment: was deleted

(was: I tested this on both Python 3.6.0 and 2.7.10, as below, but I could not 
reproduce it against the current master:

{code}
sc = spark.sparkContext
js = ['{"city_name": "paris"}'
, '{"city_name": "rome"}'
, '{"city_name": "berlin"}'
, '{"cıty_name": "new-york"}'
, '{"cıty_name": "toronto"}'
, '{"cıty_name": "chicago"}'
, '{"cıty_name": "dubai"}']
myRDD = sc.parallelize(js)
myDF = spark.read.json(myRDD)
myDF.printSchema()
myDF.show()
{code}

{code}
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)
{code}

{code}
+---------+---------+
|city_name|cıty_name|
+---------+---------+
|    paris|     null|
|     rome|     null|
|   berlin|     null|
|     null| new-york|
|     null|  toronto|
|     null|  chicago|
|     null|    dubai|
+---------+---------+
{code}

Also, tested this with Scala as below:

{code}
java.util.Locale.setDefault(new java.util.Locale("tr-TR"))
val jsons = Seq(
  """{"city_name": "paris"}""",
  """{"cıty_name": "new-york"}""")
spark.read.json(jsons.toDS).show()
{code}

{code}
+---------+---------+
|city_name|cıty_name|
+---------+---------+
|    paris|     null|
|     null| new-york|
+---------+---------+
{code}


{code}
java.util.Locale.setDefault(new java.util.Locale("en-US"))
val jsons = Seq(
  """{"city_name": "paris"}""",
  """{"cıty_name": "new-york"}""")
spark.read.json(jsons.toDS).show()
{code}

{code}
+---------+---------+
|city_name|cıty_name|
+---------+---------+
|    paris|     null|
|     null| new-york|
+---------+---------+
{code})

> Unicode in Json field causes AnalysisException when selecting from Dataframe
> 
>
> Key: SPARK-21227
> URL: https://issues.apache.org/jira/browse/SPARK-21227
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Seydou Dia
>
> Hi,
> please find below the step to reproduce the issue I am facing.
> First I create a json with 2 fields:
> * city_name
> * cıty_name
> The first one is valid ascii, while the second contains a unicode (ı, i 
> without dot ).
> When I try to select from the dataframe I have an  {noformat} 
> AnalysisException {noformat}.
> {code:python}
> $ pyspark
> Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
> SparkSession available as 'spark'.
> >>> sc=spark.sparkContext
> >>> js = ['{"city_name": "paris"}'
> ... , '{"city_name": "rome"}'
> ... , '{"city_name": "berlin"}'
> ... , '{"cıty_name": "new-york"}'
> ... , '{"cıty_name": "toronto"}'
> ... , '{"cıty_name": "chicago"}'
> ... , '{"cıty_name": "dubai"}']
> >>> myRDD = sc.parallelize(js)
> >>> myDF = spark.read.json(myRDD)
> >>> myDF.printSchema()
> >>>   
> root
>  |-- city_name: string (nullable = true)
>  |-- cıty_name: string (nullable = true)
> >>> myDF.select(myDF['city_name'])
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
>   at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.ref

[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065768#comment-16065768
 ] 

Hyukjin Kwon commented on SPARK-21227:
--

I tested this on both Python 3.6.0 and 2.7.10, as below, but I could not 
reproduce it against the current master:

{code}
sc = spark.sparkContext
js = ['{"city_name": "paris"}'
, '{"city_name": "rome"}'
, '{"city_name": "berlin"}'
, '{"cıty_name": "new-york"}'
, '{"cıty_name": "toronto"}'
, '{"cıty_name": "chicago"}'
, '{"cıty_name": "dubai"}']
myRDD = sc.parallelize(js)
myDF = spark.read.json(myRDD)
myDF.printSchema()
myDF.show()
{code}

{code}
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)
{code}

{code}
+---------+---------+
|city_name|cıty_name|
+---------+---------+
|    paris|     null|
|     rome|     null|
|   berlin|     null|
|     null| new-york|
|     null|  toronto|
|     null|  chicago|
|     null|    dubai|
+---------+---------+
{code}

Also, tested this with Scala as below:

{code}
java.util.Locale.setDefault(new java.util.Locale("tr-TR"))
val jsons = Seq(
  """{"city_name": "paris"}""",
  """{"cıty_name": "new-york"}""")
spark.read.json(jsons.toDS).show()
{code}

{code}
+---------+---------+
|city_name|cıty_name|
+---------+---------+
|    paris|     null|
|     null| new-york|
+---------+---------+
{code}


{code}
java.util.Locale.setDefault(new java.util.Locale("en-US"))
val jsons = Seq(
  """{"city_name": "paris"}""",
  """{"cıty_name": "new-york"}""")
spark.read.json(jsons.toDS).show()
{code}

{code}
+---------+---------+
|city_name|cıty_name|
+---------+---------+
|    paris|     null|
|     null| new-york|
+---------+---------+
{code}

> Unicode in Json field causes AnalysisException when selecting from Dataframe
> 
>
> Key: SPARK-21227
> URL: https://issues.apache.org/jira/browse/SPARK-21227
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Seydou Dia
>
> Hi,
> please find below the step to reproduce the issue I am facing.
> First I create a json with 2 fields:
> * city_name
> * cıty_name
> The first one is valid ascii, while the second contains a unicode (ı, i 
> without dot ).
> When I try to select from the dataframe I have an  {noformat} 
> AnalysisException {noformat}.
> {code:python}
> $ pyspark
> Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
> SparkSession available as 'spark'.
> >>> sc=spark.sparkContext
> >>> js = ['{"city_name": "paris"}'
> ... , '{"city_name": "rome"}'
> ... , '{"city_name": "berlin"}'
> ... , '{"cıty_name": "new-york"}'
> ... , '{"cıty_name": "toronto"}'
> ... , '{"cıty_name": "chicago"}'
> ... , '{"cıty_name": "dubai"}']
> >>> myRDD = sc.parallelize(js)
> >>> myDF = spark.read.json(myRDD)
> >>> myDF.printSchema()
> >>>   
> root
>  |-- city_name: string (nullable = true)
>  |-- cıty_name: string (nullable = true)
> >>> myDF.select(myDF['city_name'])
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
>   at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Meth

[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator

2017-06-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065694#comment-16065694
 ] 

yuhao yang commented on SPARK-21152:


This is something that we should investigate anyway. 

By GEMM, do you mean you will treat the coefficients as a Matrix even though it's 
actually a vector? Before the implementation, I think it's necessary to check 
the GEMM speedup when multiplying a matrix by a vector, which could be quite 
different from a normal GEMM.
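
For what it's worth, a rough sketch of that kind of check using Breeze (the block 
size and feature count below are made-up numbers, nothing from this ticket):

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val numRows = 1024          // assumed block size
val numFeatures = 100       // assumed feature count
val block = DenseMatrix.rand[Double](numRows, numFeatures)    // one block of instances
val coefficients = DenseVector.rand[Double](numFeatures)

// current style: one dot product per row
val perRow = (0 until numRows).map(i => block(i, ::).t dot coefficients)

// blocked style: a single BLAS call (gemv here, since the coefficients are a vector)
val blocked = block * coefficients
{code}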

> Use level 3 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Seth Hendrickson
>
> In the logistic regression gradient update, we currently compute the update row by 
> row. If we block the rows together, we can do a blocked gradient update which 
> leverages the BLAS GEMM operation.
> On high dimensional dense datasets, I've observed ~10x speedups. The problem 
> here, though, is that it likely won't improve the sparse case so we need to 
> keep both implementations around, and this blocked algorithm will require 
> caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything beside the original dataset passed to train 
> in the past because it adds memory overhead if the user has cached this 
> original dataset for other reasons. Here, I'd like to discuss whether we 
> think this patch would be worth the investment, given that it only improves a 
> subset of the use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21231) Conda install of packages during Jenkins testing is causing intermittent failure

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21231:


Assignee: Apache Spark

> Conda install of packages during Jenkins testing is causing intermittent 
> failure
> 
>
> Key: SPARK-21231
> URL: https://issues.apache.org/jira/browse/SPARK-21231
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: Apache Spark
>
> To allow testing of the Arrow PR we installed Arrow as part of the pip tests. 
> However, conda forge with the old version of conda on the jenkins workers 
> causes issues. To work around this [~shaneknapp] installed Arrow on the 
> workers, so we can avoid depending on installing at runtime. See 
> https://github.com/apache/spark/pull/15821 for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21231) Conda install of packages during Jenkins testing is causing intermittent failure

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065621#comment-16065621
 ] 

Apache Spark commented on SPARK-21231:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/18443

> Conda install of packages during Jenkins testing is causing intermittent 
> failure
> 
>
> Key: SPARK-21231
> URL: https://issues.apache.org/jira/browse/SPARK-21231
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 2.3.0
>Reporter: holdenk
>
> To allow testing of the Arrow PR we installed Arrow as part of the pip tests. 
> However, conda forge with the old version of conda on the jenkins workers 
> causes issues. To work around this [~shaneknapp] installed Arrow on the 
> workers, so we can avoid depending on installing at runtime. See 
> https://github.com/apache/spark/pull/15821 for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21231) Conda install of packages during Jenkins testing is causing intermittent failure

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21231:


Assignee: (was: Apache Spark)

> Conda install of packages during Jenkins testing is causing intermittent 
> failure
> 
>
> Key: SPARK-21231
> URL: https://issues.apache.org/jira/browse/SPARK-21231
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark, Tests
>Affects Versions: 2.3.0
>Reporter: holdenk
>
> To allow testing of the Arrow PR we installed Arrow as part of the pip tests. 
> However, conda forge with the old version of conda on the jenkins workers 
> causes issues. To work around this [~shaneknapp] installed Arrow on the 
> workers, so we can avoid depending on installing at runtime. See 
> https://github.com/apache/spark/pull/15821 for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21231) Conda install of packages during Jenkins testing is causing intermittent failure

2017-06-27 Thread holdenk (JIRA)
holdenk created SPARK-21231:
---

 Summary: Conda install of packages during Jenkins testing is 
causing intermittent failure
 Key: SPARK-21231
 URL: https://issues.apache.org/jira/browse/SPARK-21231
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, PySpark, Tests
Affects Versions: 2.3.0
Reporter: holdenk


To allow testing of the Arrow PR we installed Arrow as part of the pip tests. 
However, conda forge with the old version of conda on the jenkins workers 
causes issues. To work around this [~shaneknapp] installed Arrow on the 
workers, so we can avoid depending on installing at runtime. See 
https://github.com/apache/spark/pull/15821 for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065515#comment-16065515
 ] 

Steve Loughran commented on SPARK-21137:


bq. so it is something that could be optimized in the Hadoop API, and seems 
somewhat specific to a local filesystem.

Yeah, it's something unique to the local FS (and maybe things which extend it, like 
glusterfs, [~jayunit100]?). The fix would be straightforward if someone goes down 
to it and writes the test, ideally a scale test which creates 1K+ small files and 
then measures the difference in listStatus times before and after.
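
Something along these lines, as a rough sketch only (the file count and temp path are 
my assumptions, not from an actual Hadoop test):

{code}
import java.nio.file.{Files, Paths}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// create 2000 tiny files on the local filesystem
val dir = Files.createTempDirectory("many-small-files")
(1 to 2000).foreach { i =>
  Files.write(Paths.get(dir.toString, s"part-$i.txt"), "x".getBytes("UTF-8"))
}

// time a plain listStatus against the local FS, before/after any fix
val fs = FileSystem.getLocal(new Configuration())
val start = System.nanoTime()
val statuses = fs.listStatus(new Path(dir.toString))
println(s"listed ${statuses.length} files in ${(System.nanoTime() - start) / 1e6} ms")
{code}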

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065476#comment-16065476
 ] 

Apache Spark commented on SPARK-21137:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18441

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065475#comment-16065475
 ] 

Sean Owen commented on SPARK-21137:
---

OK, so it is something that could be optimized in the Hadoop API, and seems 
somewhat specific to a local filesystem.
I opened the change we both mention here in this thread as a PR.

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21137:


Assignee: (was: Apache Spark)

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21137:


Assignee: Apache Spark

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Assignee: Apache Spark
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065448#comment-16065448
 ] 

Steve Loughran commented on SPARK-12868:


I think this is a case of HADOOP-14598: once {{FsUrlStreamHandlerFactory}} has been 
registered as the URL stream handler factory in 
{{org.apache.spark.sql.internal.SharedState}}, you can't talk to Azure.
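
For context, a minimal sketch of what that registration looks like (an illustration, 
not the SharedState code itself); the JVM only allows the factory to be installed 
once, which is why it can't be swapped out afterwards:

{code}
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// after this, java.net.URL can parse hdfs:// URLs instead of throwing
// MalformedURLException: unknown protocol: hdfs
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
val jar = new URL("hdfs:///tmp/foo.jar")
{code}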

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a jar with a HDFS URI, i.E
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> Via the spark sql JDBC interface it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.<init>(URL.java:593)
> at java.net.URL.<init>(URL.java:483)
> at java.net.URL.<init>(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065445#comment-16065445
 ] 

Steve Loughran commented on SPARK-21137:


Filed HADOOP-14600.

It looks like a very old codepath that hasn't been looked at in a while...and since then 
the native lib fstat call should be able to do this, retaining the old code only for 
when {{NativeCodeLoader.isNativeCodeLoaded() == false}}.



Sam, you said

bq. it's likely that the underlying Hadoop APIs have some yucky code that does 
something silly, I have delved down there before and my stomach cannot handle it

Oh, it's not so bad. At least you don't have to go near the assembly code bit, 
which we are all scared of. Or worse, Kerberos.

Anyway, patches welcome there, with tests

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065428#comment-16065428
 ] 

Steve Loughran commented on SPARK-21137:


PS: for now, do it in parallel via 
{{mapreduce.input.fileinputformat.list-status.num-threads}}. Even there, though, the 
fact that every thread will be exec()ing code can make it expensive (it looks like 
a full POSIX spawn on a Mac).
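
For anyone wanting to try that from Spark, a small sketch (assuming a {{SparkSession}} 
named {{spark}}; 16 is an arbitrary thread count):

{code}
// raise the thread count FileInputFormat uses when listing input paths (default is 1)
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.list-status.num-threads", "16")
{code}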



> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065423#comment-16065423
 ] 

Steve Loughran commented on SPARK-21137:


Looking at this.

Something is trying to get the permissions for every file, which is being dealt 
with by an exec and all the overheads that come with it. Looking at the code, it's in 
the constructor of {{LocatedFileStatus}}, which builds it from another 
{{FileStatus}}. Normally that is just a simple copy of a field (fast, 
efficient), but on RawLocalFileSystem it actually triggers an on-demand 
execution. It's been around for a long time (HADOOP-2288) and is surfacing here 
because you're working with the local FS. For all other filesystems it's a quick 
operation.

I think this is an issue: I don't think anybody thought this would be a 
problem, as it's just viewed as a marshalling of a LocatedFileStatus, which is 
what you get back from {{FileSystem.listLocatedStatus}}. Normally that's the 
higher performing one, not just on object stores, but because it scales better, 
being able to incrementally send back data in batches, rather than needing to 
enumerate an entire directory of files (possibly in the millions) and then send 
them around as arrays of FileStatus.  Here, it's clearly not.

What to do? I think we could consider whether it'd be possible to add this to 
the hadoop native libs and so make a fast API call. There's also the option of 
"allowing us to disable permissions entirely". That one appeals to 
me more from a Windows perspective, where you could get rid of the hadoop 
native lib and still have (most) things work there...but as it's an incomplete 
"most", it's probably an optimistic goal.



> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles resolved SPARK-21218.

Resolution: Duplicate

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065413#comment-16065413
 ] 

Michael Styles edited comment on SPARK-17091 at 6/27/17 8:26 PM:
-

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement. See attachments.


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17091:
---
Summary: Convert IN predicate to equivalent Parquet filter  (was: 
ParquetFilters rewrite IN to OR of Eq)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17091:
---
Attachment: IN Predicate.png
OR Predicate.png

> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065413#comment-16065413
 ] 

Michael Styles commented on SPARK-17091:


By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.
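
As a sketch, the kind of rewrite under discussion at the data source {{Filter}} level 
could look roughly like this (an illustration only, not the actual ParquetFilters 
change):

{code}
import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}

def rewriteIn(filter: Filter): Filter = filter match {
  // In("C1", Array(10, 20)) becomes Or(EqualTo("C1", 10), EqualTo("C1", 20))
  case In(attribute, values) if values.nonEmpty =>
    values.map(v => EqualTo(attribute, v): Filter).reduce(Or(_, _))
  case other => other
}
{code}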

> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles reopened SPARK-17091:


> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065351#comment-16065351
 ] 

Michael Kunkel edited comment on SPARK-21215 at 6/27/17 7:37 PM:
-

[~sowen] I am not attempting to argue the facts. When I try to post to the ASF, 
I instantly get a reply from Yahoo with the same subject as the mail 
I sent. The reply states that the owner no longer has access to the mailing 
list.
Please try it.



was (Author: michaelkunkel):
[~sowen] I am not attempting to argue that facts. When I try to post to the 
ASF, I instantly get a reply from yahoo, with the subject matter exactly as the 
mail I sent. The reply states that the owner no longer has access to the 
mailing list.
Try it please


> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsU

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065348#comment-16065348
 ] 

Sean Owen commented on SPARK-21215:
---

Not sure what you're looking at, but the mailing list has posts from 10 minutes 
ago:
http://apache-spark-user-list.1001560.n3.nabble.com/

I'm not sure what you mean about Yahoo, because nothing about this project or 
the ASF uses Yahoo.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveT

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065351#comment-16065351
 ] 

Michael Kunkel commented on SPARK-21215:


[~sowen] I am not attempting to argue the facts. When I try to post to the ASF, 
I instantly get a reply from Yahoo with the same subject line as the mail I 
sent. The reply states that the owner no longer has access to the mailing list.
Try it, please.


> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$cata

[jira] [Comment Edited] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065354#comment-16065354
 ] 

Michael Kunkel edited comment on SPARK-21215 at 6/27/17 7:40 PM:
-

The posts go onto the list, but the owner, ASF, does not have access to it, as 
the reply states. If they have no access to it, how can they be of help?


was (Author: michaelkunkel):
The posts go onto the list, but the owner ASF does not have access to it, as 
the reply states. If they have no access to it, how can they be the help?

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.tra

[jira] [Commented] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065347#comment-16065347
 ] 

Michael Kunkel commented on SPARK-21230:


The problem is with the Spark Encoder for an enum type, so the problem is 
Spark, as the post attempts to point out.

> Spark Encoder with mysql Enum and data truncated Error
> --
>
> Key: SPARK-21230
> URL: https://issues.apache.org/jira/browse/SPARK-21230
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
> Environment: macosX
>Reporter: Michael Kunkel
>
> I am using Spark via Java for a MYSQL/ML(machine learning) project.
> In the mysql database, I have a column "status_change_type" of type enum = 
> {broke, fixed} in a table called "status_change" in a DB called "test".
> I have an object StatusChangeDB that constructs the needed structure for the 
> table, however for the "status_change_type", I constructed it as a String. I 
> know the bytes from MYSQL enum to Java string are much different, but I am 
> using Spark, so the encoder does not recognize enums properly. However when I 
> try to set the value of the enum via a Java string, I receive the "data 
> truncated" error
> h5. org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: 
> Data truncated for column 'status_change_type' at row 1 at 
> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)
> I have tried to use enum for "status_change_type", however it fails with a 
> stack trace of
> h5. Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException 
> at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...
> h5. 
> I have tried to use the jdbc setting "jdbcCompliantTruncation=false" but this 
> does nothing as I get the same error of "data truncated" as first stated. 
> Here are my jdbc options map, in case I am using the 
> "jdbcCompliantTruncation=false" incorrectly.
> public static Map jdbcOptions() {
> Map jdbcOptions = new HashMap();
> jdbcOptions.put("url", 
> "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false");
> jdbcOptions.put("driver", "com.mysql.jdbc.Driver");
> jdbcOptions.put("dbtable", "status_change");
> jdbc

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065354#comment-16065354
 ] 

Michael Kunkel commented on SPARK-21215:


The posts go onto the list, but the owner, ASF, does not have access to it, as 
the reply states. If they have no access to it, how can they be of help?

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
> bq.   at 
> org.apache.spark.sql.catalyst.p

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065345#comment-16065345
 ] 

Michael Kunkel commented on SPARK-21215:


I looked at a few months' worth of posts, and it seems that the mailing list is 
not accepting new mail. This is because the mailing list is no longer valid, 
according to the auto-reply from the Yahoo mail server.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(Q

[jira] [Commented] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065338#comment-16065338
 ] 

Sean Owen commented on SPARK-21230:
---

This also does not look like a useful JIRA. It looks like a question about 
using MySQL and JDBC. Until it's narrowed down to a Spark issue, we'd generally 
close this.

> Spark Encoder with mysql Enum and data truncated Error
> --
>
> Key: SPARK-21230
> URL: https://issues.apache.org/jira/browse/SPARK-21230
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
> Environment: macosX
>Reporter: Michael Kunkel
>
> I am using Spark via Java for a MYSQL/ML(machine learning) project.
> In the mysql database, I have a column "status_change_type" of type enum = 
> {broke, fixed} in a table called "status_change" in a DB called "test".
> I have an object StatusChangeDB that constructs the needed structure for the 
> table, however for the "status_change_type", I constructed it as a String. I 
> know the bytes from MYSQL enum to Java string are much different, but I am 
> using Spark, so the encoder does not recognize enums properly. However when I 
> try to set the value of the enum via a Java string, I receive the "data 
> truncated" error
> h5. org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: 
> Data truncated for column 'status_change_type' at row 1 at 
> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)
> I have tried to use enum for "status_change_type", however it fails with a 
> stack trace of
> h5. Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException 
> at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...
> h5. 
> I have tried to use the jdbc setting "jdbcCompliantTruncation=false" but this 
> does nothing as I get the same error of "data truncated" as first stated. 
> Here are my jdbc options map, in case I am using the 
> "jdbcCompliantTruncation=false" incorrectly.
> public static Map jdbcOptions() {
> Map jdbcOptions = new HashMap();
> jdbcOptions.put("url", 
> "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false");
> jdbcOptions.put("driver", "com.mysql.jdbc.Driver");
> jdbcOpti

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065336#comment-16065336
 ] 

Sean Owen commented on SPARK-21215:
---

I'm not sure what you're referring to. The user@ list works fine; sometimes you 
get weird auto-responses from somebody's account on the list, but that has 
nothing to do with the list itself.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
> bq.   at 
> org.apac

[jira] [Created] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error

2017-06-27 Thread Michael Kunkel (JIRA)
Michael Kunkel created SPARK-21230:
--

 Summary: Spark Encoder with mysql Enum and data truncated Error
 Key: SPARK-21230
 URL: https://issues.apache.org/jira/browse/SPARK-21230
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.1
 Environment: macosX
Reporter: Michael Kunkel




I am using Spark via Java for a MYSQL/ML(machine learning) project.

In the mysql database, I have a column "status_change_type" of type enum = 
{broke, fixed} in a table called "status_change" in a DB called "test".

I have an object StatusChangeDB that constructs the needed structure for the 
table; however, for "status_change_type" I constructed it as a String. I know 
the byte representations of a MySQL enum and a Java String are very different, 
but I am using Spark, so the encoder does not recognize enums properly. 
However, when I try to set the value of the enum via a Java String, I receive 
the "data truncated" error

h5. org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
(TID 9, localhost, executor driver): java.sql.BatchUpdateException: Data 
truncated for column 'status_change_type' at row 1 at 
com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)

I have tried to use an enum for "status_change_type"; however, it fails with a 
stack trace of

h5. Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException 
at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...
h5. 
I have tried to use the jdbc setting "jdbcCompliantTruncation=false", but this 
does nothing, as I get the same "data truncated" error as first stated. Here is 
my jdbc options map, in case I am using "jdbcCompliantTruncation=false" 
incorrectly.

public static Map<String, String> jdbcOptions() {
    Map<String, String> jdbcOptions = new HashMap<>();
    jdbcOptions.put("url", "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false");
    jdbcOptions.put("driver", "com.mysql.jdbc.Driver");
    jdbcOptions.put("dbtable", "status_change");
    jdbcOptions.put("user", "root");
    jdbcOptions.put("password", "");
    return jdbcOptions;
}

Here is the Spark method for inserting into the mysql DB

private void insertMYSQLQuery(Dataset<Row> changeDF) {
    try {
        changeDF.write().mode(SaveMode.Append).jdbc(SparkManager.jdbcAppendOptions(),
                "status_change", new java.util.Properties());
    } catch (Exception e) {
        System.out.println(e);
    }
}

where jdbcAppendOptions uses the jdbcOptions m
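
(A minimal Scala sketch, reusing only the option values already shown above, 
for reading the table back to see what the JDBC write actually stored in 
status_change_type; the app name is arbitrary:)

{noformat}
import org.apache.spark.sql.SparkSession

// Sketch only: option values copied from the jdbcOptions() map above.
val spark = SparkSession.builder().appName("status-change-readback").getOrCreate()

val readBack = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "status_change")
  .option("user", "root")
  .option("password", "")
  .load()

readBack.select("status_change_type").show(5)
{noformat}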

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065329#comment-16065329
 ] 

Michael Kunkel commented on SPARK-21215:


The "resolution for this by [~sowen] was to put this on the spark mailing list.
But I am sure this is just a scam to toss questions because the mailing list is 
no longer accepting emails.
When sending a email to the mailing list, a reply is given

Hello, 

This employee can no longer access email on this account.  Your email will not 
be forwarded.


So, this is not resolved.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transform

[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065210#comment-16065210
 ] 

Michael Styles commented on SPARK-21218:


In Parquet 1.7, there was a bug involving corrupt statistics on binary columns 
(https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier 
versions of Spark from generating Parquet filters on any string columns. Spark 
2.1 has moved up to Parquet 1.8.2, so this issue no longer exists.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Andrew Duffy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065177#comment-16065177
 ] 

Andrew Duffy edited comment on SPARK-21218 at 6/27/17 5:39 PM:
---

Curious, I wonder what the previous benchmarks were lacking.

 Have you tried disjunction push-down on other datatypes, e.g. strings? In any 
case, I'm also on board with this change, if it is in fact useful.

I think [~hyukjin.kwon] was saying we should close this as a dupe, and rename 
the PR with the original ticket number 
[17091|https://issues.apache.org/jira/browse/SPARK-17091]. You can re-open that 
issue and say that you've done tests where this now looks like it will be a big 
improvement.


was (Author: andreweduffy):
Curious, I wonder what the previous benchmarks were lacking.

 Have you tried disjunction push-down on other datatypes, e.g. strings? In any 
case, I'm also on board with this change, if it is in fact useful.

I think [~hyukjin.kwon] was saying we should close this as a dupe, and rename 
the PR with the original ticket number (#17091). You can re-open that issue and 
say that you've done tests where this now looks like it will be a big 
improvement.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Andrew Duffy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065177#comment-16065177
 ] 

Andrew Duffy commented on SPARK-21218:
--

Curious, I wonder what the previous benchmarks were lacking.

 Have you tried disjunction push-down on other datatypes, e.g. strings? In any 
case, I'm also on board with this change, if it is in fact useful.

I think [~hyukjin.kwon] was saying we should close this as a dupe, and rename 
the PR with the original ticket number (#17091). You can re-open that issue and 
say that you've done tests where this now looks like it will be a big 
improvement.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065142#comment-16065142
 ] 

Michael Styles commented on SPARK-21218:


[~hyukjin.kwon] Not sure I understand what you want me to do with my PR, 
assuming this JIRA is resolved as a duplicate?

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19104.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18418
[https://github.com/apache/spark/pull/18418]

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
> Fix For: 2.2.0
>
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyE

[jira] [Assigned] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19104:
---

Assignee: Liang-Chi Hsieh

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompil

[jira] [Assigned] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21229:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065118#comment-16065118
 ] 

Apache Spark commented on SPARK-21229:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18440

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21229:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21229:
---

 Summary: remove QueryPlan.preCanonicalized
 Key: SPARK-21229
 URL: https://issues.apache.org/jira/browse/SPARK-21229
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065023#comment-16065023
 ] 

Apache Spark commented on SPARK-18294:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18438

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> Current `FileCommitProtocol` is based on the `mapreduce` package; we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older `mapred` 
> package's committer.
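
A rough sketch of the API gap such a protocol has to bridge, using only the two Hadoop interfaces involved (an assumption for illustration, not the implementation in the linked pull request):

{code}
import org.apache.hadoop.mapred.{JobConf, OutputCommitter => OldApiCommitter}
import org.apache.hadoop.mapreduce.{OutputFormat, TaskAttemptContext, OutputCommitter => NewApiCommitter}

// Old 'mapred' API: the committer is configured on, and obtained from, the JobConf.
def committerFromOldApi(conf: JobConf): OldApiCommitter =
  conf.getOutputCommitter

// New 'mapreduce' API (what the current FileCommitProtocol assumes): the committer
// is obtained from the OutputFormat for a given task attempt.
def committerFromNewApi(format: OutputFormat[_, _], ctx: TaskAttemptContext): NewApiCommitter =
  format.getOutputCommitter(ctx)
{code}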



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-06-27 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064972#comment-16064972
 ] 

Barry Becker commented on SPARK-20226:
--

Calling cache() on the dataframe after the addColumn used to make this run 
fast. But around the time we upgraded to Spark 2.1.1 it got very slow again, 
and calling cache on the dataframe no longer seems to help.

If I hardcode the addColumn column expression to be 
{code}
(((((((((((CAST(Plate AS STRING) + CAST(State AS STRING)) + CAST(License Type AS STRING))
  + CAST(Violation Time AS STRING)) + CAST(Violation AS STRING))
  + CAST(Judgment Entry Date AS STRING)) + CAST(Issue Date AS STRING))
  + CAST(Summons Number AS STRING)) + CAST(Fine Amount AS STRING))
  + CAST(Penalty Amount AS STRING)) + CAST(Interest Amount AS STRING))
  + CAST(Violation AS STRING))
{code}
instead of 
{code}
CAST(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Plate, State), License Type), 
Violation Time), Violation), UDF(Judgment Entry Date)), UDF(Issue Date)), 
UDF(Summons Number)), UDF(Fine Amount)), UDF(Penalty Amount)), UDF(Interest 
Amount)), Violation) AS STRING)
{code}
which is what is generated by our expression parser, then the time goes from 70 
seconds down to 10 seconds. Still slow, but not nearly as slow.
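
A minimal sketch of that workaround, assuming the input DataFrame is called df and using the column names from the example; casting each column to string and concatenating with the built-in concat keeps the whole expression in Catalyst instead of nesting opaque UDFs (note that concat returns null if any input is null):

{code}
import org.apache.spark.sql.functions.{col, concat}

// `df` is the DataFrame from the report above (assumption).
val cols = Seq("Plate", "State", "License Type", "Violation Time", "Violation",
  "Judgment Entry Date", "Issue Date", "Summons Number", "Fine Amount",
  "Penalty Amount", "Interest Amount")

// Cast every column to string and concatenate them using built-in expressions only.
val combined = df.withColumn("combined", concat(cols.map(c => col(c).cast("string")): _*))
{code}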

> Call to sqlContext.cacheTable takes an incredibly long time in some cases
> -
>
> Key: SPARK-20226
> URL: https://issues.apache.org/jira/browse/SPARK-20226
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: linux or windows
>Reporter: Barry Becker
>  Labels: cache
> Attachments: profile_indexer2.PNG, xyzzy.csv
>
>
> I have a case where the call to sqlContext.cacheTable can take an arbitrarily 
> long time depending on the number of columns that are referenced in a 
> withColumn expression applied to a dataframe.
> The dataset is small (20 columns 7861 rows). The sequence to reproduce is the 
> following:
> 1) add a new column that references 8 - 14 of the columns in the dataset. 
>- If I add 8 columns, then the call to cacheTable is fast - like *5 
> seconds*
>- If I add 11 columns, then it is slow - like *60 seconds*
>- and if I add 14 columns, then it basically *takes forever* - I gave up 
> after 10 minutes or so.
>   The Column expression that is added, is basically just concatenating 
> the columns together in a single string. If a number is concatenated on a 
> string (or vice versa) the number is first converted to a string.
>   The expression looks something like this:
> {code}
> `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + 
> `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + 
> `Penalty Amount` + `Interest Amount`
> {code}
> which we then convert to a Column expression that looks like this:
> {code}
> UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), 
> UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), 
> UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), 
> UDF('Interest Amount))
> {code}
>where the UDFs are very simple functions that basically call toString 
> and + as needed.
> 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet)
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License
>  Type_CLEANED__","handleInvalid":"skip","outputCol":"License 
> Type_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing
>  Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759

[jira] [Commented] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064970#comment-16064970
 ] 

Bogdan Raducanu commented on SPARK-21228:
-

InSubquery.doCodeGen is using InSet directly (although InSubquery itself is 
never used) so a fix should consider this too.

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval) which will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet or not 
> trigger InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20002) Add support for unions between streaming and batch datasets

2017-06-27 Thread Leon Pham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064947#comment-16064947
 ] 

Leon Pham commented on SPARK-20002:
---

We're actually reading data from two different sources, and one of them is a 
batch source that doesn't make sense as a streaming one. We're trying to 
concatenate the batch data with the streaming datasets so they can be analyzed 
together.
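
A sketch of the pattern that is currently rejected; the socket and text sources, host, port and path below are placeholders standing in for our real sources, and a SparkSession named spark is assumed:

{code}
// Streaming source: a single 'value' string column.
val streamDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// Batch source with the same schema (a single 'value' string column).
val batchDF = spark.read.text("/path/to/reference")

// val combined = streamDF.union(batchDF)   // fails once the query is started:
//                                          // unions between streaming and batch datasets are not supported
{code}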

> Add support for unions between streaming and batch datasets
> ---
>
> Key: SPARK-20002
> URL: https://issues.apache.org/jira/browse/SPARK-20002
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Leon Pham
>
> Currently unions between streaming datasets and batch datasets are not 
> supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064945#comment-16064945
 ] 

Bogdan Raducanu commented on SPARK-21228:
-

I tested manually (since there is no flag to disable codegen for expressions) 
that In.eval also fails, so only In.doCodeGen appears correct.
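
For illustration, a rough sketch of the safe/unsafe normalization idea from the description (an assumption about one possible direction, not the actual fix): project each struct to its UnsafeRow form before it is stored in, or looked up against, hset, so contains() always compares rows with the same physical representation.

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._

val structType = StructType(Seq(StructField("a", LongType), StructField("b", LongType)))
val toUnsafe = UnsafeProjection.create(Array[DataType](structType))

// Wrap the struct in a single-field row, project it to the unsafe format, then copy
// the result out of the projection's reused buffer.
def normalize(struct: InternalRow): InternalRow =
  toUnsafe(InternalRow(struct)).getStruct(0, structType.length).copy()
{code}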

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval) which will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet or not 
> trigger InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064880#comment-16064880
 ] 

Hyukjin Kwon commented on SPARK-21218:
--

Yea, I support this for what it's worth. Let's resolve this as a duplicate BTW. 
You could link that JIRA in your PR title.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
doCodeGen and eval) which will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}

In.doCodeGen uses compareStructs and seems to work. In.eval might not work but 
not sure how to reproduce.

{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet or not trigger 
InSet optimization at all in this case.
Need to investigate if In.eval is affected.


  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.



> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval) which will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet or not 
> trigger InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Summary: InSet incorrect handling of structs  (was: InSet.doCodeGen 
incorrect handling of structs)

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show -- the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show -- the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show -- the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
```
In.doCodeGen appears to be correct:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-+
| minA|
+-+
|[1,1]|
+-+
```

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-21228:
---

 Summary: InSet.doCodeGen incorrect handling of structs
 Key: SPARK-21228
 URL: https://issues.apache.org/jira/browse/SPARK-21228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Bogdan Raducanu


In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
+----+
|minA|
+----+
+----+
```
In.doCodeGen appears to be correct:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-----+
| minA|
+-----+
|[1,1]|
+-----+
```

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-06-27 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064833#comment-16064833
 ] 

Dominic Ricard edited comment on SPARK-21067 at 6/27/17 1:31 PM:
-

[~q79969786], yes. As stated in the description, ours is set to 
"/tmp/hive-staging/\{user.name\}"


was (Author: dricard):
[~q79969786], yes. As stated in the description, ours is set to 
"/tmp/hive-staging/{user.name}"

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.executio

[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-06-27 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064833#comment-16064833
 ] 

Dominic Ricard commented on SPARK-21067:


[~q79969786], yes. As stated in the description, ours is set to 
"/tmp/hive-staging/{user.name}"

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala

[jira] [Updated] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Seydou Dia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seydou Dia updated SPARK-21227:
---
Description: 
Hi,

please find below the steps to reproduce the issue I am facing.
First, I create a JSON dataset with two fields:

* city_name
* cıty_name

The first one is valid ASCII, while the second contains a Unicode character (ı, a 
dotless i).
When I try to select from the dataframe I get an {{AnalysisException}}.

{code:python}
$ pyspark

Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , '{"city_name": "rome"}'
... , '{"city_name": "berlin"}'
... , '{"cıty_name": "new-york"}'
... , '{"cıty_name": "toronto"}'
... , '{"cıty_name": "chicago"}'
... , '{"cıty_name": "dubai"}']
>>> myRDD = sc.parallelize(js)
>>> myDF = spark.read.json(myRDD)
>>> myDF.printSchema()  
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)

>>> myDF.select(myDF['city_name'])
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
: org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
could be: city_name#29, city_name#30.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
__getitem__
jc = self._jdf.apply(item)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, could 
be: city_name#29, city_name#30.;"



{code}
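One possible workaround, sketched below in Scala (an assumption on my part rather than something from the report: it relies on Dataset.toDF renaming columns by position, and on the new names not being case-insensitively equal), is to rename the columns before selecting, so attribute resolution never has to distinguish 'city_name' from 'cıty_name':

{code:scala}
// Hedged workaround sketch, not a fix: rename the columns positionally so the
// case-insensitive resolver never sees two names that collapse to the same string.
val js = Seq(
  """{"city_name": "paris"}""",
  """{"city_name": "rome"}""",
  """{"cıty_name": "new-york"}""",
  """{"cıty_name": "toronto"}""")
val myDF = spark.read.json(spark.sparkContext.parallelize(js))

// toDF renames by position (city_name first, cıty_name second, as printSchema shows),
// bypassing name-based resolution entirely.
val renamed = myDF.toDF("city_name_ascii", "city_name_dotless")
renamed.select("city_name_ascii").show()
{code}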


  was:
Hi,

please find below the step to reproduce the issue I am facing,


{code:python}
$ pyspark

Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
    __
 / __/__  ___ _/ /__

[jira] [Created] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Seydou Dia (JIRA)
Seydou Dia created SPARK-21227:
--

 Summary: Unicode in Json field causes AnalysisException when 
selecting from Dataframe
 Key: SPARK-21227
 URL: https://issues.apache.org/jira/browse/SPARK-21227
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Seydou Dia


Hi,

please find below the steps to reproduce the issue I am facing:
$ pyspark


{code:python}
Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , '{"city_name": "rome"}'
... , '{"city_name": "berlin"}'
... , '{"cıty_name": "new-york"}'
... , '{"cıty_name": "toronto"}'
... , '{"cıty_name": "chicago"}'
... , '{"cıty_name": "dubai"}']
>>> myRDD = sc.parallelize(js)
>>> myDF = spark.read.json(myRDD)
>>> myDF.printSchema()  
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)

>>> myDF.select(myDF['city_name'])
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
: org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
could be: city_name#29, city_name#30.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
__getitem__
jc = self._jdf.apply(item)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, could 
be: city_name#29, city_name#30.;"



{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Seydou Dia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seydou Dia updated SPARK-21227:
---
Description: 
Hi,

please find below the steps to reproduce the issue I am facing:


{code:python}
$ pyspark

Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , '{"city_name": "rome"}'
... , '{"city_name": "berlin"}'
... , '{"cıty_name": "new-york"}'
... , '{"cıty_name": "toronto"}'
... , '{"cıty_name": "chicago"}'
... , '{"cıty_name": "dubai"}']
>>> myRDD = sc.parallelize(js)
>>> myDF = spark.read.json(myRDD)
>>> myDF.printSchema()  
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)

>>> myDF.select(myDF['city_name'])
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
: org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
could be: city_name#29, city_name#30.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
__getitem__
jc = self._jdf.apply(item)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, could 
be: city_name#29, city_name#30.;"



{code}


  was:
Hi,

please find below the step to reproduce the issue I am facing,
$ pyspark


{code:python}
Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , 

[jira] [Assigned] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21176:


Assignee: Apache Spark

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>Assignee: Apache Spark
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.
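For reference, a minimal sketch of the workaround quoted above (assuming Jetty's QueuedThreadPool and Server APIs; the actual change to be proposed in JettyUtils.scala will differ from this):

{code:scala}
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.util.thread.QueuedThreadPool

// Sketch only: enlarge the server thread pool so the per-executor proxy
// selector threads cannot exhaust it. 400 is the value quoted in the report,
// not a tuned recommendation (Jetty's default maxThreads is 200).
val pool = new QueuedThreadPool(400)
pool.setDaemon(true)
val server = new Server(pool)
{code}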



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064747#comment-16064747
 ] 

Apache Spark commented on SPARK-21176:
--

User 'IngoSchuster' has created a pull request for this issue:
https://github.com/apache/spark/pull/18437

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21176:


Assignee: (was: Apache Spark)

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 12:17 PM:
--

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see 
attachments). Also, the IN predicate test took about 1.1 minutes, while the OR 
predicate test took about 16 seconds. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
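To make the intended effect concrete, here is a sketch of the two forms on the DataFrame side (the table path and column name are placeholders; whether Parquet actually skips row groups still depends on the file's min/max statistics):

{code:scala}
import org.apache.spark.sql.functions.col

// Placeholder input: any Parquet-backed DataFrame with an integer column C1.
val df = spark.read.parquet("/path/to/parquet")

// IN form: without the proposed rewrite this predicate is not pushed to Parquet.
val viaIn = df.filter(col("C1").isin(10, 20))

// Equivalent OR-of-equalities form from the description, which the existing
// Parquet filter pushdown can already translate (Eq filters joined by Or).
val viaOr = df.filter(col("C1") === 10 || col("C1") === 20)

viaIn.explain()   // compare the PushedFilters entries in the two physical plans
viaOr.explain()
{code}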



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


