[jira] [Resolved] (SPARK-18368) Regular expression replace throws NullPointerException when serialized

2016-11-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18368.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.1.0
   2.0.3

> Regular expression replace throws NullPointerException when serialized
> --
>
> Key: SPARK-18368
> URL: https://issues.apache.org/jira/browse/SPARK-18368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.0.3, 2.1.0
>
>
> This query fails with a [NullPointerException on line 
> 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:
> {code}
> SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS 
> (rank, col0) FROM table;
> {code}
> The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
> after it is instantiated. The null value is a transient StringBuffer that 
> should hold the result. The fix is to make the result value lazy.
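A minimal sketch of the failure mode and the fix, outside of the actual Spark classes (the class and method names below are illustrative only): a plain @transient field is null on the deserialized copy, while a @transient lazy val is re-created on first use.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Illustrative only: a transient val is null after a Java-serialization round
// trip, so using it on the deserialized copy throws NullPointerException.
class EagerBuffer extends Serializable {
  @transient private val result = new StringBuffer
  def append(s: String): String = result.append(s).toString
}

// The pattern of the fix: a transient lazy val is re-created on first access
// after deserialization, so the buffer is never null.
class LazyBuffer extends Serializable {
  @transient private lazy val result = new StringBuffer
  def append(s: String): String = result.append(s).toString
}

object TransientLazyDemo {
  private def roundTrip[T](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    new ObjectOutputStream(bytes).writeObject(obj)
    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    println(roundTrip(new LazyBuffer).append("ok"))   // prints "ok"
    // roundTrip(new EagerBuffer).append("boom")      // would throw NullPointerException
  }
}
{code}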






[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650025#comment-15650025
 ] 

Reynold Xin commented on SPARK-18352:
-

Again, this has nothing to do with streaming. It should just be an option (e.g. 
multilineJson, or wholeFile) for JSON.
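A hedged sketch of what the reader-side option could look like; the option name ("wholeFile") is just one of the names floated here and is not a committed API, and the paths are placeholders.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multiline-json").master("local[*]").getOrCreate()

// Today: each line must be a complete JSON record (JSON Lines).
val jsonLines = spark.read.json("/path/to/records.jsonl")

// Proposed: treat each file as one (possibly pretty-printed) JSON document.
// "wholeFile" is a hypothetical option name from this discussion.
val wholeDocs = spark.read
  .option("wholeFile", "true")
  .json("/path/to/documents/")
{code}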


> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse regular JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and instead stream through them to parse the JSON documents.






[jira] [Resolved] (SPARK-18333) Revert hacks in parquet and orc reader to support case insensitive resolution

2016-11-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18333.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Revert hacks in parquet and orc reader to support case insensitive resolution
> -
>
> Key: SPARK-18333
> URL: https://issues.apache.org/jira/browse/SPARK-18333
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> These are no longer needed after 
> https://issues.apache.org/jira/browse/SPARK-17183






[jira] [Comment Edited] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-08 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649982#comment-15649982
 ] 

Thomas Sebastian edited comment on SPARK-18352 at 11/9/16 6:49 AM:
---

Hi [~rxin],
So, do you mean that the streaming API need not be used, and that there should 
be a new API which can read multiple JSON files?

-Thomas


was (Author: thomastechs):
Hi Reynold,
So, do you mean that the streaming API need not be used, and that there should 
be a new API which can read multiple JSON files?

-Thomas

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse regular JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and instead stream through them to parse the JSON documents.






[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-08 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649982#comment-15649982
 ] 

Thomas Sebastian commented on SPARK-18352:
--

Hi Reynold,
So, do you mean that the streaming API need not be used, and that there should 
be a new API which can read multiple JSON files?

-Thomas

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse regular JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and instead stream through them to parse the JSON documents.






[jira] [Commented] (SPARK-18350) Support session local timezone

2016-11-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649969#comment-15649969
 ] 

Xiao Li commented on SPARK-18350:
-

Agreed. A session-specific SQL conf can be used here.

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.






[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-08 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649953#comment-15649953
 ] 

yuhao yang commented on SPARK-18374:


Just to provide some history for the issue: 
https://github.com/apache/spark/pull/11871#issuecomment-199832509, according to 
which the English stop words were augmented.

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>
> I was just double-checking english.txt for the list of stop words, as I felt 
> it was taking out valid tokens like 'won'. I think the issue is that the 
> english.txt list is missing the apostrophe and all characters after it. So 
> "won't" became "won" in that list, and "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both "won't" and "wont" 
> should be part of english.txt, since some tokenizers might remove special 
> characters. But 'won' obviously shouldn't be in this list.






[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mingjie tang updated SPARK-18372:
-
Attachment: _thumb_37664.png

The staging directory fails to be removed when the Hive table is in HDFS.

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
> Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and keep accumulating
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the folder 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restarting the 
> components:
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case - this is 
> Spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py






[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649935#comment-15649935
 ] 

mingjie tang commented on SPARK-18372:
--

This bug can be reproduced with the following code: 

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS T1 (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '../examples/src/main/resources/kv1.txt' 
INTO TABLE T1")
sqlContext.sql("CREATE TABLE IF NOT EXISTS T2 (key INT, value STRING)")
val sparktestdf = sqlContext.table("T1")
val dfw  = sparktestdf.write
dfw.insertInto("T2")
val sparktestcopypydfdf = sqlContext.sql("""SELECT *  from T2  """) 
sparktestcopypydfdf.show

After the user quits the spark-shell, the related .hive-staging directory 
generated by the Hive writer is not removed. 
For example, for the Hive table in the directory /user/hive/warehouse/t2:
drwxr-xrwx  3 root  wheel   102 Oct 28 15:02 
.hive-staging_hive_2016-10-28_15-02-43_288_7070526396398178792-1
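As a stop-gap (not the fix itself), the leftover staging directories can be removed from the spark-shell with the Hadoop FileSystem API; the table path below is just the example from this report.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Workaround sketch only: delete leftover .hive-staging_* directories under a table path.
val tablePath = new Path("/user/hive/warehouse/t2")
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(tablePath)
  .filter(_.getPath.getName.startsWith(".hive-staging"))
  .foreach(status => fs.delete(status.getPath, true))
{code}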


> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and keep accumulating
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the folder 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restarting the 
> components:
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case - this is 
> Spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py






[jira] [Commented] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans

2016-11-08 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649900#comment-15649900
 ] 

zhengruifeng commented on SPARK-18282:
--

This is a duplicate of SPARK-18240, but I'd prefer that you take it over.

> Add model summaries for Python GMM and BisectingKMeans
> --
>
> Key: SPARK-18282
> URL: https://issues.apache.org/jira/browse/SPARK-18282
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> GaussianMixtureModel and BisectingKMeansModel in Python do not have model 
> summaries, but they are implemented in Scala. We should add them for API 
> parity before the 2.1 release. After the QA JIRAs are created, this can be 
> linked as a subtask.






[jira] [Closed] (SPARK-18240) Add Summary of BiKMeans and GMM in pyspark

2016-11-08 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-18240.

Resolution: Duplicate

> Add Summary of BiKMeans and GMM in pyspark
> --
>
> Key: SPARK-18240
> URL: https://issues.apache.org/jira/browse/SPARK-18240
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Add Summary of BiKMeans and GMM in pyspark.
> Since KMeansSummary in pyspark will be added in SPARK-15819, this JIRA will 
> not deal with KMeans.






[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread mapreduced (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649891#comment-15649891
 ] 

mapreduced commented on SPARK-18371:


I'll try to test it out hopefully soon.

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> The expectation is that it should, over time, reduce the number of records 
> per batch to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.






[jira] [Commented] (SPARK-18350) Support session local timezone

2016-11-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649865#comment-15649865
 ] 

Reynold Xin commented on SPARK-18350:
-

If it is session specific, I don't think we need an API. Just use the existing 
SQL conf.

Also no need to introduce a new data type. That's a much bigger scope.
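A hedged sketch of what using the existing SQL conf for this could look like; the key name spark.sql.session.timeZone is an assumption for illustration, not a settled name.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Per-session setting through the existing runtime conf...
// (the key name is hypothetical at this point)
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

// ...or through SQL, so different sessions can use different timezones
// without touching the JVM/machine default.
spark.sql("SET spark.sql.session.timeZone=Asia/Tokyo")
println(spark.conf.get("spark.sql.session.timeZone"))
{code}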


> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.






[jira] [Comment Edited] (SPARK-18350) Support session local timezone

2016-11-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649858#comment-15649858
 ] 

Xiao Li edited comment on SPARK-18350 at 11/9/16 5:26 AM:
--

The following might be needed if we want to support a session timezone:

- Add a SQL statement and API to set the current session timezone?
Link: 
https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1006728
- Add a SQL statement and API to get the current session timezone? 
Link: 
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_currenttz.html
- Add time zone specific expressions? 
Link: 
http://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_tzspecificexpression.html


More work is needed if we want to add a new data type {{TIMESTAMP WITH TIME 
ZONE}}.
Link: 
https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946


was (Author: smilegator):
The following might be needed if we want to support a session timezone:

- Add a SQL statement and API to set the current session timezone?
Link: 
https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1006728
- Add a SQL statement and API to get the current session timezone? 
Link: 
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_currenttz.html
- Add time zone specific expressions? 
Link: 
http://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_tzspecificexpression.html


More work is needed if we want to add a new data type {TIMESTAMP WITH TIME 
ZONE}.
Link: 
https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.






[jira] [Commented] (SPARK-18350) Support session local timezone

2016-11-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649858#comment-15649858
 ] 

Xiao Li commented on SPARK-18350:
-

The following might be needed if we want to support a session timezone:

- Add a SQL statement and API to set the current session timezone?
Link: 
https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1006728
- Add a SQL statement and API to get the current session timezone? 
Link: 
https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_currenttz.html
- Add time zone specific expressions? 
Link: 
http://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_tzspecificexpression.html


More work is needed if we want to add a new data type {TIMESTAMP WITH TIME 
ZONE}.
Link: 
https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.






[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649835#comment-15649835
 ] 

Reynold Xin commented on SPARK-18352:
-

There is already a readStream.json.

"Stream" here means not having to read the entire file in memory at once, but 
rather just "stream through" it, i.e. parse as we scan.


> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse regular JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and instead stream through them to parse the JSON documents.






[jira] [Commented] (SPARK-18377) warehouse path should be a static conf

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649809#comment-15649809
 ] 

Apache Spark commented on SPARK-18377:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15825

> warehouse path should be a static conf
> --
>
> Key: SPARK-18377
> URL: https://issues.apache.org/jira/browse/SPARK-18377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-18377) warehouse path should be a static conf

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18377:


Assignee: Wenchen Fan  (was: Apache Spark)

> warehouse path should be a static conf
> --
>
> Key: SPARK-18377
> URL: https://issues.apache.org/jira/browse/SPARK-18377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-18377) warehouse path should be a static conf

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18377:


Assignee: Apache Spark  (was: Wenchen Fan)

> warehouse path should be a static conf
> --
>
> Key: SPARK-18377
> URL: https://issues.apache.org/jira/browse/SPARK-18377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-08 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649803#comment-15649803
 ] 

Jayadevan M commented on SPARK-18352:
-

[~rxin] Are you looking for a new API like spark.readStream.json(path), similar 
to spark.read.json(path)?

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse regular JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and instead stream through them to parse the JSON documents.






[jira] [Created] (SPARK-18377) warehouse path should be a static conf

2016-11-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18377:
---

 Summary: warehouse path should be a static conf
 Key: SPARK-18377
 URL: https://issues.apache.org/jira/browse/SPARK-18377
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements

2016-11-08 Thread Assaf Mendelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649792#comment-15649792
 ] 

Assaf Mendelson commented on SPARK-17691:
-

I don't believe a UDF (or rather a UDAF in this case) is the way to go.
A UDAF must use standard Spark types, which means it must use immutable arrays.
Since ordering is an issue here, an array would have to be created and copied 
each time the data changes, instead of just changing the relevant array 
elements, which leads to horrible performance (I actually tried it; it cost 
more time than the full collect_list).
Implementing this as a new function would allow it to use a mutable array 
internally and improve performance.
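A hedged workaround sketch with today's API (not the proposed built-in): bound each group with a window rank before collect_list, so the aggregation never sees more than N rows. The column names and the limit are illustrative.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 3), ("a", 1), ("a", 2), ("b", 9)).toDF("key", "value")
val maxElements = 2

// Keep only the top-N rows per key, then collect; this avoids an unbounded collect_list.
val w = Window.partitionBy($"key").orderBy($"value".desc)
df.withColumn("rn", row_number().over(w))
  .where($"rn" <= maxElements)
  .groupBy($"key")
  .agg(collect_list($"value").as("top_values"))
  .show()
{code}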


> Add aggregate function to collect list with maximum number of elements
> --
>
> Key: SPARK-17691
> URL: https://issues.apache.org/jira/browse/SPARK-17691
> Project: Spark
>  Issue Type: New Feature
>Reporter: Assaf Mendelson
>Priority: Minor
>
> One of the aggregate functions we have today is the collect_list function. 
> This is a useful tool to do a "catch all" aggregation which doesn't really 
> fit anywhere else.
> The problem with collect_list is that it is unbounded. I would like to see a 
> means to do a collect_list where we limit the maximum number of elements.
> I would see that the input for this would be the maximum number of elements 
> to use and the method of choosing (pick whatever, pick the top N, pick the 
> bottom B)






[jira] [Assigned] (SPARK-18376) Skip subexpression elimination for conditional expressions

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18376:


Assignee: Apache Spark

> Skip subexpression elimination for conditional expressions
> --
>
> Key: SPARK-18376
> URL: https://issues.apache.org/jira/browse/SPARK-18376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We should disallow subexpression elimination for expressions wrapped in 
> conditional expressions such as {{If}}.
> Consider an example like this:
> {code}
> if (isNull(subexpr)) {
>   ...
> } else {
>   AssertNotNull(subexpr)  // subexpr2
>   
>   SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
> }
> {code}
> AssertNotNull(subexpr) will be recognized as a subexpression and evaluated 
> even though the else branch is never run. In such cases, it can cause 
> unexpected exceptions and performance regressions.






[jira] [Assigned] (SPARK-18376) Skip subexpression elimination for conditional expressions

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18376:


Assignee: (was: Apache Spark)

> Skip subexpression elimination for conditional expressions
> --
>
> Key: SPARK-18376
> URL: https://issues.apache.org/jira/browse/SPARK-18376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We should disallow subexpression elimination for expressions wrapped in 
> conditional expressions such as {{If}}.
> Consider an example like this:
> {code}
> if (isNull(subexpr)) {
>   ...
> } else {
>   AssertNotNull(subexpr)  // subexpr2
>   
>   SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
> }
> {code}
> AssertNotNull(subexpr) will be recognized as a subexpression and evaluated 
> even though the else branch is never run. In such cases, it can cause 
> unexpected exceptions and performance regressions.






[jira] [Commented] (SPARK-18376) Skip subexpression elimination for conditional expressions

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649779#comment-15649779
 ] 

Apache Spark commented on SPARK-18376:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15824

> Skip subexpression elimination for conditional expressions
> --
>
> Key: SPARK-18376
> URL: https://issues.apache.org/jira/browse/SPARK-18376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We should disallow subexpression elimination for expressions wrapped in 
> conditional expressions such as {{If}}.
> Consider an example like this:
> {code}
> if (isNull(subexpr)) {
>   ...
> } else {
>   AssertNotNull(subexpr)  // subexpr2
>   
>   SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
> }
> {code}
> AssertNotNull(subexpr) will be recognized as a subexpression and evaluated 
> even though the else branch is never run. In such cases, it can cause 
> unexpected exceptions and performance regressions.






[jira] [Created] (SPARK-18376) Skip subexpression elimination for conditional expressions

2016-11-08 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-18376:
---

 Summary: Skip subexpression elimination for conditional expressions
 Key: SPARK-18376
 URL: https://issues.apache.org/jira/browse/SPARK-18376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


We should disallow subexpression elimination for expressions wrapped in 
conditional expressions such as {{If}}.

Consider an example like this:

{code}
if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)  // subexpr2
  
  SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
}
{code}

AssertNotNull(subexpr) will be recognized as a subexpression and evaluated even 
though the else branch is never run. In such cases, it can cause unexpected 
exceptions and performance regressions.








[jira] [Commented] (SPARK-18191) Port RDD API to use commit protocol

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649674#comment-15649674
 ] 

Apache Spark commented on SPARK-18191:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15823

> Port RDD API to use commit protocol
> ---
>
> Key: SPARK-18191
> URL: https://issues.apache.org/jira/browse/SPARK-18191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Jiang Xingbo
> Fix For: 2.1.0
>
>
> Commit protocol is actually not specific to SQL. We can move it over to core 
> so the RDD API can use it too.






[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649638#comment-15649638
 ] 

Cody Koeninger commented on SPARK-18371:


Thanks for digging into this.  The other thing I noticed when working on

https://github.com/apache/spark/pull/15132

is that the return value of getLatestRate was cast to Int, which seems wrong 
and possibly subject to overflow.

If you have the ability to test that PR (shouldn't require a spark redeploy, 
since the kafka jar is standalone), may want to test it out.
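Until the rate estimator behaves, a hedged mitigation sketch is to cap the per-partition rate so no single batch can balloon; the values and the 5-second batch interval are illustrative only.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("backpressure-cap")
  .set("spark.streaming.backpressure.enabled", "true")
  // Upper bound per Kafka partition per second, so the first batch and any
  // post-delay batch stay bounded even if the estimated rate is wrong.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  .set("spark.streaming.backpressure.initialRate", "1000")

val ssc = new StreamingContext(conf, Seconds(5))
{code}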

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> The expectation is that it should, over time, reduce the number of records 
> per batch to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.






[jira] [Updated] (SPARK-18375) Upgrade netty to 4.0.42.Final

2016-11-08 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-18375:

Summary: Upgrade netty to 4.0.42.Final   (was: Upgrade netty to 4.0.42)

> Upgrade netty to 4.0.42.Final 
> --
>
> Key: SPARK-18375
> URL: https://issues.apache.org/jira/browse/SPARK-18375
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>
> One of the important changes in 4.0.42.Final is "Support any FileRegion 
> implementation when using epoll transport 
> [#5825|https://github.com/netty/netty/pull/5825]".
> In 4.0.42.Final, 
> [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
> can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.






[jira] [Updated] (SPARK-18375) Upgrade netty to 4.0.42

2016-11-08 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-18375:

Summary: Upgrade netty to 4.0.42  (was: Upgrade netty to 4.042)

> Upgrade netty to 4.0.42
> ---
>
> Key: SPARK-18375
> URL: https://issues.apache.org/jira/browse/SPARK-18375
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>
> One of the important changes in 4.0.42.Final is "Support any FileRegion 
> implementation when using epoll transport 
> [#5825|https://github.com/netty/netty/pull/5825]".
> In 4.0.42.Final, 
> [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
> can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.






[jira] [Updated] (SPARK-18375) Upgrade netty to 4.042

2016-11-08 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-18375:

Description: 
One of the important changes in 4.0.42.Final is "Support any FileRegion 
implementation when using epoll transport 
[#5825|https://github.com/netty/netty/pull/5825]".

In 4.0.42.Final, 
[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.

  was:
The important change in 4.0.42.Final is "Support any FileRegion 
implementation when using epoll transport 
[#5825|https://github.com/netty/netty/pull/5825]".

In 4.0.42.Final, 
[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.


> Upgrade netty to 4.042
> --
>
> Key: SPARK-18375
> URL: https://issues.apache.org/jira/browse/SPARK-18375
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>
> One of the important changes in 4.0.42.Final is "Support any FileRegion 
> implementation when using epoll transport 
> [#5825|https://github.com/netty/netty/pull/5825]".
> In 4.0.42.Final, 
> [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
> can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.






[jira] [Updated] (SPARK-18375) Upgrade netty to 4.042

2016-11-08 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-18375:

Affects Version/s: 1.6.2

> Upgrade netty to 4.042
> --
>
> Key: SPARK-18375
> URL: https://issues.apache.org/jira/browse/SPARK-18375
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Guoqiang Li
>
> The important change in 4.0.42.Final is "Support any FileRegion 
> implementation when using epoll transport 
> [#5825|https://github.com/netty/netty/pull/5825]".
> In 4.0.42.Final, 
> [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
> can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.






[jira] [Created] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-08 Thread nirav patel (JIRA)
nirav patel created SPARK-18374:
---

 Summary: Incorrect words in StopWords/english.txt
 Key: SPARK-18374
 URL: https://issues.apache.org/jira/browse/SPARK-18374
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.1
Reporter: nirav patel


I was just double-checking english.txt for the list of stop words, as I felt it 
was taking out valid tokens like 'won'. I think the issue is that the 
english.txt list is missing the apostrophe and all characters after it. So 
"won't" became "won" in that list, and "wouldn't" became "wouldn".

Here are some incorrect tokens in this list:

won
wouldn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren

I think the ideal list should have both styles, i.e. both "won't" and "wont" 
should be part of english.txt, since some tokenizers might remove special 
characters. But 'won' obviously shouldn't be in this list.
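A small sketch of how the current list interacts with tokenized text; StopWordsRemover loads the default English list, and the exact entries depend on the Spark version.

{code}
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val remover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")

// Because the list contains the bare token "won", a legitimate use of the word
// "won" is dropped along with the apostrophe-stripped contraction.
val df = Seq(Tuple1(Seq("she", "won", "the", "race"))).toDF("tokens")
remover.transform(df).show(truncate = false)
{code}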






[jira] [Created] (SPARK-18375) Upgrade netty to 4.042

2016-11-08 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-18375:
---

 Summary: Upgrade netty to 4.042
 Key: SPARK-18375
 URL: https://issues.apache.org/jira/browse/SPARK-18375
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core
Affects Versions: 2.0.1
Reporter: Guoqiang Li


The important change in 4.0.42.Final is "Support any FileRegion 
implementation when using epoll transport 
[#5825|https://github.com/netty/netty/pull/5825]".

In 4.0.42.Final, 
[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java]
can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll.
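For reference, a sketch of enabling the epoll transport that the description refers to; the keys follow the spark.<module>.io.mode pattern mentioned above, and the values are illustrative.

{code}
import org.apache.spark.SparkConf

// Switch the shuffle and RPC transports from the default NIO to epoll (Linux only);
// this is the configuration the FileRegion change above is relevant to.
val conf = new SparkConf()
  .set("spark.shuffle.io.mode", "EPOLL")
  .set("spark.rpc.io.mode", "EPOLL")
{code}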






[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService

2016-11-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649506#comment-15649506
 ] 

Steven Rand commented on SPARK-18364:
-

The solution I'd initially had in mind, simply creating a {{MetricsSystem}} in 
{{YarnShuffleService}}, isn't ideal, since then {{network-yarn}} has to depend 
on {{core}}.

I tried splitting the metrics code currently in {{core}} into its own module, 
but this is pretty tough given how much other code from {{core}} the metrics 
code depends on. And the {{metrics}} module can't depend on {{core}}, since 
that creates a circular dependency (core has to depend on metrics).

Another idea might be to try to get metrics out of the executors instead of 
from ExternalShuffleBlockHandler, though this also seems tricky.

Open to ideas on how to proceed here if anyone has them.


> expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-1645, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.






[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649436#comment-15649436
 ] 

Apache Spark commented on SPARK-13534:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/15821

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.






[jira] [Assigned] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13534:


Assignee: (was: Apache Spark)

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.






[jira] [Assigned] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13534:


Assignee: Apache Spark

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
>Assignee: Apache Spark
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.






[jira] [Assigned] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18373:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Make KafkaSource's failOnDataLoss=false work with Spark jobs
> 
>
> Key: SPARK-18373
> URL: https://issues.apache.org/jira/browse/SPARK-18373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Right now failOnDataLoss=false doesn't affect Spark jobs launched by 
> KafkaSource. The job may still fail the query when some topics are deleted or 
> some data is aged out. We should handle these corner cases in Spark jobs as 
> well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18373:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Make KafkaSource's failOnDataLoss=false work with Spark jobs
> 
>
> Key: SPARK-18373
> URL: https://issues.apache.org/jira/browse/SPARK-18373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now failOnDataLoss=false doesn't affect Spark jobs launched by 
> KafkaSource. The job may still fail the query when some topics are deleted or 
> some data is aged out. We should handle these corner cases in Spark jobs as 
> well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649430#comment-15649430
 ] 

Apache Spark commented on SPARK-18373:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15820

> Make KafkaSource's failOnDataLoss=false work with Spark jobs
> 
>
> Key: SPARK-18373
> URL: https://issues.apache.org/jira/browse/SPARK-18373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now failOnDataLoss=false doesn't affect Spark jobs launched by 
> KafkaSource. The job may still fail the query when some topics are deleted or 
> some data is aged out. We should handle these corner cases in Spark jobs as 
> well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs

2016-11-08 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18373:


 Summary: Make KafkaSource's failOnDataLoss=false work with Spark 
jobs
 Key: SPARK-18373
 URL: https://issues.apache.org/jira/browse/SPARK-18373
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now failOnDataLoss=false doesn't affect Spark jobs launched by 
KafkaSource. The job may still fail the query when some topics are deleted or 
some data is aged out. We should handle these corner cases in Spark jobs as 
well.
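
For context, a sketch of how the option is set from PySpark on the Structured
Streaming Kafka source (broker address and topic name are placeholders, and the
spark-sql-kafka-0-10 package is assumed to be on the classpath):

{code}
# Kafka source with failOnDataLoss=false: when topics are deleted or data is
# aged out, the query should warn instead of failing -- the corner cases this
# issue is about extending to the Spark jobs the source launches.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .option("failOnDataLoss", "false")
          .load())

query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .start())
{code}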



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread mapreduced (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649409#comment-15649409
 ] 

mapreduced commented on SPARK-18371:


I worked through the math for 
[PIDRateEstimator|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/rate/PIDRateEstimator.scala]
 and found that long scheduling delays increase 'historicalError' by a LOT, 
which, even when multiplied by integral (=0.2 by default), produces a large 
negative term in the formula:
val newRate = (latestRate - proportional * error - integral * historicalError - 
derivative * dError).max(minRate)

As a result, minRate (=100 by default) becomes the newRate. When that rate reaches 
[DirectKafkaInputDStream|https://github.com/apache/spark/blob/master/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107]
 , if you have more than 100 partitions across your kafka topics you are almost 
guaranteed to get a Math.round-ed backpressureRate of 0, which then 
[here|https://github.com/apache/spark/blob/master/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114]
 leads to returning None. That in turn falls back to 
[leaderOffsets|https://github.com/apache/spark/blob/master/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L149]
 for that batch - hence the giant batch containing every record from the last 
batch up to the head of the queue (leaderOffsets).

Proposed solution: not sure - maybe minRate could default to the total number of 
partitions across all your kafka topics plus some constant. Open to suggestions 
for changes to PIDRateEstimator itself.

Here's the math I did from the example in Look_at_batch_at_22_14.png : 

*Run1* :

latestRate = -1
latestTime = -1
latestError = -1
time = 1478587297000
processingDelay = 342000
delaySinceUpdate = 1478587298

processingRate = (10800000/342000) * 1000 = 31578.94

error = -1 - 31579 = -31580
historicalError = 8 * 31579 / 60000 = 4.21
dError = (-31580 - 4.21) / 1478587298 = -2.136107254E-5

newRate = (-1 - (1*-31580) - (0.2*4.21) - (0 * -2.136107254E-5)).max(100) = 
(31578.158).max(100) = 31578.158

latestTime = 1478587297
latestError = 0
latestRate = 31578.94
Returns None - which results in picking up maxRatePerPartition

*Run 2* :
time = 1478587615000
processingDelay = 5.3 * 60 * 1000 = 318000
schedulingDelay = 282000

delaySinceUpdate = (1478587615000 - 1478587297000) / 1000 = 318
processingRate = 10800000/318000 * 1000 = 33962.2
error = 31578 - 33962 = -2384
historicalError = 282000 * 33962 / 60000 = 159621.4

dError = doesn't matter since it is multiplied by 0

newRate = (31578 - (1*-2384) - (0.2*159621) - (0 * dError)).max(100) = 
(2037.72).max(100) = 2037.72

latestRate = 2037.72
latestError = -2384
latestTime = 1478587615000
Returns newRate = 2037.72

*Run 3* :
totalLag = 1972830183
perpartition lag = 6576100.61
backpressureRate = 126000 - 129000

time = 1478587795000
delaySinceUpdate = (1478587795000 - 1478587615000) / 1000 = 180

processingRate = 10800000/180000 * 1000 = 60000
error = 2037 - 60000 = -57963
historicalError = 540000 * 60000 / 60000 = 540000
dError = doesn't matter

newRate = (2037.72 - (1*-57963) - (0.2*540000) - 0).max(100) = (-48000).max(100) 
= 100

latestTime = 1478587795000
latestRate = 100
latestError = -62384

Returns newRate = 100
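
The arithmetic above can be reproduced with a small standalone sketch of the
PID formula. The 60000 ms batch interval and roughly 10.8M records per batch
are inferred from the numbers above; proportional=1.0, integral=0.2,
derivative=0.0 and minRate=100 are the defaults:

{code}
# Standalone reproduction of the PIDRateEstimator arithmetic discussed above.
PROPORTIONAL, INTEGRAL, DERIVATIVE, MIN_RATE = 1.0, 0.2, 0.0, 100.0
BATCH_INTERVAL_MS = 60000.0

def new_rate(latest_rate, latest_error, delay_since_update_s,
             num_records, processing_delay_ms, scheduling_delay_ms):
    processing_rate = num_records / processing_delay_ms * 1000.0
    error = latest_rate - processing_rate
    historical_error = scheduling_delay_ms * processing_rate / BATCH_INTERVAL_MS
    d_error = (error - latest_error) / delay_since_update_s
    rate = (latest_rate
            - PROPORTIONAL * error
            - INTEGRAL * historical_error
            - DERIVATIVE * d_error)
    return max(rate, MIN_RATE)

# Run 2: ~10.8M records, 318 s processing, 282 s scheduling delay -> ~2037
print(new_rate(31578.94, 0.0, 318.0, 10800000, 318000.0, 282000.0))
# Run 3: the large scheduling delay drives the rate down to minRate -> 100.0
print(new_rate(2037.72, -2384.0, 180.0, 10800000, 180000.0, 540000.0))
{code}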

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2016-11-08 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649405#comment-15649405
 ] 

Bryan Cutler commented on SPARK-13534:
--

I've been working on this with [~xusen].  We have a very rough WIP branch that 
I'll link here in case others want to pitch in or review while we are working 
out the kinks.

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649397#comment-15649397
 ] 

mingjie tang commented on SPARK-18372:
--

Solution: 
This bug was reported by customers.
The cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive 
classes (org.apache.hadoop.hive.*) to create the staging directory. By default, 
on the Hive side, this staging directory is removed once the Hive session 
expires. However, Spark never notifies Hive to remove the staging files.
Thus, following the Spark 2.0.x code, I wrote a function inside 
InsertIntoHiveTable that creates the .hive-staging directory so that it is 
removed when the Spark session expires.
This change has been tested against Spark 1.5.2 and Spark 1.6.3, and the pull 
request is: 

For the test, I manually checked that no .hive-staging files remain under the 
table's directory after the spark-shell is closed. 
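
For reference, a minimal PySpark reproduction sketch of the leftover-directory
behaviour, roughly along the lines of the test-hive-staging-cleanup.py script
mentioned above (table and column names are made up; {{sqlContext}} is assumed
to be a HiveContext):

{code}
# Insert into a Hive table through HiveContext, then inspect the warehouse
# directory for leftover .hive-staging_* folders, e.g. with:
#   hdfs dfs -ls <warehouse-dir>/staging_demo
sqlContext.sql("CREATE TABLE IF NOT EXISTS staging_demo (id BIGINT, name STRING)")

df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
df.write.insertInto("staging_demo")

# After the shell/application exits, the .hive-staging_* directories created
# by this insert remain under the table's location instead of being removed --
# the behaviour this issue describes.
{code}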

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
>  
> hive.exec.stagingdir 
> $ {hive.exec.scratchdir}
> /$
> {user.name}
> /.staging 
> 
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mingjie tang updated SPARK-18372:
-
Description: 
Steps to reproduce:

1. Launch spark-shell 
2. Run the following scala code via Spark-Shell 
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
clientid string, querytime string, market string, deviceplatform string, 
devicemake string, devicemodel string, state string, country string, 
querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
hivesampletablecopypydfdf.show
3. in HDFS (in our case, WASB), we can see the following folders 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
the issue is that these don't get cleaned up and get accumulated
=
with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" 
in hive-site.xml - didn't make any difference.
.hive-staging folders are created under the  folder - 
hive/warehouse/hivesampletablecopypy/
we have tried adding this property to hive-site.xml and restarting the components -
<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>

a new .hive-staging folder was created in hive/warehouse/ folder
moreover, please understand that if we run the hive query in pure Hive via Hive 
CLI on the same Spark cluster, we don't see the behavior
so it doesn't appear to be a Hive issue/behavior in this case- this is a spark 
behavior
I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
configuration already
The issue happens via Spark-submit as well - customer used the following 
command to reproduce this -
spark-submit test-hive-staging-cleanup.py


  was:
Steps to reproduce:

1. Launch spark-shell 
2. Run the following scala code via Spark-Shell 
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
clientid string, querytime string, market string, deviceplatform string, 
devicemake string, devicemodel string, state string, country string, 
querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
hivesampletablecopypydfdf.show
3. in HDFS (in our case, WASB), we can see the following folders 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
the issue is that these don't get cleaned up and get accumulated
=
with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" 
in hive-site.xml - didn't make any difference.
.hive-staging folders are created under the  folder - 
hive/warehouse/hivesampletablecopypy/
we have tried adding this property to hive-site.xml and restart the components -
 
hive.exec.stagingdir 
$ {hive.exec.scratchdir}
/$
{user.name}
/.staging 

a new .hive-staging folder was created in hive/warehouse/ folder
moreover, please understand that if we run the hive query in pure Hive via Hive 
CLI on the same Spark cluster, we don't see the behavior
so it doesn't appear to be a Hive issue/behavior in this case- this is a spark 
behavior
I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
configuration already
The issue happens via Spark-submit as well - customer used the following 
command to reproduce this -
spark-submit test-hive-staging-cleanup.py

Solution: 
This bug is reported by customers.
The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class of 
(org.apache.hadoop.hive.) to create the staging directory. Default, from the 
hive side, this staging file would be removed after the hive session is 

[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649387#comment-15649387
 ] 

Apache Spark commented on SPARK-18372:
--

User 'merlintang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15819

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
>  
> hive.exec.stagingdir 
> $ {hive.exec.scratchdir}
> /$
> {user.name}
> /.staging 
> 
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py
> Solution: 
> This bug is reported by customers.
> The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class 
> of (org.apache.hadoop.hive.) to create the staging directory. Default, from 
> the hive side, this staging file would be removed after the hive session is 
> expired. However, spark fail to notify the hive to remove the staging files.
> Thus, follow the code of spark 2.0.x, I just write one function inside the 
> InsertIntoHiveTable to create the .staging directory, then, after the session 
> expired of spark, this .staging directory would be removed.
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push 
> request is : 
> For the test, I have manually checking .staging files from table belong 
> directory after the spark shell close. meanwhile, please advise how to write 
> the test case? because the directory for the related tables can not get.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18372:


Assignee: (was: Apache Spark)

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
>  
> hive.exec.stagingdir 
> $ {hive.exec.scratchdir}
> /$
> {user.name}
> /.staging 
> 
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py
> Solution: 
> This bug is reported by customers.
> The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class 
> of (org.apache.hadoop.hive.) to create the staging directory. Default, from 
> the hive side, this staging file would be removed after the hive session is 
> expired. However, spark fail to notify the hive to remove the staging files.
> Thus, follow the code of spark 2.0.x, I just write one function inside the 
> InsertIntoHiveTable to create the .staging directory, then, after the session 
> expired of spark, this .staging directory would be removed.
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push 
> request is : 
> For the test, I have manually checking .staging files from table belong 
> directory after the spark shell close. meanwhile, please advise how to write 
> the test case? because the directory for the related tables can not get.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18372:


Assignee: Apache Spark

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
>Assignee: Apache Spark
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
>  
> hive.exec.stagingdir 
> $ {hive.exec.scratchdir}
> /$
> {user.name}
> /.staging 
> 
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py
> Solution: 
> This bug is reported by customers.
> The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class 
> of (org.apache.hadoop.hive.) to create the staging directory. Default, from 
> the hive side, this staging file would be removed after the hive session is 
> expired. However, spark fail to notify the hive to remove the staging files.
> Thus, follow the code of spark 2.0.x, I just write one function inside the 
> InsertIntoHiveTable to create the .staging directory, then, after the session 
> expired of spark, this .staging directory would be removed.
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push 
> request is : 
> For the test, I have manually checking .staging files from table belong 
> directory after the spark shell close. meanwhile, please advise how to write 
> the test case? because the directory for the related tables can not get.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649386#comment-15649386
 ] 

mingjie tang commented on SPARK-18372:
--

the PR is https://github.com/apache/spark/pull/15819

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 2.0.1
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restart the 
> components -
>  
> hive.exec.stagingdir 
> $ {hive.exec.scratchdir}
> /$
> {user.name}
> /.staging 
> 
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py
> Solution: 
> This bug is reported by customers.
> The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class 
> of (org.apache.hadoop.hive.) to create the staging directory. Default, from 
> the hive side, this staging file would be removed after the hive session is 
> expired. However, spark fail to notify the hive to remove the staging files.
> Thus, follow the code of spark 2.0.x, I just write one function inside the 
> InsertIntoHiveTable to create the .staging directory, then, after the session 
> expired of spark, this .staging directory would be removed.
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push 
> request is : 
> For the test, I have manually checking .staging files from table belong 
> directory after the spark shell close. meanwhile, please advise how to write 
> the test case? because the directory for the related tables can not get.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2016-11-08 Thread mingjie tang (JIRA)
mingjie tang created SPARK-18372:


 Summary: .Hive-staging folders created from Spark hiveContext are 
not getting cleaned up
 Key: SPARK-18372
 URL: https://issues.apache.org/jira/browse/SPARK-18372
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2, 1.5.2, 1.6.3
 Environment: spark standalone and spark yarn 
Reporter: mingjie tang
 Fix For: 2.0.1


Steps to reproduce:

1. Launch spark-shell 
2. Run the following scala code via Spark-Shell 
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
clientid string, querytime string, market string, deviceplatform string, 
devicemake string, devicemodel string, state string, country string, 
querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
hivesampletablecopypydfdf.show
3. in HDFS (in our case, WASB), we can see the following folders 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
the issue is that these don't get cleaned up and get accumulated
=
with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" 
in hive-site.xml - didn't make any difference.
.hive-staging folders are created under the  folder - 
hive/warehouse/hivesampletablecopypy/
we have tried adding this property to hive-site.xml and restarting the components -
<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>

a new .hive-staging folder was created in hive/warehouse/ folder
moreover, please understand that if we run the hive query in pure Hive via Hive 
CLI on the same Spark cluster, we don't see the behavior
so it doesn't appear to be a Hive issue/behavior in this case- this is a spark 
behavior
I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
configuration already
The issue happens via Spark-submit as well - customer used the following 
command to reproduce this -
spark-submit test-hive-staging-cleanup.py

Solution: 
This bug was reported by customers.
The cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive 
classes (org.apache.hadoop.hive.*) to create the staging directory. By default, 
on the Hive side, this staging directory is removed once the Hive session 
expires. However, Spark never notifies Hive to remove the staging files.
Thus, following the Spark 2.0.x code, I wrote a function inside 
InsertIntoHiveTable that creates the .hive-staging directory so that it is 
removed when the Spark session expires.
This change has been tested against Spark 1.5.2 and Spark 1.6.3, and the pull 
request is: 

For the test, I manually checked that no .hive-staging files remain under the 
table's directory after the spark-shell is closed. Meanwhile, please advise how 
to write a proper test case, since the table's directory cannot easily be 
obtained in a test.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18359) No option in read csv for other decimal delimiter than dot

2016-11-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649352#comment-15649352
 ] 

Hyukjin Kwon commented on SPARK-18359:
--

I guess that should be locale-specific. We recently swept the locale to 
{{Locale.US}}. Maybe we need an option to specify the locale. This is also 
loosely related to SPARK-18350.
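
Until such an option exists, one workaround is to read the affected column as a
string and convert the decimal comma explicitly. A PySpark sketch (the file
path, the {{price}} column and the ";" field separator are assumptions):

{code}
from pyspark.sql.functions import regexp_replace, col

# Read the column as a string, swap the decimal comma for a dot, then cast.
raw = spark.read.csv("/data/prices.csv", header=True, sep=";")
fixed = raw.withColumn(
    "price",
    regexp_replace(col("price"), ",", ".").cast("double"))
fixed.printSchema()
{code}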

> No option in read csv for other decimal delimiter than dot
> --
>
> Key: SPARK-18359
> URL: https://issues.apache.org/jira/browse/SPARK-18359
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: yannick Radji
>
> On the DataFrameReader object there is no CSV-specific option to set the 
> decimal delimiter to a comma instead of a dot, as is customary in France and 
> elsewhere in Europe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14914) Test Cases fail on Windows

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649344#comment-15649344
 ] 

Apache Spark commented on SPARK-14914:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/15818

> Test Cases fail on Windows
> --
>
> Key: SPARK-14914
> URL: https://issues.apache.org/jira/browse/SPARK-14914
> Project: Spark
>  Issue Type: Test
>Reporter: Tao LI
>Assignee: Tao LI
> Fix For: 2.1.0
>
>
> There are lots of test failures on Windows, mainly for the following reasons:
> 1. Resources (e.g., temp files) haven't been properly closed after use, so an 
> IOException is raised while deleting the temp directory. 
> 2. Command lines that are too long on Windows. 
> 3. File path problems caused by the drive label and backslash in absolute 
> Windows paths. 
> 4. Plain Java bugs on Windows:
>a. setReadable doesn't work on Windows for directories. 
>b. setWritable doesn't work on Windows. 
>c. Memory-mapped files can't be read at the same time. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans

2016-11-08 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649345#comment-15649345
 ] 

Sandeep Singh commented on SPARK-18369:
---

I think it's already deprecated: 
https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L321-L322
https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L347


> Deprecate runs in Pyspark mllib KMeans
> --
>
> Key: SPARK-18369
> URL: https://issues.apache.org/jira/browse/SPARK-18369
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We should deprecate runs in pyspark mllib kmeans algo as we have done in 
> Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread mapreduced (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mapreduced updated SPARK-18371:
---
Attachment: GiantBatch3.png

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread mapreduced (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mapreduced updated SPARK-18371:
---
Attachment: Giant_batch_at_23_00.png

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, GiantBatch3.png, 
> Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread mapreduced (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mapreduced updated SPARK-18371:
---
Attachment: Look_at_batch_at_22_14.png

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread mapreduced (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mapreduced updated SPARK-18371:
---
Attachment: GiantBatch2.png

> Spark Streaming backpressure bug - generates a batch with large number of 
> records
> -
>
> Key: SPARK-18371
> URL: https://issues.apache.org/jira/browse/SPARK-18371
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: mapreduced
> Attachments: GiantBatch2.png, Look_at_batch_at_22_14.png
>
>
> When the streaming job is configured with backpressureEnabled=true, it 
> generates a GIANT batch of records if the processing time + scheduled delay 
> is (much) larger than batchDuration. This creates a backlog of records like 
> no other and results in batches queueing for hours until it chews through 
> this giant batch.
> Expectation is that it should reduce the number of records per batch in some 
> time to whatever it can really process.
> Attaching some screen shots where it seems that this issue is quite easily 
> reproducible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records

2016-11-08 Thread mapreduced (JIRA)
mapreduced created SPARK-18371:
--

 Summary: Spark Streaming backpressure bug - generates a batch with 
large number of records
 Key: SPARK-18371
 URL: https://issues.apache.org/jira/browse/SPARK-18371
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.0
Reporter: mapreduced


When the streaming job is configured with backpressureEnabled=true, it 
generates a GIANT batch of records if the processing time + scheduling delay is 
(much) larger than batchDuration. This creates an enormous backlog of records 
and results in batches queueing for hours until it chews through this giant 
batch.
The expectation is that it should, within some time, reduce the number of 
records per batch to whatever it can really process.
Attaching some screenshots; the issue seems quite easy to reproduce.
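
For reference, the configuration knobs involved here, set from PySpark (values
are examples only; spark.streaming.backpressure.pid.minRate is the lower bound
involved in the giant-batch scenario):

{code}
from pyspark import SparkConf

# Backpressure-related settings for a DStreams + Kafka job (illustrative values).
conf = (SparkConf()
        .setAppName("backpressure-demo")
        .set("spark.streaming.backpressure.enabled", "true")
        # Cap used for the first batch and whenever no rate estimate exists yet.
        .set("spark.streaming.kafka.maxRatePerPartition", "1000")
        # Lower bound the PID estimator can fall back to; with many partitions
        # a low value can round down to 0 records per partition.
        .set("spark.streaming.backpressure.pid.minRate", "100"))
{code}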



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18366:


Assignee: (was: Apache Spark)

> Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
> ---
>
> Key: SPARK-18366
> URL: https://issues.apache.org/jira/browse/SPARK-18366
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We should add the new {{handleInvalid}} param for these transformers to 
> Python to maintain API parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649319#comment-15649319
 ] 

Apache Spark commented on SPARK-18366:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/15817

> Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
> ---
>
> Key: SPARK-18366
> URL: https://issues.apache.org/jira/browse/SPARK-18366
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We should add the new {{handleInvalid}} param for these transformers to 
> Python to maintain API parity.
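
Until the param is exposed, a PySpark sketch of the current state: Bucketizer
without {{handleInvalid}}, with an explicit drop of invalid rows as a stopgap
(column names are made up; {{spark}} is an existing SparkSession):

{code}
from pyspark.ml.feature import Bucketizer

df = spark.createDataFrame([(0.1,), (2.5,), (None,)], ["features_raw"])

# The Scala transformers already take a handleInvalid param (see description
# above); until it is exposed in PySpark, invalid rows are dropped by hand.
clean = df.na.drop(subset=["features_raw"])

bucketizer = Bucketizer(
    splits=[-float("inf"), 1.0, float("inf")],
    inputCol="features_raw",
    outputCol="features_bucket")
bucketizer.transform(clean).show()
{code}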



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18366:


Assignee: Apache Spark

> Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
> ---
>
> Key: SPARK-18366
> URL: https://issues.apache.org/jira/browse/SPARK-18366
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> We should add the new {{handleInvalid}} param for these transformers to 
> Python to maintain API parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18370) InsertIntoHadoopFsRelationCommand should keep track of its table

2016-11-08 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-18370:
-

 Summary: InsertIntoHadoopFsRelationCommand should keep track of 
its table
 Key: SPARK-18370
 URL: https://issues.apache.org/jira/browse/SPARK-18370
 Project: Spark
  Issue Type: Bug
Reporter: Herman van Hovell
Assignee: Herman van Hovell
Priority: Minor


When we plan an {{InsertIntoHadoopFsRelationCommand}} we drop the {{Table}} 
name. This is quite annoying when debugging plans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18239) Gradient Boosted Tree wrapper in SparkR

2016-11-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18239.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
  2.1.0
Target Version/s:   (was: 2.1.0)

> Gradient Boosted Tree wrapper in SparkR
> ---
>
> Key: SPARK-18239
> URL: https://issues.apache.org/jira/browse/SPARK-18239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.0, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend

2016-11-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649200#comment-15649200
 ] 

Felix Cheung commented on SPARK-18131:
--

We discussed this as a part of the GBT PR, from here 
https://github.com/apache/spark/pull/15746#issuecomment-259217671
"
The best sparse vector support in R comes from the Matrix package - But its a 
big package and I dont think we should add that as a dependency. We could try 
to do a wrapper where if the user already has the package installed we return 
it using Matrix ?
"
I think this is a good idea to do in a general sense for any SparseVector in 
the SerDe code.

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector on the 
> backend (Scala side). When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector that can be 
> read in SparkR. It is a follow-up JIRA to adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18320) ML 2.1 QA: API: Python API coverage

2016-11-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649171#comment-15649171
 ] 

Seth Hendrickson edited comment on SPARK-18320 at 11/8/16 11:23 PM:


I scanned through the {{@Since("2.1.0")}} tags in ml/mllib. The major things 
that were added were LSH and clustering summaries, which are linked and have 
JIRAs. I made JIRAs for a couple other minor things as well and linked them.


was (Author: sethah):
I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things 
that were added were LSH and clustering summaries, which are linked and have 
JIRAs. I made JIRAs for a couple other minor things as well and linked them.

> ML 2.1 QA: API: Python API coverage
> ---
>
> Key: SPARK-18320
> URL: https://issues.apache.org/jira/browse/SPARK-18320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18320) ML 2.1 QA: API: Python API coverage

2016-11-08 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649171#comment-15649171
 ] 

Seth Hendrickson commented on SPARK-18320:
--

I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things 
that were added were LSH and clustering summaries, which are linked and have 
JIRAs. I made JIRAs for a couple other minor things as well and linked them.

> ML 2.1 QA: API: Python API coverage
> ---
>
> Key: SPARK-18320
> URL: https://issues.apache.org/jira/browse/SPARK-18320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans

2016-11-08 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18369:


 Summary: Deprecate runs in Pyspark mllib KMeans
 Key: SPARK-18369
 URL: https://issues.apache.org/jira/browse/SPARK-18369
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Seth Hendrickson
Priority: Minor


We should deprecate {{runs}} in the PySpark MLlib KMeans algorithm, as we have already done in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649132#comment-15649132
 ] 

Nicholas Chammas commented on SPARK-18367:
--

On 2.0.x the caching is required due to SPARK-18254, which is fixed in 2.1+.

On master I get the same "Too many open files in system" error without the 
caching if I remove the limit().

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail

2016-11-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18342.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Issue resolved by pull request 15804
[https://github.com/apache/spark/pull/15804]

> HDFSBackedStateStore can fail to rename files causing snapshotting and 
> recovery to fail
> ---
>
> Key: SPARK-18342
> URL: https://issues.apache.org/jira/browse/SPARK-18342
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Priority: Critical
> Fix For: 2.0.2, 2.1.0
>
>
> The HDFSBackedStateStore renames temporary files to delta files as it commits 
> new versions. It however doesn't check whether the rename succeeded. If the 
> rename fails, then recovery will not be possible. It should fail during the 
> rename stage.
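
For context, the kind of check described above looks roughly like the sketch 
below (the method and variable names are illustrative, not the actual 
HDFSBackedStateStore code):

{code}
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: fail fast when the commit rename does not succeed, instead
// of continuing with a missing delta file and failing later during recovery.
def commitDelta(fs: FileSystem, tempDeltaFile: Path, deltaFile: Path): Unit = {
  if (!fs.rename(tempDeltaFile, deltaFile)) {
    throw new IOException(s"Failed to rename $tempDeltaFile to $deltaFile")
  }
}
{code}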



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649110#comment-15649110
 ] 

Herman van Hovell commented on SPARK-18367:
---

Could you try this without caching?

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18368) Regular expression replace throws NullPointerException when serialized

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18368:


Assignee: (was: Apache Spark)

> Regular expression replace throws NullPointerException when serialized
> --
>
> Key: SPARK-18368
> URL: https://issues.apache.org/jira/browse/SPARK-18368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ryan Blue
>
> This query fails with a [NullPointerException on line 
> 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:
> {code}
> SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS 
> (rank, col0) FROM table;
> {code}
> The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
> after it is instantiated. The null value is a transient StringBuffer that 
> should hold the result. The fix is to make the result value lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18368) Regular expression replace throws NullPointerException when serialized

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18368:


Assignee: Apache Spark

> Regular expression replace throws NullPointerException when serialized
> --
>
> Key: SPARK-18368
> URL: https://issues.apache.org/jira/browse/SPARK-18368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> This query fails with a [NullPointerException on line 
> 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:
> {code}
> SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS 
> (rank, col0) FROM table;
> {code}
> The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
> after it is instantiated. The null value is a transient StringBuffer that 
> should hold the result. The fix is to make the result value lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18368) Regular expression replace throws NullPointerException when serialized

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649104#comment-15649104
 ] 

Apache Spark commented on SPARK-18368:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/15816

> Regular expression replace throws NullPointerException when serialized
> --
>
> Key: SPARK-18368
> URL: https://issues.apache.org/jira/browse/SPARK-18368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ryan Blue
>
> This query fails with a [NullPointerException on line 
> 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:
> {code}
> SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS 
> (rank, col0) FROM table;
> {code}
> The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
> after it is instantiated. The null value is a transient StringBuffer that 
> should hold the result. The fix is to make the result value lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18368) Regular expression replace throws NullPointerException when serialized

2016-11-08 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-18368:
--
Description: 
This query fails with a [NullPointerException on line 
247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:

{code}
SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS (rank, 
col0) FROM table;
{code}

The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
after it is instantiated. The null value is a transient StringBuffer that 
should hold the result. The fix is to make the result value lazy.

  was:
This query fails with a [NullPointerException on line 
247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:

{code}
SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS (rank, 
col0)
FROM table;
{code}

The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
after it is instantiated. The null value is a transient StringBuffer that 
should hold the result. The fix is to make the result value lazy.


> Regular expression replace throws NullPointerException when serialized
> --
>
> Key: SPARK-18368
> URL: https://issues.apache.org/jira/browse/SPARK-18368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ryan Blue
>
> This query fails with a [NullPointerException on line 
> 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:
> {code}
> SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS 
> (rank, col0) FROM table;
> {code}
> The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
> after it is instantiated. The null value is a transient StringBuffer that 
> should hold the result. The fix is to make the result value lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2016-11-08 Thread Taro L. Saito (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649090#comment-15649090
 ] 

Taro L. Saito commented on SPARK-14540:
---

I'm also hitting a similar problem in my dependency injection library for 
Scala: https://github.com/wvlet/airframe/pull/39. 
I feel that ClosureCleaner-like functionality is necessary in Scala 2.12 itself. 

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18368) Regular expression replace throws NullPointerException when serialized

2016-11-08 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-18368:
-

 Summary: Regular expression replace throws NullPointerException 
when serialized
 Key: SPARK-18368
 URL: https://issues.apache.org/jira/browse/SPARK-18368
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.1.0
Reporter: Ryan Blue


This query fails with a [NullPointerException on line 
247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]:

{code}
SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS (rank, 
col0)
FROM table;
{code}

The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized 
after it is instantiated. The null value is a transient StringBuffer that 
should hold the result. The fix is to make the result value lazy.
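
A minimal sketch of the lazy-initialization pattern being described (the class 
and method names here are hypothetical, not the actual RegExpReplace code):

{code}
// Sketch: a transient buffer that is re-created lazily after deserialization
// instead of being left null by Java serialization.
class ReplaceLike extends Serializable {
  @transient private lazy val result: StringBuffer = new StringBuffer

  def run(s: String): String = {
    result.setLength(0)   // safe even after serialization: built on first access
    result.append(s)
    result.toString
  }
}
{code}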



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649063#comment-15649063
 ] 

Nicholas Chammas commented on SPARK-18367:
--

I'm not trying to write any files actually. In this case I'm just running 
show() to trigger the error. I've tried coalescing partitions right before the 
show(), down to 5 even, but that doesn't seem to help. 

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-08 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649039#comment-15649039
 ] 

Eric Liang commented on SPARK-17916:


We're hitting this as a regression from 2.0 as well.

Ideally, we don't want the empty string to be treated specially in any 
scenario. The only logic that converts a value to null should be driven by the 
nullValue option.
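
To make the expected behavior concrete, here is a small sketch (it assumes an 
existing SparkSession {{spark}} and a CSV file shaped like the one in the issue 
description; the path is a placeholder):

{code}
// Only the configured nullValue token ("-") should come back as null;
// the empty string in row 2 should stay an empty string.
val df = spark.read
  .option("header", "true")
  .option("nullValue", "-")
  .csv("/path/to/data.csv")   // hypothetical path

df.show()
{code}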

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649022#comment-15649022
 ] 

Herman van Hovell commented on SPARK-18367:
---

You might be trying to write a lot of partitions (files) here (given the 
error). The global limit will reduce the number of partitions to one. Could you 
try to coalesce partitions before writing?
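
For reference, that suggestion amounts to something like the following sketch 
(the DataFrame, partition count, and output path are placeholders):

{code}
// Illustrative only: reduce the number of output partitions before writing,
// which also caps the number of files/handles opened by the write.
df.coalesce(10)
  .write
  .orc("/path/to/output")   // hypothetical output path
{code}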

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18226) SparkR displaying vector columns in incorrect way

2016-11-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649026#comment-15649026
 ] 

Felix Cheung commented on SPARK-18226:
--

We discussed this as a part of the GBT PR, from here 
https://github.com/apache/spark/pull/15746#issuecomment-259217671

"
The best sparse vector support in R comes from the Matrix package - But its a 
big package and I dont think we should add that as a dependency. We could try 
to do a wrapper where if the user already has the package installed we return 
it using Matrix ?
"

I think this is a good idea to do in a general sense for any SparseVector in 
the SerDe code.


> SparkR displaying vector columns in incorrect way
> -
>
> Key: SPARK-18226
> URL: https://issues.apache.org/jira/browse/SPARK-18226
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>
> I have encountered a problem with SparkR presenting Spark vectors from the 
> org.apache.spark.mllib.linalg package:
> * `head(df)` shows in the vector column: ""
> * cast to string does not work as expected, it shows: 
> "[1,null,null,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@79f50a91]"
> * `showDF(df)` works correctly
> To reproduce, start SparkR and paste the following code (example taken from 
> https://spark.apache.org/docs/latest/sparkr.html#naive-bayes-model):
> {code}
> # Fit a Bernoulli naive Bayes model with spark.naiveBayes
> titanic <- as.data.frame(Titanic)
> titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
> nbDF <- titanicDF
> nbTestDF <- titanicDF
> nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)
> # Model summary
> summary(nbModel)
> # Prediction
> nbPredictions <- predict(nbModel, nbTestDF)
> #
> # My modification to expose the problem #
> nbPredictions$rawPrediction_str <- cast(nbPredictions$rawPrediction, "string")
> head(nbPredictions)
> showDF(nbPredictions)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18367:
-
Description: 
I have a complex DataFrame query that fails to run normally but succeeds if I 
add a dummy {{limit()}} upstream in the query tree.

The failure presents itself like this:

{code}
ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes 
to file 
/private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
java.io.FileNotFoundException: 
/private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
 (Too many open files in system)
{code}

My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
macOS. However, I don't think that's the issue, since if I add a dummy 
{{limit()}} early on the query tree -- dummy as in it does not actually reduce 
the number of rows queried -- then the same query works.

I've diffed the physical query plans to see what this {{limit()}} is actually 
doing, and the diff is as follows:

{code}
diff plan-with-limit.txt plan-without-limit.txt
24,28c24
<:  : : +- *GlobalLimit 100
<:  : :+- Exchange SinglePartition
<:  : :   +- *LocalLimit 100
<:  : :  +- *Project [...]
<:  : : +- *Scan orc [...] 
Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<...
---
>:  : : +- *Scan orc [...] Format: ORC, 
> InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<...
49,53c45
<   : : +- *GlobalLimit 100
<   : :+- Exchange SinglePartition
<   : :   +- *LocalLimit 100
<   : :  +- *Project [...]
<   : : +- *Scan orc [...] 
Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<...
---
>   : : +- *Scan orc [] Format: ORC, 
> InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<...
{code}

Does this give any clues as to why this {{limit()}} is helping? Again, the 
100 limit you can see in the plan is much higher than the cardinality of 
the dataset I'm reading, so there is no theoretical impact on the output. You 
can see the full query plans attached to this ticket.

Unfortunately, I don't have a minimal reproduction for this issue, but I can 
work towards one with some clues.

I'm seeing this behavior on 2.0.1 and on master at commit 
{{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.

  was:
I have a complex DataFrame query that fails to run normally but succeeds if I 
add a dummy {{limit()}} upstream in the query tree.

The failure presents itself like this:

{code}
ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes 
to file 
/private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
java.io.FileNotFoundException: 
/private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
 (Too many open files in system)
{code}

My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
macOS. However, I don't think that's the issue, since if I add a dummy 
{{limit()}} early on the query tree -- dummy as in it does not actually reduce 
the number of rows queried -- then the same query works.

I've diffed the physical query plans to see what this {{limit()}} is actually 
doing, and the diff is as follows:

{code}
diff plan-with-limit.txt plan-without-limit.txt
24,28c24
<:  : : +- *GlobalLimit 100
<:  : :+- Exchange SinglePartition
<:  : :   +- *LocalLimit 100
<:  : :  +- *Project [...]
<:  : : +- *Scan orc [...] 
Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<...
---
>:  : : +- *Scan orc [...] Format: ORC, 
> InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<...
49,53c45
<   : : +- *GlobalLimit 100
<   

[jira] [Updated] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18367:
-
Attachment: plan-without-limit.txt
plan-with-limit.txt

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18367:


 Summary: limit() makes the lame walk again
 Key: SPARK-18367
 URL: https://issues.apache.org/jira/browse/SPARK-18367
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.1.0
 Environment: Python 3.5, Java 8
Reporter: Nicholas Chammas


I have a complex DataFrame query that fails to run normally but succeeds if I 
add a dummy {{limit()}} upstream in the query tree.

The failure presents itself like this:

{code}
ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes 
to file 
/private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
java.io.FileNotFoundException: 
/private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
 (Too many open files in system)
{code}

My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
macOS. However, I don't think that's the issue, since if I add a dummy 
{{limit()}} early on the query tree -- dummy as in it does not actually reduce 
the number of rows queried -- then the same query works.

I've diffed the physical query plans to see what this {{limit()}} is actually 
doing, and the diff is as follows:

{code}
diff plan-with-limit.txt plan-without-limit.txt
24,28c24
<:  : : +- *GlobalLimit 100
<:  : :+- Exchange SinglePartition
<:  : :   +- *LocalLimit 100
<:  : :  +- *Project [...]
<:  : : +- *Scan orc [...] 
Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<...
---
>:  : : +- *Scan orc [...] Format: ORC, 
> InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<...
49,53c45
<   : : +- *GlobalLimit 100
<   : :+- Exchange SinglePartition
<   : :   +- *LocalLimit 100
<   : :  +- *Project [...]
<   : : +- *Scan orc [...] 
Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<...
---
>   : : +- *Scan orc [] Format: ORC, 
> InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<...
{code}

Does this give any clues as to why this {{limit()}} is helping? Again, the 
100 limit you can see in the plan is much higher than the cardinality of 
the dataset I'm reading, so there is no theoretical impact on the output. You 
can see the full query plans attached to this ticket.

Unfortunately, I don't have a minimal reproduction for this issue, but I can 
work towards one with some clues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-08 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-18366:
-
Component/s: PySpark
 ML

> Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
> ---
>
> Key: SPARK-18366
> URL: https://issues.apache.org/jira/browse/SPARK-18366
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> We should add the new {{handleInvalid}} param for these transformers to 
> Python to maintain API parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-08 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18366:


 Summary: Add handleInvalid to Pyspark for QuantileDiscretizer and 
Bucketizer
 Key: SPARK-18366
 URL: https://issues.apache.org/jira/browse/SPARK-18366
 Project: Spark
  Issue Type: New Feature
Reporter: Seth Hendrickson
Priority: Minor


We should add the new {{handleInvalid}} param for these transformers to Python 
to maintain API parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18339:
-
Labels:   (was: correctness)

> Don't push down current_timestamp for filters in StructuredStreaming
> 
>
> Key: SPARK-18339
> URL: https://issues.apache.org/jira/browse/SPARK-18339
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>
> For the following workflow:
> 1. I have a column called time which is at minute level precision in a 
> Streaming DataFrame
> 2. I want to perform groupBy time, count
> 3. Then I want my MemorySink to only have the last 30 minutes of counts and I 
> perform this by
> {code}
> .where('time >= current_timestamp().cast("long") - 30 * 60)
> {code}
> what happens is that the `filter` gets pushed down before the aggregation, 
> and the filter happens on the source data for the aggregation instead of the 
> result of the aggregation (where I actually want to filter).
> I guess the main issue here is that `current_timestamp` is non-deterministic 
> in the streaming context, so the filter shouldn't be pushed down.
> Whether this requires us to store the `current_timestamp` for each trigger of 
> the streaming job is something to discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2

2016-11-08 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648912#comment-15648912
 ] 

Egor Pahomov commented on SPARK-18336:
--

[~srowen], I've read everything in the documentation about the new memory 
management and tried the following config:
{code}
.set("spark.executor.memory", "28g")
.set("spark.executor.instances", "50")
.set("spark.dynamicAllocation.enabled", "false")
.set("spark.yarn.executor.memoryOverhead", "3000")
.set("spark.executor.cores", "6")
.set("spark.driver.memory", "25g")
.set("spark.driver.cores", "5")
.set("spark.yarn.am.memory", "20g")
.set("spark.shuffle.io.numConnectionsPerPeer", "5")
.set("spark.sql.autoBroadcastJoinThreshold", "30485760")
.set("spark.network.timeout", "4000s")
.set("spark.driver.maxResultSize", "5g")
.set("spark.sql.parquet.compression.codec", "gzip")
.set("spark.kryoserializer.buffer.max", "1200m")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.yarn.driver.memoryOverhead", "1000")
.set("spark.memory.storageFraction", "0.2")
.set("spark.memory.fraction", "0.8")
.set("spark.sql.parquet.cacheMetadata", "false")
.set("spark.scheduler.mode", "FIFO")
.set("spark.sql.broadcastTimeout", "2")
.set("spark.akka.frameSize", "200")
.set("spark.sql.shuffle.partitions", partitions)
.set("spark.network.timeout", "1000s")
{code}


Is there a manual along the lines of "I'm happy with how it was working before, 
can I configure my job to be exactly like old times"? Can I read somewhere what 
happened to the Tungsten project, because it seemed like a big deal and now it's 
turned off by default?

> SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
> 
>
> Key: SPARK-18336
> URL: https://issues.apache.org/jira/browse/SPARK-18336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> I had several (~100) queries, which were run one after another in a single Spark 
> context. I can provide the code of the runner - it's very simple. It worked fine on 
> 1.6.2, then I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 
> on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I 
> separate my 100 queries into 2 sets and run set after set, it works fine. I would 
> suspect problems with memory on the driver, but nothing points to that. 
> My conf: 
> {code}
> lazy val sparkConfTemplate = new SparkConf()
> .setMaster("yarn-client")
> .setAppName(appName)
> .set("spark.executor.memory", "25g")
> .set("spark.executor.instances", "40")
> .set("spark.dynamicAllocation.enabled", "false")
> .set("spark.yarn.executor.memoryOverhead", "3000")
> .set("spark.executor.cores", "6")
> .set("spark.driver.memory", "25g")
> .set("spark.driver.cores", "5")
> .set("spark.yarn.am.memory", "20g")
> .set("spark.shuffle.io.numConnectionsPerPeer", "5")
> .set("spark.sql.autoBroadcastJoinThreshold", "10")
> .set("spark.network.timeout", "4000s")
> .set("spark.driver.maxResultSize", "5g")
> .set("spark.sql.parquet.compression.codec", "gzip")
> .set("spark.kryoserializer.buffer.max", "1200m")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> .set("spark.yarn.driver.memoryOverhead", "1000")
> .set("spark.scheduler.mode", "FIFO")
> .set("spark.sql.broadcastTimeout", "2")
> .set("spark.akka.frameSize", "200")
> .set("spark.sql.shuffle.partitions", partitions)
> .set("spark.network.timeout", "1000s")
> 
> .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath()))
> {code}
> Errors, which started to happen:
> {code}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x64bea8]  
> InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/egor/hs_err_pid17479.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {code}
> {code}
> Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap 
> space
>   at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44)
>   at 

[jira] [Commented] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2016-11-08 Thread Luke Miner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648900#comment-15648900
 ] 

Luke Miner commented on SPARK-18343:


I ran jstack on an executor and on the driver and have attached it. Not sure 
how to read it myself.

> FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
> --
>
> Key: SPARK-18343
> URL: https://issues.apache.org/jira/browse/SPARK-18343
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1
> Hadoop 2.7.1
> Mesos 1.0.1
> Ubuntu 14.04
>Reporter: Luke Miner
>
> I have a driver program where I read data in from Cassandra using Spark, 
> perform some operations, and then write out to JSON on S3. The program 
> runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1.
> However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and 
> spark-cassandra-connector 2.0.0-M3, the program completes in the sense that 
> all the expected files are written to S3, but the program never terminates.
> I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. 
> In both cases I use the default output committer.
> From the thread dump (included below) it seems like it could be waiting on: 
> `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`
> Code snippet:
> {code}
> // get MongoDB oplog operations
> val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
>   .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)
> 
> // replay oplog operations into documents
> val documents = operations
>   .spanBy(op => op.id)
>   .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
>   .filter { case (id, result) => result.isInstanceOf[Document] }
>   .map { case (id, document) => MergedDocument(id = id, document = 
> document
> .asInstanceOf[Document])
>   }
> 
> // write documents to json on s3
> documents
>   .map(document => document.toJson)
>   .coalesce(partitions)
>   .saveAsTextFile(path, classOf[GzipCodec])
> sc.stop()
> {code}
> Thread dump on the driver:
> {code}
> 60  context-cleaner-periodic-gc TIMED_WAITING
> 46  dag-scheduler-event-loopWAITING
> 4389DestroyJavaVM   RUNNABLE
> 12  dispatcher-event-loop-0 WAITING
> 13  dispatcher-event-loop-1 WAITING
> 14  dispatcher-event-loop-2 WAITING
> 15  dispatcher-event-loop-3 WAITING
> 47  driver-revive-threadTIMED_WAITING
> 3   Finalizer   WAITING
> 82  ForkJoinPool-1-worker-17WAITING
> 43  heartbeat-receiver-event-loop-threadTIMED_WAITING
> 93  java-sdk-http-connection-reaper TIMED_WAITING
> 4387java-sdk-progress-listener-callback-thread  WAITING
> 25  map-output-dispatcher-0 WAITING
> 26  map-output-dispatcher-1 WAITING
> 27  map-output-dispatcher-2 WAITING
> 28  map-output-dispatcher-3 WAITING
> 29  map-output-dispatcher-4 WAITING
> 30  map-output-dispatcher-5 WAITING
> 31  map-output-dispatcher-6 WAITING
> 32  map-output-dispatcher-7 WAITING
> 48  MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE
> 44  netty-rpc-env-timeout   TIMED_WAITING
> 92  org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner   WAITING
> 62  pool-19-thread-1TIMED_WAITING
> 2   Reference Handler   WAITING
> 61  Scheduler-1112394071TIMED_WAITING
> 20  shuffle-server-0RUNNABLE
> 55  shuffle-server-0RUNNABLE
> 21  shuffle-server-1RUNNABLE
> 56  shuffle-server-1RUNNABLE
> 22  shuffle-server-2RUNNABLE
> 57  shuffle-server-2RUNNABLE
> 23  shuffle-server-3RUNNABLE
> 58  shuffle-server-3RUNNABLE
> 4   Signal Dispatcher   RUNNABLE
> 59  Spark Context Cleaner   TIMED_WAITING
> 9   SparkListenerBusWAITING
> 35  SparkUI-35-selector-ServerConnectorManager@651d3734/0   RUNNABLE
> 36  SparkUI-36-acceptor-0@467924cb-ServerConnector@3b5eaf92{HTTP/1.1}{0.0.0.0:4040} RUNNABLE
> 37  SparkUI-37-selector-ServerConnectorManager@651d3734/1   RUNNABLE
> 38  SparkUI-38  TIMED_WAITING
> 39  SparkUI-39  TIMED_WAITING
> 40  SparkUI-40  TIMED_WAITING
> 41  SparkUI-41  RUNNABLE
> 42  SparkUI-42  TIMED_WAITING
> 438 task-result-getter-0WAITING
> 450 task-result-getter-1WAITING
> 489 task-result-getter-2WAITING
> 492 task-result-getter-3WAITING
> 75  threadDeathWatcher-2-1  TIMED_WAITING
> 45  Timer-0 WAITING
> {code}
> Thread dump on the executors. It's the same on all of them:
> {code}
> 24  dispatcher-event-loop-0 WAITING
> 25  dispatcher-event-loop-1 WAITING
> 26  

[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements

2016-11-08 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1564#comment-1564
 ] 

Michael Armbrust commented on SPARK-17691:
--

+1

> Add aggregate function to collect list with maximum number of elements
> --
>
> Key: SPARK-17691
> URL: https://issues.apache.org/jira/browse/SPARK-17691
> Project: Spark
>  Issue Type: New Feature
>Reporter: Assaf Mendelson
>Priority: Minor
>
> One of the aggregate functions we have today is collect_list. It is a useful 
> tool for "catch all" aggregations that don't really fit anywhere else.
> The problem with collect_list is that it is unbounded. I would like to see a 
> way to do a collect_list that limits the maximum number of elements.
> The input for this would be the maximum number of elements to keep and the 
> method of choosing them (pick arbitrarily, pick the top N, pick the bottom N).
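
A minimal sketch of the kind of bounded collect being asked for, written as a 
typed Aggregator (my own illustration, not an existing Spark API; the class name 
and "first arrivals" policy are hypothetical):
{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical bounded collect: keeps at most `max` string elements per group,
// picked in arrival order.
class BoundedCollect(max: Int) extends Aggregator[String, Seq[String], Seq[String]] {
  def zero: Seq[String] = Seq.empty
  def reduce(buf: Seq[String], elem: String): Seq[String] =
    if (buf.size < max) buf :+ elem else buf
  def merge(a: Seq[String], b: Seq[String]): Seq[String] = (a ++ b).take(max)
  def finish(buf: Seq[String]): Seq[String] = buf
  def bufferEncoder: Encoder[Seq[String]] = Encoders.kryo[Seq[String]]
  def outputEncoder: Encoder[Seq[String]] = Encoders.kryo[Seq[String]]
}
{code}
Applied via new BoundedCollect(10).toColumn on a grouped Dataset[String], each 
group keeps at most ten elements; the top-N/bottom-N variants would sort inside 
reduce/merge instead of keeping the first arrivals.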



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2016-11-08 Thread Luke Miner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Miner updated SPARK-18343:
---
Description: 
I have a driver program where I read data in from Cassandra using Spark, perform 
some operations, and then write the results out to S3 as JSON. The program runs 
fine when I use Spark 1.6.1 and spark-cassandra-connector 1.6.0-M1.

However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and 
spark-cassandra-connector 2.0.0-M3, the program completes in the sense that all 
the expected files are written to S3, but the program never terminates.

I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. In 
both cases I use the default output committer.

From the thread dump (included below) it seems like it could be waiting on: 
`org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`
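
One way to confirm whether a lingering non-daemon thread is what keeps the JVM 
alive after `sc.stop()` is to dump the surviving threads (a diagnostic sketch of 
my own, not code from this program):
{code}
import scala.collection.JavaConverters._

// Diagnostic sketch: after stopping the context, list any live non-daemon
// threads that would prevent the JVM from exiting on its own.
sc.stop()
Thread.getAllStackTraces.keySet.asScala
  .filter(t => t.isAlive && !t.isDaemon && (t ne Thread.currentThread))
  .foreach(t => println(s"non-daemon thread still alive: ${t.getName} (${t.getState})"))
{code}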

Code snippet:
{code}
// get MongoDB oplog operations
val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
  .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)

// replay oplog operations into documents
val documents = operations
  .spanBy(op => op.id)
  .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
  .filter { case (id, result) => result.isInstanceOf[Document] }
  .map { case (id, document) => MergedDocument(id = id, document = document
.asInstanceOf[Document])
  }

// write documents to json on s3
documents
  .map(document => document.toJson)
  .coalesce(partitions)
  .saveAsTextFile(path, classOf[GzipCodec])
sc.stop()
{code}

Thread dump on the driver:

{code}
60  context-cleaner-periodic-gc TIMED_WAITING
46  dag-scheduler-event-loopWAITING
4389DestroyJavaVM   RUNNABLE
12  dispatcher-event-loop-0 WAITING
13  dispatcher-event-loop-1 WAITING
14  dispatcher-event-loop-2 WAITING
15  dispatcher-event-loop-3 WAITING
47  driver-revive-threadTIMED_WAITING
3   Finalizer   WAITING
82  ForkJoinPool-1-worker-17WAITING
43  heartbeat-receiver-event-loop-threadTIMED_WAITING
93  java-sdk-http-connection-reaper TIMED_WAITING
4387java-sdk-progress-listener-callback-thread  WAITING
25  map-output-dispatcher-0 WAITING
26  map-output-dispatcher-1 WAITING
27  map-output-dispatcher-2 WAITING
28  map-output-dispatcher-3 WAITING
29  map-output-dispatcher-4 WAITING
30  map-output-dispatcher-5 WAITING
31  map-output-dispatcher-6 WAITING
32  map-output-dispatcher-7 WAITING
48  MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE
44  netty-rpc-env-timeout   TIMED_WAITING
92  org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner   WAITING
62  pool-19-thread-1TIMED_WAITING
2   Reference Handler   WAITING
61  Scheduler-1112394071TIMED_WAITING
20  shuffle-server-0RUNNABLE
55  shuffle-server-0RUNNABLE
21  shuffle-server-1RUNNABLE
56  shuffle-server-1RUNNABLE
22  shuffle-server-2RUNNABLE
57  shuffle-server-2RUNNABLE
23  shuffle-server-3RUNNABLE
58  shuffle-server-3RUNNABLE
4   Signal Dispatcher   RUNNABLE
59  Spark Context Cleaner   TIMED_WAITING
9   SparkListenerBusWAITING
35  SparkUI-35-selector-ServerConnectorManager@651d3734/0   RUNNABLE
36  SparkUI-36-acceptor-0@467924cb-ServerConnector@3b5eaf92{HTTP/1.1}{0.0.0.0:4040} RUNNABLE
37  SparkUI-37-selector-ServerConnectorManager@651d3734/1   RUNNABLE
38  SparkUI-38  TIMED_WAITING
39  SparkUI-39  TIMED_WAITING
40  SparkUI-40  TIMED_WAITING
41  SparkUI-41  RUNNABLE
42  SparkUI-42  TIMED_WAITING
438 task-result-getter-0WAITING
450 task-result-getter-1WAITING
489 task-result-getter-2WAITING
492 task-result-getter-3WAITING
75  threadDeathWatcher-2-1  TIMED_WAITING
45  Timer-0 WAITING
{code}

Thread dump on the executors. It's the same on all of them:

{code}
24  dispatcher-event-loop-0 WAITING
25  dispatcher-event-loop-1 WAITING
26  dispatcher-event-loop-2 RUNNABLE
27  dispatcher-event-loop-3 WAITING
39  driver-heartbeater  TIMED_WAITING
3   Finalizer   WAITING
58  java-sdk-http-connection-reaper TIMED_WAITING
75  java-sdk-progress-listener-callback-thread  WAITING
1   mainTIMED_WAITING
33  netty-rpc-env-timeout   TIMED_WAITING
55  org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner   WAITING
59  pool-17-thread-1TIMED_WAITING
2   Reference Handler   WAITING
28  shuffle-client-0RUNNABLE
35  shuffle-client-0RUNNABLE
41  shuffle-client-0RUNNABLE
37  shuffle-server-0RUNNABLE
5   Signal Dispatcher   RUNNABLE
23  threadDeathWatcher-2-1  TIMED_WAITING
{code}

Jstack of an executor:

{code}
ubuntu@ip-10-0-230-88:~$ 

[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Method

2016-11-08 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18365:
--
Summary: Improve Documentation for Sample Method  (was: Documentation for 
Sampling is Incorrect)

> Improve Documentation for Sample Method
> ---
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The parameter documentation is switched.
> PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18365) Documentation for Sampling is Incorrect

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18365:


Assignee: Apache Spark

> Documentation for Sampling is Incorrect
> ---
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>Assignee: Apache Spark
>
> The parameter documentation is switched.
> PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18365) Documentation for Sampling is Incorrect

2016-11-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18365:


Assignee: (was: Apache Spark)

> Documentation for Sampling is Incorrect
> ---
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The parameter documentation is switched.
> PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18365) Documentation for Sampling is Incorrect

2016-11-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648845#comment-15648845
 ] 

Apache Spark commented on SPARK-18365:
--

User 'anabranch' has created a pull request for this issue:
https://github.com/apache/spark/pull/15815

> Documentation for Sampling is Incorrect
> ---
>
> Key: SPARK-18365
> URL: https://issues.apache.org/jira/browse/SPARK-18365
> Project: Spark
>  Issue Type: Bug
>Reporter: Bill Chambers
>
> The parameter documentation is switched.
> PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18365) Documentation for Sampling is Incorrect

2016-11-08 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-18365:
-

 Summary: Documentation for Sampling is Incorrect
 Key: SPARK-18365
 URL: https://issues.apache.org/jira/browse/SPARK-18365
 Project: Spark
  Issue Type: Bug
Reporter: Bill Chambers


The parameter documentation is switched.

PR coming shortly.
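
For reference, a hedged usage sketch of the Scala overload in question (the 
argument meanings below are my own reading, not the corrected documentation 
itself):
{code}
// sample(withReplacement: Boolean, fraction: Double, seed: Long)
// `fraction` is the expected fraction of rows kept, not an exact row count.
val df = spark.range(0, 1000).toDF("id")  // `spark` is an existing SparkSession
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
println(sampled.count())                  // roughly 100, not exactly 100
{code}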



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`

2016-11-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18280.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.1.0
   2.0.3

> Potential deadlock in `StandaloneSchedulerBackend.dead`
> ---
>
> Key: SPARK-18280
> URL: https://issues.apache.org/jira/browse/SPARK-18280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.0
>
>
> "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not 
> call "SparkContext.stop" in the same thread. "SparkContext.stop" will block 
> until all RPC threads exit, if it's called inside a RPC thread, it will be 
> dead-lock.
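
A minimal sketch of the usual way out of this kind of deadlock: hand the stop off 
to a dedicated thread so the RPC thread is free to return (an illustration of the 
pattern, not the actual patch):
{code}
import org.apache.spark.SparkContext

// Illustration only: never call sc.stop() on an RPC thread; run it on a
// separate daemon thread instead, so the calling RPC thread can exit.
def stopContextInNewThread(sc: SparkContext): Unit = {
  val t = new Thread("stop-spark-context") {
    override def run(): Unit = sc.stop()
  }
  t.setDaemon(true)
  t.start()
}
{code}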



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService

2016-11-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648785#comment-15648785
 ] 

Steven Rand commented on SPARK-18364:
-

I can implement this if people think it makes sense.

> expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-1645, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.
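
A rough sketch of the shape this could take, built on the Dropwizard 
MetricRegistry that Spark's metrics system wraps (the gauge name and the count 
lookup below are hypothetical):
{code}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical sketch: expose a shuffle-service gauge through a Dropwizard
// registry. Wiring the registry into Spark's MetricsSystem is the real work
// this issue asks for.
def registeredExecutorCount(): Int = 0  // placeholder for the real lookup

val registry = new MetricRegistry()
registry.register("yarnShuffleService.registeredExecutorsSize",
  new Gauge[Int] { override def getValue: Int = registeredExecutorCount() })
{code}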



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18364) expose metrics for YarnShuffleService

2016-11-08 Thread Steven Rand (JIRA)
Steven Rand created SPARK-18364:
---

 Summary: expose metrics for YarnShuffleService
 Key: SPARK-18364
 URL: https://issues.apache.org/jira/browse/SPARK-18364
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.0.1
Reporter: Steven Rand
Priority: Minor


ExternalShuffleService exposes metrics as of SPARK-16405. However, 
YarnShuffleService does not.

The work of instrumenting ExternalShuffleBlockHandler was already done in 
SPARK-1645, so this JIRA is for creating a MetricsSystem in YarnShuffleService 
similarly to how ExternalShuffleService already does it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16215) Reduce runtime overhead of a program that writes a primitive array in Dataframe/Dataset

2016-11-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16215.
---
Resolution: Duplicate

> Reduce runtime overhead of a program that writes a primitive array in 
> Dataframe/Dataset
> 
>
> Key: SPARK-16215
> URL: https://issues.apache.org/jira/browse/SPARK-16215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When we run a program that writes a primitive array in Dataframe/Dataset, the 
> generated code for the projection performs a null check and an element-by-element 
> copy. We can reduce the overhead of these operations by generating better code.
> A target program
> {code}
> val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
> df.selectExpr("Array(value + 1.1d, value + 2.2d)").collect
> {code}
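
To inspect the projection code this refers to, the generated Java can be dumped 
from the debug helpers (a sketch assuming a spark-shell-like session where 
`sparkContext` and the toDF implicits are in scope, and where `debugCodegen` from 
`org.apache.spark.sql.execution.debug` is available, as in Spark 2.x):
{code}
import org.apache.spark.sql.execution.debug._

// Same target program as above, following the snippet in the description.
val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
// Print the generated projection code to stdout for inspection.
df.selectExpr("Array(value + 1.1d, value + 2.2d)").debugCodegen()
{code}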



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


