[jira] [Resolved] (SPARK-18368) Regular expression replace throws NullPointerException when serialized
[ https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18368. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.1.0 2.0.3 > Regular expression replace throws NullPointerException when serialized > -- > > Key: SPARK-18368 > URL: https://issues.apache.org/jira/browse/SPARK-18368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.0.3, 2.1.0 > > > This query fails with a [NullPointerException on line > 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: > {code} > SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS > (rank, col0) FROM table; > {code} > The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized > after it is instantiated. The null value is a transient StringBuffer that > should hold the result. The fix is to make the result value lazy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
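The fix described above is in Spark's Scala code (making the transient result buffer a lazy val). As an illustration only, the same failure mode and fix can be reproduced outside Spark: a scratch field that is deliberately dropped during serialization must be recreated lazily, not assumed to exist. The class below is a hypothetical Python analogue, not Spark's implementation.

```python
import pickle
import re

class RegexReplace:
    """Sketch of the SPARK-18368 fix: a transient scratch buffer is
    recreated lazily, so an instance serialized *after* construction
    (as POSEXPLODE caused) still works instead of hitting a null field."""

    def __init__(self, pattern, replacement):
        self.pattern = pattern
        self.replacement = replacement
        self._result = None  # transient scratch state, created on demand

    def __getstate__(self):
        # Mimic a Java `transient` field: drop the buffer when serializing.
        state = self.__dict__.copy()
        state["_result"] = None
        return state

    @property
    def result(self):
        # The "lazy" part of the fix: recreate the buffer if it is missing
        # instead of assuming the constructor-initialized one survived.
        if self._result is None:
            self._result = []
        return self._result

    def replace(self, value):
        self.result.clear()
        self.result.append(re.sub(self.pattern, self.replacement, value))
        return self.result[0]

r = RegexReplace(r"[\[ \]]", "")
r.replace("[1, 2]")                  # instantiated, buffer populated
r2 = pickle.loads(pickle.dumps(r))   # serialized after instantiation
print(r2.replace("[3, 4]"))          # prints "3,4" instead of failing
```

Without the laziness (i.e. if `replace` dereferenced `self._result` directly), the deserialized copy would fail exactly like the NullPointerException in the report.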
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650025#comment-15650025 ] Reynold Xin commented on SPARK-18352: - Again, this has nothing to do with streaming. It should just be an option (e.g. multilineJson, or wholeFile) for JSON. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18333) Revert hacks in parquet and orc reader to support case insensitive resolution
[ https://issues.apache.org/jira/browse/SPARK-18333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18333. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.1.0 > Revert hacks in parquet and orc reader to support case insensitive resolution > - > > Key: SPARK-18333 > URL: https://issues.apache.org/jira/browse/SPARK-18333 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > > > These are no longer needed after > https://issues.apache.org/jira/browse/SPARK-17183 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649982#comment-15649982 ] Thomas Sebastian edited comment on SPARK-18352 at 11/9/16 6:49 AM: --- Hi [~rxin], So, do you mean that stream API need not be used,and there should be a new API which can read multiple json files? -Thomas was (Author: thomastechs): Hi Reynold, So, do you mean that stream API need not be used,and there should be a new API which can read multiple json files? -Thomas > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649982#comment-15649982 ] Thomas Sebastian commented on SPARK-18352: -- Hi Reynold, So, do you mean that the stream API need not be used, and there should be a new API which can read multiple JSON files? -Thomas > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649969#comment-15649969 ] Xiao Li commented on SPARK-18350: - Agree. Session-specific SQL conf can be used here. > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649953#comment-15649953 ] yuhao yang commented on SPARK-18374: Just to provide some history for the issue: https://github.com/apache/spark/pull/11871#issuecomment-199832509, according to which the English stopwords were augmented. > Incorrect words in StopWords/english.txt > > > Key: SPARK-18374 > URL: https://issues.apache.org/jira/browse/SPARK-18374 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: nirav patel > > I was just double-checking english.txt's list of stopwords as I felt it was > taking out valid tokens like 'won'. I think the issue is that the english.txt list is > missing the apostrophe character and all characters after the apostrophe. So "won't" > became "won" in that list; "wouldn't" is "wouldn". > Here are some incorrect tokens in this list: > won > wouldn > ma > mightn > mustn > needn > shan > shouldn > wasn > weren > I think the ideal list should have both styles, i.e. won't and wont should both be > part of english.txt as some tokenizers might remove special characters. But > 'won' obviously shouldn't be in this list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
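The suggestion in the report above, that the list should carry both the apostrophe form and the apostrophe-stripped form instead of truncating at the apostrophe, can be sketched in a few lines. This is an illustrative Python snippet, not Spark ML's stopword code; the contraction list is a hypothetical reconstruction of the entries the report calls out.

```python
import re

# Contractions the truncated entries ("won", "wouldn", ...) were
# presumably derived from (hypothetical reconstruction).
contractions = ["won't", "wouldn't", "mightn't", "mustn't", "shouldn't"]

def stopword_variants(words):
    """Cover tokenizer behavior both ways: keep the apostrophe form and
    add the apostrophe-stripped form, instead of truncating at the
    apostrophe (which is what turned "won't" into the valid word "won")."""
    variants = set()
    for w in words:
        variants.add(w)                    # e.g. "won't"
        variants.add(re.sub(r"'", "", w))  # e.g. "wont"
    return variants

stops = stopword_variants(contractions)
print("won't" in stops, "wont" in stops, "won" in stops)  # True True False
```

The point is the last check: "won" (past tense of "win") is never added, so it is no longer filtered as a stopword.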
[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mingjie tang updated SPARK-18372: - Attachment: _thumb_37664.png the staging directory fails to be removed when the Hive table is in HDFS. > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > Attachments: _thumb_37664.png > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3.
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. > .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir > $ {hive.exec.scratchdir} > /$ > {user.name} > /.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649935#comment-15649935 ] mingjie tang commented on SPARK-18372: -- This bug can be reproduced with the following code: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("CREATE TABLE IF NOT EXISTS T1 (key INT, value STRING)") sqlContext.sql("LOAD DATA LOCAL INPATH '../examples/src/main/resources/kv1.txt' INTO TABLE T1") sqlContext.sql("CREATE TABLE IF NOT EXISTS T2 (key INT, value STRING)") val sparktestdf = sqlContext.table("T1") val dfw = sparktestdf.write dfw.insertInto("T2") val sparktestcopypydfdf = sqlContext.sql("""SELECT * from T2 """) sparktestcopypydfdf.show After the user quits the spark-shell, the .hive-staging directory generated by the Hive writer is not removed. For example: the hive table in the directory: /user/hive/warehouse/t2 drwxr-xrwx 3 root wheel 102 Oct 28 15:02 .hive-staging_hive_2016-10-28_15-02-43_288_7070526396398178792-1 > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2.
Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. 
> .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir > $ {hive.exec.scratchdir} > /$ > {user.name} > /.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
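Until the leak itself is fixed, the accumulated `.hive-staging_hive_*` directories can be swept out of a table directory by an external job. Below is a hedged workaround sketch in plain Python (Spark/Hive have no such utility built in; the directory names follow the listings in the report, and it is only safe to run when no job is actively writing to the table).

```python
import shutil
import tempfile
from pathlib import Path

def remove_stale_staging(table_dir):
    """Delete leftover .hive-staging_* directories under a table directory,
    leaving the table's real data files untouched."""
    removed = []
    for d in Path(table_dir).glob(".hive-staging_hive_*"):
        if d.is_dir():
            shutil.rmtree(d)
            removed.append(d.name)
    return removed

# Demonstrate on a throwaway directory shaped like the report's listing.
tmp = Path(tempfile.mkdtemp())
(tmp / ".hive-staging_hive_2016-10-28_15-02-43_288_7070526396398178792-1").mkdir()
(tmp / "part-00000").write_text("data")  # real table file must survive
print(remove_stale_staging(tmp), (tmp / "part-00000").exists())
```

In production the same sweep would typically be filtered by modification time (e.g. only directories older than a day) so an in-flight write is never deleted.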
[jira] [Commented] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649900#comment-15649900 ] zhengruifeng commented on SPARK-18282: -- This is a duplicate of SPARK-18240. But I prefer you to take it over. > Add model summaries for Python GMM and BisectingKMeans > -- > > Key: SPARK-18282 > URL: https://issues.apache.org/jira/browse/SPARK-18282 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > GaussianMixtureModel and BisectingKMeansModel in python do not have model > summaries, but they are implemented in Scala. We should add them for API > parity before 2.1 release. After the QA Jiras are created, this can be linked > as a subtask. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18240) Add Summary of BiKMeans and GMM in pyspark
[ https://issues.apache.org/jira/browse/SPARK-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-18240. Resolution: Duplicate > Add Summary of BiKMeans and GMM in pyspark > -- > > Key: SPARK-18240 > URL: https://issues.apache.org/jira/browse/SPARK-18240 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: zhengruifeng > > Add Summary of BiKMeans and GMM in pyspark. > Since KMeansSummary in pyspark will be added in SPARK-15819, this JIRA will > not deal with KMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649891#comment-15649891 ] mapreduced commented on SPARK-18371: I'll try to test it out hopefully soon. > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
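A common mitigation for the giant-batch symptom above is to combine backpressure with an explicit rate cap (the real confs `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition`), so no single batch can swallow the whole backlog. The toy model below illustrates the arithmetic only; it is not Spark's PID rate estimator.

```python
def next_batch_size(backlog, batch_duration_s, estimated_rate=None, max_rate=None):
    """Records the next micro-batch ingests from the backlog.
    With no rate estimate and no cap (the buggy case), the entire backlog
    lands in one giant batch; a max-rate cap bounds every batch at
    max_rate * batch_duration regardless of backlog size."""
    if estimated_rate is None:
        want = backlog                # no estimate yet: take everything
    else:
        want = int(estimated_rate * batch_duration_s)
    if max_rate is not None:
        want = min(want, int(max_rate * batch_duration_s))
    return min(backlog, want)

backlog = 1_000_000  # records piled up behind a slow batch
print(next_batch_size(backlog, 10))                 # 1000000: the giant batch
print(next_batch_size(backlog, 10, max_rate=500))   # 5000: bounded per batch
```

The expectation stated in the ticket corresponds to the capped case: batch size should converge to what the job can actually process per interval, never jumping to the full backlog.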
[jira] [Commented] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649865#comment-15649865 ] Reynold Xin commented on SPARK-18350: - If it is session specific, I don't think we need an API. Just use the existing SQL conf. Also no need to introduce a new data type. That's a much bigger scope. > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649858#comment-15649858 ] Xiao Li edited comment on SPARK-18350 at 11/9/16 5:26 AM: -- Below might be needed if we want to support session timezone? - Add a SQL statement and API to set the current session timezone? Link: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1006728 - Add a SQL statement and API to get the current session timezone? Link: https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_currenttz.html - Add time zone specific expressions? Link: http://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_tzspecificexpression.html More works are needed if we want to add a new data type {{TIMESTAMP WITH TIME ZONE}} Link: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946 was (Author: smilegator): Below might be needed if we want to support session timezone? - Add a SQL statement and API to set the current session timezone? Link: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1006728 - Add a SQL statement and API to get the current session timezone? Link: https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_currenttz.html - Add time zone specific expressions? 
Link: http://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_tzspecificexpression.html More works are needed if we want to add a new data type {TIMESTAMP WITH TIME ZONE} Link: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946 > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649858#comment-15649858 ] Xiao Li commented on SPARK-18350: - Below might be needed if we want to support session timezone? - Add a SQL statement and API to set the current session timezone? Link: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1006728 - Add a SQL statement and API to get the current session timezone? Link: https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_currenttz.html - Add time zone specific expressions? Link: http://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_tzspecificexpression.html More work is needed if we want to add a new data type {TIMESTAMP WITH TIME ZONE} Link: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946 > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
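The conf-based design discussed in this thread amounts to: keep timestamps stored as UTC instants and apply the session timezone only when rendering. A minimal Python sketch of that idea follows; the conf key name mirrors Spark's naming convention but should be treated as illustrative here, since the ticket had not settled on one.

```python
from datetime import datetime, timezone, timedelta

# Hypothetical session conf: the stored data never changes, only display.
session_conf = {"spark.sql.session.timeZone": timedelta(hours=9)}  # e.g. UTC+9

def render(epoch_seconds, conf):
    """Render a UTC instant in the session's timezone."""
    tz = timezone(conf["spark.sql.session.timeZone"])
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%d %H:%M")

instant = 0  # 1970-01-01 00:00:00 UTC
print(render(instant, session_conf))                                   # 1970-01-01 09:00
print(render(instant, {"spark.sql.session.timeZone": timedelta(0)}))   # 1970-01-01 00:00
```

Two sessions with different conf values see the same instant formatted differently, which is exactly the per-user behavior the ticket asks for without a new data type.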
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649835#comment-15649835 ] Reynold Xin commented on SPARK-18352: - There is already a readStream.json. "Stream" here means not having to read the entire file in memory at once, but rather just "stream through" it, i.e. parse as we scan. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
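"Parse as we scan" can be shown concretely: instead of splitting input into lines and decoding each one, keep a cursor and decode one document at a time from wherever the previous one ended. The sketch below uses Python's stdlib `json.JSONDecoder.raw_decode` purely as an illustration of the technique; Spark's JSON datasource is Scala and works differently.

```python
import json

def parse_json_stream(text):
    """Parse one or more JSON documents from a single string without
    splitting on newlines -- records may span many lines."""
    decoder = json.JSONDecoder()
    pos, n = 0, len(text)
    docs = []
    while pos < n:
        # Skip whitespace between documents (raw_decode requires the
        # cursor to sit on the start of a value).
        while pos < n and text[pos].isspace():
            pos += 1
        if pos == n:
            break
        obj, pos = decoder.raw_decode(text, pos)  # parse, advance cursor
        docs.append(obj)
    return docs

pretty = """
{
  "name": "a",
  "value": 1
}
{
  "name": "b",
  "value": 2
}
"""
print(parse_json_stream(pretty))
# → [{'name': 'a', 'value': 1}, {'name': 'b', 'value': 2}]
```

A line-oriented JSON Lines reader would fail on this input, since every line on its own is invalid JSON; the cursor-based scan handles it without ever holding more than the decoder's state beyond the raw text.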
[jira] [Commented] (SPARK-18377) warehouse path should be a static conf
[ https://issues.apache.org/jira/browse/SPARK-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649809#comment-15649809 ] Apache Spark commented on SPARK-18377: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15825 > warehouse path should be a static conf > -- > > Key: SPARK-18377 > URL: https://issues.apache.org/jira/browse/SPARK-18377 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18377) warehouse path should be a static conf
[ https://issues.apache.org/jira/browse/SPARK-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18377: Assignee: Wenchen Fan (was: Apache Spark) > warehouse path should be a static conf > -- > > Key: SPARK-18377 > URL: https://issues.apache.org/jira/browse/SPARK-18377 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18377) warehouse path should be a static conf
[ https://issues.apache.org/jira/browse/SPARK-18377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18377: Assignee: Apache Spark (was: Wenchen Fan) > warehouse path should be a static conf > -- > > Key: SPARK-18377 > URL: https://issues.apache.org/jira/browse/SPARK-18377 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649803#comment-15649803 ] Jayadevan M commented on SPARK-18352: - [~rxin] Are you looking at a new API like spark.readStream.json(path), similar to spark.read.json(path)? > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18377) warehouse path should be a static conf
Wenchen Fan created SPARK-18377: --- Summary: warehouse path should be a static conf Key: SPARK-18377 URL: https://issues.apache.org/jira/browse/SPARK-18377 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements
[ https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649792#comment-15649792 ] Assaf Mendelson commented on SPARK-17691: - I don't believe a UDF (or rather a UDAF in this case) is the way to go. A UDAF must use standard Spark types, which means it must use immutable arrays. Since ordering is an issue here, an array would have to be created and copied each time the data changes, instead of just the relevant array elements being changed, leading to horrible performance (I actually tried it; it cost more time than the full collect_list). Implementing this as a new function would allow it to use a mutable array internally and improve performance. > Add aggregate function to collect list with maximum number of elements > -- > > Key: SPARK-17691 > URL: https://issues.apache.org/jira/browse/SPARK-17691 > Project: Spark > Issue Type: New Feature >Reporter: Assaf Mendelson >Priority: Minor > > One of the aggregate functions we have today is the collect_list function. > This is a useful tool to do a "catch all" aggregation which doesn't really > fit anywhere else. > The problem with collect_list is that it is unbounded. I would like to see a > means to do a collect_list where we limit the maximum number of elements. > I would see that the input for this would be the maximum number of elements > to use and the method of choosing (pick whatever, pick the top N, pick the > bottom N) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
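The mutable-buffer argument in the comment above is easy to demonstrate: a bounded "collect the top N" only needs O(log n) work per incoming element when the buffer is a mutable heap, whereas rebuilding an immutable array per element is O(n) copies. A Python sketch of the bounded aggregation (illustrative, not Spark's implementation):

```python
import heapq

def collect_list_top_n(values, n):
    """Bounded collect_list keeping the top-n elements, using a mutable
    min-heap buffer: O(log n) per element, versus copying an immutable
    array on every update as a plain UDAF buffer would require."""
    heap = []
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # mutate in place, no array copy
    return sorted(heap, reverse=True)

print(collect_list_top_n([5, 1, 9, 3, 7, 2, 8], 3))  # → [9, 8, 7]
```

"Pick the bottom N" is the same loop with the comparison reversed (or a max-heap), and "pick whatever" is simply stopping once the buffer holds n elements.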
[jira] [Assigned] (SPARK-18376) Skip subexpression elimination for conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18376: Assignee: Apache Spark > Skip subexpression elimination for conditional expressions > -- > > Key: SPARK-18376 > URL: https://issues.apache.org/jira/browse/SPARK-18376 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > We should disallow subexpression elimination for expressions wrapped in > conditional expressions such as {{If}}. > Because for an example like this: > {code} > if (isNull(subexpr)) { > ... > } else { > AssertNotNull(subexpr) // subexpr2 > > SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2) > } > {code} > AssertNotNull(subexpr) will be recognized as a subexpression and evaluated > even if the else branch is never run. In such cases, it can cause unexpected > exceptions and performance regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18376) Skip subexpression elimination for conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18376: Assignee: (was: Apache Spark) > Skip subexpression elimination for conditional expressions > -- > > Key: SPARK-18376 > URL: https://issues.apache.org/jira/browse/SPARK-18376 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > We should disallow subexpression elimination for expressions wrapped in > conditional expressions such as {{If}}. > Because for an example like this: > {code} > if (isNull(subexpr)) { > ... > } else { > AssertNotNull(subexpr) // subexpr2 > > SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2) > } > {code} > AssertNotNull(subexpr) will be recognized as a subexpression and evaluate > even the else branch is never run. Under such cases, it possibly causes not > excepted exception and performance regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18376) Skip subexpression elimination for conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-18376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649779#comment-15649779 ] Apache Spark commented on SPARK-18376: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/15824 > Skip subexpression elimination for conditional expressions > -- > > Key: SPARK-18376 > URL: https://issues.apache.org/jira/browse/SPARK-18376 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > We should disallow subexpression elimination for expressions wrapped in > conditional expressions such as {{If}}. > Consider an example like this: > {code} > if (isNull(subexpr)) { > ... > } else { > AssertNotNull(subexpr) // subexpr2 > > SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2) > } > {code} > AssertNotNull(subexpr) will be recognized as a subexpression and evaluated > even when the else branch is never run. In such cases, it can cause > unexpected exceptions and performance regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18376) Skip subexpression elimination for conditional expressions
Liang-Chi Hsieh created SPARK-18376: --- Summary: Skip subexpression elimination for conditional expressions Key: SPARK-18376 URL: https://issues.apache.org/jira/browse/SPARK-18376 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh We should disallow subexpression elimination for expressions wrapped in conditional expressions such as {{If}}. Consider an example like this: {code} if (isNull(subexpr)) { ... } else { AssertNotNull(subexpr) // subexpr2 SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2) } {code} AssertNotNull(subexpr) will be recognized as a subexpression and evaluated even when the else branch is never run. In such cases, it can cause unexpected exceptions and performance regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
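The hazard described above can be mimicked in plain Python (the names below are illustrative stand-ins, not Catalyst code): hoisting the shared AssertNotNull out of the conditional evaluates it even on the null path.

```python
# Illustrative stand-ins for Catalyst's If / AssertNotNull semantics.
def assert_not_null(x):
    if x is None:
        raise ValueError("null value")
    return x

def eval_with_naive_cse(subexpr):
    # Naive subexpression elimination hoists the shared expression and
    # evaluates it up front -- even when only the null branch runs.
    common = assert_not_null(subexpr)
    return "was null" if subexpr is None else common

def eval_without_cse(subexpr):
    # Leaving the expression inside the branch evaluates it only when
    # the else branch is actually taken.
    if subexpr is None:
        return "was null"
    return assert_not_null(subexpr)

print(eval_without_cse(None))  # was null
try:
    eval_with_naive_cse(None)
except ValueError:
    print("naive CSE raised on the null path")
```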
[jira] [Commented] (SPARK-18191) Port RDD API to use commit protocol
[ https://issues.apache.org/jira/browse/SPARK-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649674#comment-15649674 ] Apache Spark commented on SPARK-18191: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/15823 > Port RDD API to use commit protocol > --- > > Key: SPARK-18191 > URL: https://issues.apache.org/jira/browse/SPARK-18191 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Jiang Xingbo > Fix For: 2.1.0 > > > Commit protocol is actually not specific to SQL. We can move it over to core > so the RDD API can use it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649638#comment-15649638 ] Cody Koeninger commented on SPARK-18371: Thanks for digging into this. The other thing I noticed when working on https://github.com/apache/spark/pull/15132 is that the return value of getLatestRate was cast to Int, which seems wrong and possibly subject to overflow. If you have the ability to test that PR (shouldn't require a spark redeploy, since the kafka jar is standalone), may want to test it out. > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
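The Int-cast concern in the comment above can be illustrated with a small simulation of 32-bit truncation (the helper below is an editor's sketch; a JVM Long-to-Int cast behaves like this two's-complement wrap):

```python
# Simulate a JVM (long -> int) cast: keep the low 32 bits and
# interpret them as a signed two's-complement value.
def to_int32(x):
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

# A rate that fits comfortably in a Long overflows an Int.
rate = 3_000_000_000
print(to_int32(rate))  # -1294967296, a nonsense negative rate
print(to_int32(100))   # 100, small values pass through unchanged
```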
[jira] [Updated] (SPARK-18375) Upgrade netty to 4.0.42.Final
[ https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-18375: Summary: Upgrade netty to 4.0.42.Final (was: Upgrade netty to 4.0.42) > Upgrade netty to 4.0.42.Final > -- > > Key: SPARK-18375 > URL: https://issues.apache.org/jira/browse/SPARK-18375 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li > > One of the important changes in 4.0.42.Final is "Support any FileRegion > implementation when using epoll transport > [#5825|https://github.com/netty/netty/pull/5825]". > In > 4.0.42.Final, [MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] > can work properly when {{spark.(shuffle, rpc).io.mode}} is set to epoll. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18375) Upgrade netty to 4.0.42
[ https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-18375: Summary: Upgrade netty to 4.0.42 (was: Upgrade netty to 4.042) > Upgrade netty to 4.0.42 > --- > > Key: SPARK-18375 > URL: https://issues.apache.org/jira/browse/SPARK-18375 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li > > One of the important changes for 4.0.42.Final is "Support any FileRegion > implementation when using epoll transport > [#5825|https://github.com/netty/netty/pull/5825];. > In > 4.0.42.Final,[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] > can work properly when {{spark.(shufflem, rpc).io.mode}} is set to epoll -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18375) Upgrade netty to 4.042
[ https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-18375: Description: One of the important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport [#5825|https://github.com/netty/netty/pull/5825];. In 4.0.42.Final,[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] can work properly when {{spark.(shufflem, rpc).io.mode}} is set to epoll was: The important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport [#5825|https://github.com/netty/netty/pull/5825];. In 4.0.42.Final,[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] can work properly when {{spark.(shufflem, rpc).io.mode}} is set to epoll > Upgrade netty to 4.042 > -- > > Key: SPARK-18375 > URL: https://issues.apache.org/jira/browse/SPARK-18375 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li > > One of the important changes for 4.0.42.Final is "Support any FileRegion > implementation when using epoll transport > [#5825|https://github.com/netty/netty/pull/5825];. > In > 4.0.42.Final,[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] > can work properly when {{spark.(shufflem, rpc).io.mode}} is set to epoll -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18375) Upgrade netty to 4.042
[ https://issues.apache.org/jira/browse/SPARK-18375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-18375: Affects Version/s: 1.6.2 > Upgrade netty to 4.042 > -- > > Key: SPARK-18375 > URL: https://issues.apache.org/jira/browse/SPARK-18375 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Guoqiang Li > > The important changes for 4.0.42.Final is "Support any FileRegion > implementation when using epoll transport > [#5825|https://github.com/netty/netty/pull/5825];. > In > 4.0.42.Final,[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] > can work properly when {{spark.(shufflem, rpc).io.mode}} is set to epoll -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18374) Incorrect words in StopWords/english.txt
nirav patel created SPARK-18374: --- Summary: Incorrect words in StopWords/english.txt Key: SPARK-18374 URL: https://issues.apache.org/jira/browse/SPARK-18374 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.0.1 Reporter: nirav patel I was just double-checking english.txt's list of stopwords because I felt it was removing valid tokens like 'won'. I think the issue is that the english.txt list is missing the apostrophe character and all characters after the apostrophe. So "won't" became "won" in that list; "wouldn't" became "wouldn". Here are some incorrect tokens in this list: won wouldn ma mightn mustn needn shan shouldn wasn weren I think the ideal list should have both styles, i.e. both "won't" and "wont" should be part of english.txt, since some tokenizers might remove special characters. But 'won' obviously shouldn't be in this list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
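A quick way to see the effect (the helper below is an editor's illustration, not Spark's tokenizer): stripping the apostrophe and everything after it is exactly the transformation that appears to have produced the list entries above.

```python
import re

# Drop the apostrophe and everything after it, which is the
# transformation the english.txt entries appear to have gone through.
def strip_contraction(word):
    return re.sub(r"'.*$", "", word)

print(strip_contraction("won't"))      # won
print(strip_contraction("wouldn't"))   # wouldn
print(strip_contraction("spark"))      # spark (unchanged)
```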
[jira] [Created] (SPARK-18375) Upgrade netty to 4.042
Guoqiang Li created SPARK-18375: --- Summary: Upgrade netty to 4.042 Key: SPARK-18375 URL: https://issues.apache.org/jira/browse/SPARK-18375 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 2.0.1 Reporter: Guoqiang Li The important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport [#5825|https://github.com/netty/netty/pull/5825];. In 4.0.42.Final,[MessageWithHeader|https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java] can work properly when {{spark.(shufflem, rpc).io.mode}} is set to epoll -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService
[ https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649506#comment-15649506 ] Steven Rand commented on SPARK-18364: - The solution I'd initially had in mind, simply creating a {{MetricsSystem}} in {{YarnShuffleService}}, isn't ideal, since then {{network-yarn}} has to depend on {{core}}. I tried splitting the metrics code currently in {{core}} into its own module, but this is pretty tough given how much other code from {{core}} the metrics code depends on. And the {{metrics}} module can't depend on {{core}}, since that creates a circular dependency (core has to depend on metrics). Another idea might be to try to get metrics out of the executors instead of from ExternalShuffleBlockHandler, though this also seems tricky. Open to ideas on how to proceed here if anyone has them. > expose metrics for YarnShuffleService > - > > Key: SPARK-18364 > URL: https://issues.apache.org/jira/browse/SPARK-18364 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.0.1 >Reporter: Steven Rand >Priority: Minor > Original Estimate: 336h > Remaining Estimate: 336h > > ExternalShuffleService exposes metrics as of SPARK-16405. However, > YarnShuffleService does not. > The work of instrumenting ExternalShuffleBlockHandler was already done in > SPARK-1645, so this JIRA is for creating a MetricsSystem in > YarnShuffleService similarly to how ExternalShuffleService already does it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649436#comment-15649436 ] Apache Spark commented on SPARK-13534: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/15821 > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialization process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add a new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request an Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
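The bitmask-to-sentinel step described in the issue can be sketched without Arrow itself (a toy columnar batch, not the real Arrow format):

```python
import math

# Toy columnar batch: a values buffer plus a validity mask, loosely
# mirroring Arrow's layout for a flat schema.
values = [1.5, 0.0, 3.25, 0.0]
validity = [True, False, True, False]  # False marks a null slot

# Deserializing to a pandas-like column: consult the validity mask and
# replace null slots with the sentinel value (NaN).
column = [v if ok else math.nan for v, ok in zip(values, validity)]
print(column)  # [1.5, nan, 3.25, nan]
```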
[jira] [Assigned] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13534: Assignee: (was: Apache Spark) > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialiation process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add an new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request a Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13534: Assignee: Apache Spark > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney >Assignee: Apache Spark > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialiation process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add an new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request a Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs
[ https://issues.apache.org/jira/browse/SPARK-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18373: Assignee: Apache Spark (was: Shixiong Zhu) > Make KafkaSource's failOnDataLoss=false work with Spark jobs > > > Key: SPARK-18373 > URL: https://issues.apache.org/jira/browse/SPARK-18373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Right now failOnDataLoss=false doesn't affect Spark jobs launched by > KafkaSource. The job may still fail the query when some topics are deleted or > some data is aged out. We should handle these corner cases in Spark jobs as > well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs
[ https://issues.apache.org/jira/browse/SPARK-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18373: Assignee: Shixiong Zhu (was: Apache Spark) > Make KafkaSource's failOnDataLoss=false work with Spark jobs > > > Key: SPARK-18373 > URL: https://issues.apache.org/jira/browse/SPARK-18373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now failOnDataLoss=false doesn't affect Spark jobs launched by > KafkaSource. The job may still fail the query when some topics are deleted or > some data is aged out. We should handle these corner cases in Spark jobs as > well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs
[ https://issues.apache.org/jira/browse/SPARK-18373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649430#comment-15649430 ] Apache Spark commented on SPARK-18373: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15820 > Make KafkaSource's failOnDataLoss=false work with Spark jobs > > > Key: SPARK-18373 > URL: https://issues.apache.org/jira/browse/SPARK-18373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now failOnDataLoss=false doesn't affect Spark jobs launched by > KafkaSource. The job may still fail the query when some topics are deleted or > some data is aged out. We should handle these corner cases in Spark jobs as > well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18373) Make KafkaSource's failOnDataLoss=false work with Spark jobs
Shixiong Zhu created SPARK-18373: Summary: Make KafkaSource's failOnDataLoss=false work with Spark jobs Key: SPARK-18373 URL: https://issues.apache.org/jira/browse/SPARK-18373 Project: Spark Issue Type: Bug Components: Structured Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now failOnDataLoss=false doesn't affect Spark jobs launched by KafkaSource. The job may still fail the query when some topics are deleted or some data is aged out. We should handle these corner cases in Spark jobs as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649409#comment-15649409 ] mapreduced commented on SPARK-18371: I worked the math for [PIDRateEstimator|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/rate/PIDRateEstimator.scala] and I found that when there are long scheduling delays it increases the 'historicalError' by a LOT, which even when multiplied by integral (=0.2 default), results in a large negative in the formula: val newRate = (latestRate - proportional * error - integral * historicalError - derivative * dError).max(minRate) As a result, minRate (=100 default) becomes the newRate. Now, when it comes in [DirectKafkaInputDStream|https://github.com/apache/spark/blob/master/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L107] , if you have more than 100 partitions in your kafka topics, you're almost guaranteed to get Math.rounded backpressureRate = 0. Which then [here|https://github.com/apache/spark/blob/master/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L114] leads to returning None. That as a result returns [leaderOffsets|https://github.com/apache/spark/blob/master/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L149] for that batch - hence the giant batch with all records from last batch till head of the queue (leaderOffsets). Proposed solution: Not sure, maybe the minRate could be defaulted to Total number of Partitions in all your kafka topics + some constant. Not sure if anyone has any suggestions to changes in PIDRateEstimator itself. 
Here's the math I did from the example in Look_at_batch_at_22_14.png : *Run1* : latestRate = -1 latestTime = -1 latestError = -1 time = 1478587297000 processingDelay = 342000 delaySinceUpdate = 1478587298 processingRate = (1080/342000) * 1000 = 31578.94 error = -1 -31579 = -31580 historicalError = 8 * 31579 / 6 = 4.21 dError = (-31580 - 4.21)/ 1478587298 = 0.2136107254 newRate = (-1 -(1*-31580) - (0.2*4.21) - (0 * 0.2136107254)).max(100) = (31578.158).max(100) = 31578.158 latestTime = 1478587297 latestError = 0 latestRate = 31578.94 Returns None - which results in picking up maxRatePerPartition *Run 2* : time = 1478587615000 processingDelay = 5.3 * 60 * 1000 = 318000 schedulingDelay = 282000 delaySinceUpdate = (1478587615000 - 1478587297000) = 318 processingRate = 1080/318000 * 1000 = 33962.2 error = 31578 - 33962 = -2384 historicalError = 282000 * 33962 / 6 = 159621.4 dError = doesnt matter since multiplied by 0 newRate = (31578 - (1*-2384) - (0.2*159621) - (0 * dError)).max(100) = (2037.72).max(100) = 2037.72 latestRate = 2037.72 latestError = -2384 latestTime = 1478587615000 Returns newRate = 2037.72 *Run 3* : totalLag = 1972830183 perpartition lag = 6576100.61 backpressureRate = 126000 - 129000 time = 1478587795000 delaySinceUpdate = 1478587795000 - 1478587615000 = 180 processingRate = 1080/18 * 1000 = 6 error = 2037 - 6 = -57963 historicalError = 54 * 6 / 6 = 54 dError = doesntMatter newRate = (2037.72 - (1*-57963) -(0.2*54) - 0).max(100) = (-48000).max(100) = 100 latestTime = 1478587795000 latestRate = 100 latestError = -62384 Returns newRate = 100 > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job 
is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
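The failure mode worked through above can be condensed into a small numeric sketch (the constants below are illustrative, not taken from the attached runs; the formula mirrors the PIDRateEstimator line quoted in the comment):

```python
# newRate = (latestRate - proportional*error - integral*historicalError
#            - derivative*dError).max(minRate), per PIDRateEstimator.
def pid_rate(latest_rate, error, historical_error, d_error,
             proportional=1.0, integral=0.2, derivative=0.0,
             min_rate=100.0):
    new_rate = (latest_rate
                - proportional * error
                - integral * historical_error
                - derivative * d_error)
    return max(new_rate, min_rate)

# A long scheduling delay inflates historicalError until the integral
# term swamps everything and the rate collapses to min_rate.
rate = pid_rate(latest_rate=2000.0, error=-2000.0,
                historical_error=500_000.0, d_error=0.0)
print(rate)  # 100.0

# Spread over many Kafka partitions, rounding the per-partition rate
# can then hit 0, which the direct stream treats as "no limit".
print(round(rate / 250))  # 0
```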
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649405#comment-15649405 ] Bryan Cutler commented on SPARK-13534: -- I've been working on this with [~xusen]. We have a very rough WIP branch that I'll link here in case others want to pitch in or review while we are working out the kinks. > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialiation process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. 
I propose to add an new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request a Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649397#comment-15649397 ] mingjie tang commented on SPARK-18372: -- Solution: This bug was reported by customers. The cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls Hive classes (org.apache.hadoop.hive.*) to create the staging directory. By default, on the Hive side, the staging files are removed once the Hive session expires; however, Spark does not notify Hive to remove them. Therefore, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable to create the .staging directory so that it is removed once the Spark session expires. This change has been tested against Spark 1.5.2 and Spark 1.6.3, and the pull request is : For testing, I manually verified that the .staging files were removed from the table's directory after the spark-shell was closed. > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2. 
Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. 
> .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > <property> > <name>hive.exec.stagingdir</name> > <value>${hive.exec.scratchdir}/${user.name}/.staging</value> > </property> > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
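The fix described in the comment above ties the staging directory's lifetime to the session that created it. A minimal, language-neutral sketch of that idea in Python (the names are illustrative; the actual patch is Scala code inside InsertIntoHiveTable, with process exit standing in for session expiry here):

```python
import atexit
import shutil
import tempfile

def create_staging_dir(prefix=".hive-staging_"):
    """Create a staging directory and register it for removal when
    the process exits (a stand-in for Spark session expiry)."""
    path = tempfile.mkdtemp(prefix=prefix)
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path

staging = create_staging_dir()
print(staging)
```

The point of the design is that cleanup is registered at creation time, so the directory cannot be orphaned even if the job that wrote into it never runs a cleanup step itself.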
[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mingjie tang updated SPARK-18372: - Description: Steps to reproduce: 1. Launch spark-shell 2. Run the following scala code via Spark-Shell scala> val hivesampletabledf = sqlContext.table("hivesampletable") scala> import org.apache.spark.sql.DataFrameWriter scala> val dfw : DataFrameWriter = hivesampletabledf.write scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") scala> dfw.insertInto("hivesampletablecopypy") scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) hivesampletablecopypydfdf.show 3. in HDFS (in our case, WASB), we can see the following folders hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 the issue is that these don't get cleaned up and get accumulated = with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any difference. 
.hive-staging folders are created under the folder - hive/warehouse/hivesampletablecopypy/ we have tried adding this property to hive-site.xml and restart the components - hive.exec.stagingdir $ {hive.exec.scratchdir} /$ {user.name} /.staging a new .hive-staging folder was created in hive/warehouse/ folder moreover, please understand that if we run the hive query in pure Hive via Hive CLI on the same Spark cluster, we don't see the behavior so it doesn't appear to be a Hive issue/behavior in this case- this is a spark behavior I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark configuration already The issue happens via Spark-submit as well - customer used the following command to reproduce this - spark-submit test-hive-staging-cleanup.py was: Steps to reproduce: 1. Launch spark-shell 2. Run the following scala code via Spark-Shell scala> val hivesampletabledf = sqlContext.table("hivesampletable") scala> import org.apache.spark.sql.DataFrameWriter scala> val dfw : DataFrameWriter = hivesampletabledf.write scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") scala> dfw.insertInto("hivesampletablecopypy") scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) hivesampletablecopypydfdf.show 3. 
in HDFS (in our case, WASB), we can see the following folders hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 the issue is that these don't get cleaned up and get accumulated = with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any difference. .hive-staging folders are created under the folder - hive/warehouse/hivesampletablecopypy/ we have tried adding this property to hive-site.xml and restart the components - hive.exec.stagingdir $ {hive.exec.scratchdir} /$ {user.name} /.staging a new .hive-staging folder was created in hive/warehouse/ folder moreover, please understand that if we run the hive query in pure Hive via Hive CLI on the same Spark cluster, we don't see the behavior so it doesn't appear to be a Hive issue/behavior in this case- this is a spark behavior I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark configuration already The issue happens via Spark-submit as well - customer used the following command to reproduce this - spark-submit test-hive-staging-cleanup.py Solution: This bug is reported by customers. The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class of (org.apache.hadoop.hive.) to create the staging directory. Default, from the hive side, this staging file would be removed after the hive session is
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649387#comment-15649387 ] Apache Spark commented on SPARK-18372: -- User 'merlintang' has created a pull request for this issue: https://github.com/apache/spark/pull/15819 > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. > .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir > $ {hive.exec.scratchdir} > /$ > {user.name} > /.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py > Solution: > This bug is reported by customers. > The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class > of (org.apache.hadoop.hive.) to create the staging directory. Default, from > the hive side, this staging file would be removed after the hive session is > expired. However, spark fail to notify the hive to remove the staging files. > Thus, follow the code of spark 2.0.x, I just write one function inside the > InsertIntoHiveTable to create the .staging directory, then, after the session > expired of spark, this .staging directory would be removed. 
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push > request is : > For the test, I have manually checking .staging files from table belong > directory after the spark shell close. meanwhile, please advise how to write > the test case? because the directory for the related tables can not get. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18372: Assignee: (was: Apache Spark) > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. > .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir > $ {hive.exec.scratchdir} > /$ > {user.name} > /.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py > Solution: > This bug is reported by customers. > The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class > of (org.apache.hadoop.hive.) to create the staging directory. Default, from > the hive side, this staging file would be removed after the hive session is > expired. However, spark fail to notify the hive to remove the staging files. > Thus, follow the code of spark 2.0.x, I just write one function inside the > InsertIntoHiveTable to create the .staging directory, then, after the session > expired of spark, this .staging directory would be removed. 
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push > request is : > For the test, I have manually checking .staging files from table belong > directory after the spark shell close. meanwhile, please advise how to write > the test case? because the directory for the related tables can not get. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18372: Assignee: Apache Spark > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang >Assignee: Apache Spark > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. > .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir > $ {hive.exec.scratchdir} > /$ > {user.name} > /.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py > Solution: > This bug is reported by customers. > The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class > of (org.apache.hadoop.hive.) to create the staging directory. Default, from > the hive side, this staging file would be removed after the hive session is > expired. However, spark fail to notify the hive to remove the staging files. > Thus, follow the code of spark 2.0.x, I just write one function inside the > InsertIntoHiveTable to create the .staging directory, then, after the session > expired of spark, this .staging directory would be removed. 
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push > request is : > For the test, I have manually checking .staging files from table belong > directory after the spark shell close. meanwhile, please advise how to write > the test case? because the directory for the related tables can not get. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649386#comment-15649386 ] mingjie tang commented on SPARK-18372: -- the PR is https://github.com/apache/spark/pull/15819 > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 2.0.1 > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > the issue is that these don't get cleaned up and get accumulated > = > with the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any > difference. > .hive-staging folders are created under the folder - > hive/warehouse/hivesampletablecopypy/ > we have tried adding this property to hive-site.xml and restart the > components - > > hive.exec.stagingdir > $ {hive.exec.scratchdir} > /$ > {user.name} > /.staging > > a new .hive-staging folder was created in hive/warehouse/ folder > moreover, please understand that if we run the hive query in pure Hive via > Hive CLI on the same Spark cluster, we don't see the behavior > so it doesn't appear to be a Hive issue/behavior in this case- this is a > spark behavior > I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark > configuration already > The issue happens via Spark-submit as well - customer used the following > command to reproduce this - > spark-submit test-hive-staging-cleanup.py > Solution: > This bug is reported by customers. > The reason is the org.spark.sql.hive.InsertIntoHiveTable call the hive class > of (org.apache.hadoop.hive.) to create the staging directory. Default, from > the hive side, this staging file would be removed after the hive session is > expired. However, spark fail to notify the hive to remove the staging files. > Thus, follow the code of spark 2.0.x, I just write one function inside the > InsertIntoHiveTable to create the .staging directory, then, after the session > expired of spark, this .staging directory would be removed. 
> This update is tested for the spark 1.5.2 and spark 1.6.3, and the push > request is : > For the test, I have manually checking .staging files from table belong > directory after the spark shell close. meanwhile, please advise how to write > the test case? because the directory for the related tables can not get. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
mingjie tang created SPARK-18372: Summary: .Hive-staging folders created from Spark hiveContext are not getting cleaned up Key: SPARK-18372 URL: https://issues.apache.org/jira/browse/SPARK-18372 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2, 1.5.2, 1.6.3 Environment: spark standalone and spark yarn Reporter: mingjie tang Fix For: 2.0.1 Steps to reproduce: 1. Launch spark-shell 2. Run the following scala code via Spark-Shell scala> val hivesampletabledf = sqlContext.table("hivesampletable") scala> import org.apache.spark.sql.DataFrameWriter scala> val dfw : DataFrameWriter = hivesampletabledf.write scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") scala> dfw.insertInto("hivesampletablecopypy") scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) hivesampletablecopypydfdf.show 3. in HDFS (in our case, WASB), we can see the following folders hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 the issue is that these don't get cleaned up and get accumulated = with the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any difference. 
.hive-staging folders are created under the folder - hive/warehouse/hivesampletablecopypy/ we have tried adding this property to hive-site.xml and restarting the components - <property> <name>hive.exec.stagingdir</name> <value>${hive.exec.scratchdir}/${user.name}/.staging</value> </property> a new .hive-staging folder was created in the hive/warehouse/ folder moreover, please understand that if we run the hive query in pure Hive via the Hive CLI on the same Spark cluster, we don't see this behavior so it doesn't appear to be a Hive issue/behavior in this case - this is a Spark behavior I checked in Ambari, spark.yarn.preserve.staging.files=false is already in the Spark configuration The issue happens via spark-submit as well - the customer used the following command to reproduce this - spark-submit test-hive-staging-cleanup.py Solution: This bug was reported by customers. The cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls Hive classes (org.apache.hadoop.hive.*) to create the staging directory. By default, on the Hive side, the staging files are removed once the Hive session expires; however, Spark does not notify Hive to remove them. Therefore, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable to create the .staging directory so that it is removed once the Spark session expires. This change has been tested against Spark 1.5.2 and Spark 1.6.3, and the pull request is : For testing, I manually verified that the .staging files were removed from the table's directory after the spark-shell was closed. Meanwhile, please advise how to write a test case for this, because the test cannot obtain the directory for the related tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18359) No option in read csv for other decimal delimiter than dot
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649352#comment-15649352 ] Hyukjin Kwon commented on SPARK-18359: -- I guess that should be locale-specific. I believe we recently swept the locale to {{Locale.US}}. Maybe we need an option to specify the locale. I guess this is also loosely related to SPARK-18350. > No option in read csv for other decimal delimiter than dot > -- > > Key: SPARK-18359 > URL: https://issues.apache.org/jira/browse/SPARK-18359 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: yannick Radji > > On the DataFrameReader object there is no CSV-specific option to set the decimal > delimiter to a comma rather than a dot, as is customary in France and elsewhere in Europe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
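Until such a locale option exists, a common workaround is to read the affected column as a string and normalize the separators before casting to a number. A minimal sketch of that conversion in plain Python (in Spark SQL the same effect is typically achieved with a string replacement plus a cast; the function name here is hypothetical):

```python
def parse_european_decimal(text):
    """Parse a decimal written with a comma separator, e.g. '3,14'.
    Dots are treated as thousands separators and stripped first,
    so '1.234,5' becomes 1234.5."""
    normalized = text.replace(".", "").replace(",", ".")
    return float(normalized)

print(parse_european_decimal("1.234,5"))  # 1234.5
```

Note this assumes the European convention throughout the column; a proper fix, as the comment suggests, would let the reader take an explicit locale.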
[jira] [Commented] (SPARK-14914) Test Cases fail on Windows
[ https://issues.apache.org/jira/browse/SPARK-14914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649344#comment-15649344 ] Apache Spark commented on SPARK-14914: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/15818 > Test Cases fail on Windows > -- > > Key: SPARK-14914 > URL: https://issues.apache.org/jira/browse/SPARK-14914 > Project: Spark > Issue Type: Test >Reporter: Tao LI >Assignee: Tao LI > Fix For: 2.1.0 > > > There are lots of test failures on Windows, mainly for the following reasons: > 1. Resources (e.g., temp files) haven't been properly closed after use, > so an IOException is raised while deleting the temp directory. > 2. Command lines that are too long on Windows. > 3. File path problems caused by the drive label and backslashes in absolute > Windows paths. > 4. Plain Java bugs on Windows: >a. setReadable doesn't work on Windows for directories. >b. setWritable doesn't work on Windows. >c. A memory-mapped file can't be read at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
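Reason 3 in the list above (drive labels and backslashes) is the kind of mismatch that bites whenever a test compares path strings literally. A small illustration of why naive comparisons fail, using Python's pathlib as a stand-in for the JVM path handling (the example path is made up):

```python
from pathlib import PureWindowsPath

# A Windows absolute path carries a drive label and uses backslashes,
# so a literal string comparison against a POSIX-style path fails...
win = PureWindowsPath(r"C:\spark\target\tmp")
assert str(win) != "C:/spark/target/tmp"

# ...while comparing normalized forms succeeds.
assert win.as_posix() == "C:/spark/target/tmp"
assert win.drive == "C:"
print(win.as_posix())  # C:/spark/target/tmp
```

Normalizing to one separator convention before comparing (as the PR presumably does on the Scala side) avoids a whole class of these failures.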
[jira] [Commented] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans
[ https://issues.apache.org/jira/browse/SPARK-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649345#comment-15649345 ] Sandeep Singh commented on SPARK-18369: --- I think it's already deprecated: https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L321-L322 https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L347 > Deprecate runs in Pyspark mllib KMeans > -- > > Key: SPARK-18369 > URL: https://issues.apache.org/jira/browse/SPARK-18369 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > We should deprecate runs in the PySpark MLlib KMeans algorithm, as we have done in > Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
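Parameter deprecation of the kind the linked lines implement boils down to accepting the argument, warning, and ignoring it. A standalone sketch of that pattern (the train function and message below are illustrative stand-ins, not the actual MLlib signature):

```python
import warnings

def train(k, runs=None):
    """Toy stand-in for KMeans.train: accept a deprecated 'runs'
    argument, warn if it is passed, and otherwise ignore it."""
    if runs is not None:
        warnings.warn(
            "The 'runs' parameter is deprecated and has no effect.",
            DeprecationWarning,
            stacklevel=2,
        )
    return {"k": k}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    train(k=3, runs=10)
print(len(caught))  # 1
```

Keeping the parameter in the signature (rather than removing it) preserves source compatibility for callers that still pass it positionally or by keyword.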
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mapreduced updated SPARK-18371: --- Attachment: GiantBatch3.png > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mapreduced updated SPARK-18371: --- Attachment: Giant_batch_at_23_00.png > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, GiantBatch3.png, > Giant_batch_at_23_00.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mapreduced updated SPARK-18371: --- Attachment: Look_at_batch_at_22_14.png > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
[ https://issues.apache.org/jira/browse/SPARK-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mapreduced updated SPARK-18371: --- Attachment: GiantBatch2.png > Spark Streaming backpressure bug - generates a batch with large number of > records > - > > Key: SPARK-18371 > URL: https://issues.apache.org/jira/browse/SPARK-18371 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: mapreduced > Attachments: GiantBatch2.png, Look_at_batch_at_22_14.png > > > When the streaming job is configured with backpressureEnabled=true, it > generates a GIANT batch of records if the processing time + scheduled delay > is (much) larger than batchDuration. This creates a backlog of records like > no other and results in batches queueing for hours until it chews through > this giant batch. > Expectation is that it should reduce the number of records per batch in some > time to whatever it can really process. > Attaching some screen shots where it seems that this issue is quite easily > reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18371) Spark Streaming backpressure bug - generates a batch with large number of records
mapreduced created SPARK-18371: -- Summary: Spark Streaming backpressure bug - generates a batch with large number of records Key: SPARK-18371 URL: https://issues.apache.org/jira/browse/SPARK-18371 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.0 Reporter: mapreduced When the streaming job is configured with backpressureEnabled=true, it generates a GIANT batch of records if the processing time + scheduled delay is (much) larger than batchDuration. This creates a backlog of records like no other and results in batches queueing for hours until it chews through this giant batch. Expectation is that it should reduce the number of records per batch in some time to whatever it can really process. Attaching some screen shots where it seems that this issue is quite easily reproducible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
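The expectation stated in the report — reduce the number of records per batch to what the job can actually process — amounts to capping each batch by the observed processing rate. The sketch below is a hypothetical illustration of that idea, not Spark's actual backpressure (PID) rate estimator; the function name and formula are assumptions.

```python
# Hypothetical sketch of the behavior the reporter expects from backpressure:
# cap the next batch at roughly what the previous batch's processing rate can
# handle in one batch interval, instead of ingesting the whole backlog at once.

def next_batch_size(backlog: int, processed_last_batch: int,
                    batch_duration_s: float, processing_time_s: float) -> int:
    # Effective processing rate observed in the previous batch (records/sec).
    rate = processed_last_batch / max(processing_time_s, 1e-9)
    # Admit only what that rate can handle within one batch duration.
    cap = int(rate * batch_duration_s)
    return min(backlog, cap)

# With a cap: a 1M-record backlog is drained in bounded slices (200 rec/s * 10 s).
assert next_batch_size(1_000_000, 5_000, 10.0, 25.0) == 2000
# Without a cap, the scheduler would take the entire backlog — the "giant batch".
```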
[jira] [Assigned] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18366: Assignee: (was: Apache Spark) > Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer > --- > > Key: SPARK-18366 > URL: https://issues.apache.org/jira/browse/SPARK-18366 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > We should add the new {{handleInvalid}} param for these transformers to > Python to maintain API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649319#comment-15649319 ] Apache Spark commented on SPARK-18366: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/15817 > Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer > --- > > Key: SPARK-18366 > URL: https://issues.apache.org/jira/browse/SPARK-18366 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > We should add the new {{handleInvalid}} param for these transformers to > Python to maintain API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18366: Assignee: Apache Spark > Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer > --- > > Key: SPARK-18366 > URL: https://issues.apache.org/jira/browse/SPARK-18366 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > We should add the new {{handleInvalid}} param for these transformers to > Python to maintain API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
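For reference, the {{handleInvalid}} semantics being ported ("error" raises on NaN, "skip" drops the row, "keep" routes invalid values to an extra bucket) can be sketched with stdlib Python. `bucketize` is an illustrative stand-in, not the actual Bucketizer API.

```python
import bisect
import math

# Stdlib-only sketch of handleInvalid semantics ("error" / "skip" / "keep");
# this mimics, but is not, ML's Bucketizer. Buckets are [splits[i], splits[i+1]).

def bucketize(values, splits, handle_invalid="error"):
    out = []
    for v in values:
        if math.isnan(v):
            if handle_invalid == "error":
                raise ValueError("NaN encountered")
            if handle_invalid == "skip":
                continue
            out.append(len(splits) - 1)  # "keep": extra bucket after the last
            continue
        out.append(bisect.bisect_right(splits, v) - 1)
    return out

splits = [float("-inf"), 0.0, 10.0, float("inf")]
assert bucketize([-5.0, 3.0, 42.0], splits) == [0, 1, 2]
assert bucketize([3.0, float("nan")], splits, "skip") == [1]
assert bucketize([float("nan")], splits, "keep") == [3]
```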
[jira] [Created] (SPARK-18370) InsertIntoHadoopFsRelationCommand should keep track of its table
Herman van Hovell created SPARK-18370: - Summary: InsertIntoHadoopFsRelationCommand should keep track of its table Key: SPARK-18370 URL: https://issues.apache.org/jira/browse/SPARK-18370 Project: Spark Issue Type: Bug Reporter: Herman van Hovell Assignee: Herman van Hovell Priority: Minor When we plan a {{InsertIntoHadoopFsRelationCommand}} we drop the {{Table}} name. This is quite annoying when debugging plans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18239) Gradient Boosted Tree wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-18239. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.0 Target Version/s: (was: 2.1.0) > Gradient Boosted Tree wrapper in SparkR > --- > > Key: SPARK-18239 > URL: https://issues.apache.org/jira/browse/SPARK-18239 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.0.1 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.0, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend
[ https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649200#comment-15649200 ] Felix Cheung commented on SPARK-18131: -- We discussed this as a part of the GBT PR, from here https://github.com/apache/spark/pull/15746#issuecomment-259217671 " The best sparse vector support in R comes from the Matrix package - But it's a big package and I don't think we should add that as a dependency. We could try to do a wrapper where, if the user already has the package installed, we return it using Matrix? " I think this is a good idea to do in a general sense for any SparseVector in the SerDe code. > Support returning Vector/Dense Vector from backend > -- > > Key: SPARK-18131 > URL: https://issues.apache.org/jira/browse/SPARK-18131 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Miao Wang > > For `spark.logit`, there is a `probabilityCol`, which is a vector in the > backend (Scala side). When we do collect(select(df, "probabilityCol")), the > backend returns the Java object handle (a memory address). We need to implement > a method to convert a Vector/DenseVector column to an R vector, which can be > read in SparkR. It is a follow-up JIRA of adding `spark.logit`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
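The optional-dependency wrapper idea quoted above — return the rich sparse type only when the package is installed, otherwise fall back to a plain dense vector — can be sketched as follows. This is a Python illustration of the design only (with scipy standing in for R's Matrix package); it is not the SparkR SerDe code, and the function name is hypothetical.

```python
import importlib.util

# Sketch of the optional-dependency design discussed above: use a rich sparse
# representation only if an optional package is installed, else fall back to a
# plain dense vector. ("scipy" stands in for R's Matrix package here.)

def sparse_to_vector(size, indices, values):
    if importlib.util.find_spec("scipy") is not None:
        from scipy.sparse import csr_matrix  # optional rich representation
        return csr_matrix((values, ([0] * len(indices), indices)), shape=(1, size))
    dense = [0.0] * size  # fallback: plain dense vector
    for i, v in zip(indices, values):
        dense[i] = v
    return dense
```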
[jira] [Comment Edited] (SPARK-18320) ML 2.1 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649171#comment-15649171 ] Seth Hendrickson edited comment on SPARK-18320 at 11/8/16 11:23 PM: I scanned through the {{@Since("2.1.0")}} tags in ml/mllib. The major things that were added were LSH and clustering summaries, which are linked and have JIRAs. I made JIRAs for a couple other minor things as well and linked them. was (Author: sethah): I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things that were added were LSH and clustering summaries, which are linked and have JIRAs. I made JIRAs for a couple other minor things as well and linked them. > ML 2.1 QA: API: Python API coverage > --- > > Key: SPARK-18320 > URL: https://issues.apache.org/jira/browse/SPARK-18320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. 
> *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18320) ML 2.1 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649171#comment-15649171 ] Seth Hendrickson commented on SPARK-18320: -- I scanned through the {{@Since("2.1.0") tags in ml/mllib}}. The major things that were added were LSH and clustering summaries, which are linked and have JIRAs. I made JIRAs for a couple other minor things as well and linked them. > ML 2.1 QA: API: Python API coverage > --- > > Key: SPARK-18320 > URL: https://issues.apache.org/jira/browse/SPARK-18320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
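The "missing classes/methods/parameters" part of the audit reduces to a set difference between the two API surfaces. A minimal sketch, with made-up API names (these are illustrative, not real Spark APIs):

```python
# Minimal sketch of the parity audit: given the public API names collected
# from the Scala and Python sides (by whatever means), report what is missing
# from Python so to-do JIRAs can be filed. Names below are illustrative only.

scala_api = {"KMeans.setK", "KMeans.setSeed", "LSH.approxNearestNeighbors"}
python_api = {"KMeans.setK", "KMeans.setSeed"}

missing_in_python = sorted(scala_api - python_api)
print(missing_in_python)
```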
[jira] [Created] (SPARK-18369) Deprecate runs in Pyspark mllib KMeans
Seth Hendrickson created SPARK-18369: Summary: Deprecate runs in Pyspark mllib KMeans Key: SPARK-18369 URL: https://issues.apache.org/jira/browse/SPARK-18369 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Seth Hendrickson Priority: Minor We should deprecate runs in pyspark mllib kmeans algo as we have done in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
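A sketch of the deprecation pattern being proposed (names and signature are illustrative; the real change would live in pyspark/mllib/clustering.py): warn when the deprecated `runs` argument is passed, but keep accepting it for compatibility.

```python
import warnings

# Illustrative deprecation pattern (not the actual PySpark code): emit a
# DeprecationWarning when the obsolete `runs` argument is used.

def train_kmeans(data, k, runs=1):
    if runs != 1:
        warnings.warn("The 'runs' parameter is deprecated and has no effect",
                      DeprecationWarning)
    return {"k": k, "centers": None}  # placeholder result

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    train_kmeans([], k=2, runs=5)
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```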
[jira] [Commented] (SPARK-18367) limit() makes the lame walk again
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649132#comment-15649132 ] Nicholas Chammas commented on SPARK-18367: -- On 2.0.x the caching is required due to SPARK-18254, which is fixed in 2.1+. On master I get the same "Too many open files in system" error without the caching if I remove the limit(). > limit() makes the lame walk again > - > > Key: SPARK-18367 > URL: https://issues.apache.org/jira/browse/SPARK-18367 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas > Attachments: plan-with-limit.txt, plan-without-limit.txt > > > I have a complex DataFrame query that fails to run normally but succeeds if I > add a dummy {{limit()}} upstream in the query tree. > The failure presents itself like this: > {code} > ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial > writes to file > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > java.io.FileNotFoundException: > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > (Too many open files in system) > {code} > My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on > macOS. However, I don't think that's the issue, since if I add a dummy > {{limit()}} early on the query tree -- dummy as in it does not actually > reduce the number of rows queried -- then the same query works. > I've diffed the physical query plans to see what this {{limit()}} is actually > doing, and the diff is as follows: > {code} > diff plan-with-limit.txt plan-without-limit.txt > 24,28c24 > <: : : +- *GlobalLimit 100 > <: : :+- Exchange SinglePartition > <: : : +- *LocalLimit 100 > <: : : +- *Project [...] > <: : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > >: : : +- *Scan orc [...] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > 49,53c45 > < : : +- *GlobalLimit 100 > < : :+- Exchange SinglePartition > < : : +- *LocalLimit 100 > < : : +- *Project [...] > < : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > > : : +- *Scan orc [] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > {code} > Does this give any clues as to why this {{limit()}} is helping? Again, the > 100 limit you can see in the plan is much higher than the cardinality of > the dataset I'm reading, so there is no theoretical impact on the output. You > can see the full query plans attached to this ticket. > Unfortunately, I don't have a minimal reproduction for this issue, but I can > work towards one with some clues. > I'm seeing this behavior on 2.0.1 and on master at commit > {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail
[ https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-18342. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 Issue resolved by pull request 15804 [https://github.com/apache/spark/pull/15804] > HDFSBackedStateStore can fail to rename files causing snapshotting and > recovery to fail > --- > > Key: SPARK-18342 > URL: https://issues.apache.org/jira/browse/SPARK-18342 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz >Priority: Critical > Fix For: 2.0.2, 2.1.0 > > > The HDFSBackedStateStore renames temporary files to delta files as it commits > new versions. It however doesn't check whether the rename succeeded. If the > rename fails, then recovery will not be possible. It should fail during the > rename stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
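The fix described — fail at the rename stage rather than later during recovery — amounts to a checked rename. The sketch below is stdlib Python for illustration, not the HDFSBackedStateStore code; `commit_delta` is a hypothetical helper.

```python
import os
import tempfile

# Sketch of a checked rename: verify the commit actually produced the
# destination file and fail loudly at the rename step (per the JIRA), instead
# of leaving recovery to discover a missing delta file later.

def commit_delta(tmp_path: str, final_path: str) -> None:
    os.replace(tmp_path, final_path)    # atomic rename on POSIX
    if not os.path.exists(final_path):  # defensive check on the result
        raise IOError(f"rename to {final_path} failed")

with tempfile.TemporaryDirectory() as d:
    tmp = os.path.join(d, "temp-1.delta.tmp")
    final = os.path.join(d, "1.delta")
    open(tmp, "w").close()
    commit_delta(tmp, final)
    assert os.path.exists(final) and not os.path.exists(tmp)
```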
[jira] [Commented] (SPARK-18367) limit() makes the lame walk again
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649110#comment-15649110 ] Herman van Hovell commented on SPARK-18367: --- Could you try this without caching? > limit() makes the lame walk again > - > > Key: SPARK-18367 > URL: https://issues.apache.org/jira/browse/SPARK-18367 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas > Attachments: plan-with-limit.txt, plan-without-limit.txt > > > I have a complex DataFrame query that fails to run normally but succeeds if I > add a dummy {{limit()}} upstream in the query tree. > The failure presents itself like this: > {code} > ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial > writes to file > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > java.io.FileNotFoundException: > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > (Too many open files in system) > {code} > My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on > macOS. However, I don't think that's the issue, since if I add a dummy > {{limit()}} early on the query tree -- dummy as in it does not actually > reduce the number of rows queried -- then the same query works. > I've diffed the physical query plans to see what this {{limit()}} is actually > doing, and the diff is as follows: > {code} > diff plan-with-limit.txt plan-without-limit.txt > 24,28c24 > <: : : +- *GlobalLimit 100 > <: : :+- Exchange SinglePartition > <: : : +- *LocalLimit 100 > <: : : +- *Project [...] > <: : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > >: : : +- *Scan orc [...] 
Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > 49,53c45 > < : : +- *GlobalLimit 100 > < : :+- Exchange SinglePartition > < : : +- *LocalLimit 100 > < : : +- *Project [...] > < : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > > : : +- *Scan orc [] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > {code} > Does this give any clues as to why this {{limit()}} is helping? Again, the > 100 limit you can see in the plan is much higher than the cardinality of > the dataset I'm reading, so there is no theoretical impact on the output. You > can see the full query plans attached to this ticket. > Unfortunately, I don't have a minimal reproduction for this issue, but I can > work towards one with some clues. > I'm seeing this behavior on 2.0.1 and on master at commit > {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
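As a side note on the {{ulimit -n}} discussion above: each shuffle spill file consumes a file descriptor, which is how a wide shuffle hits "Too many open files". On Unix the effective limit can be checked from Python:

```python
import resource

# Unix-only: inspect the per-process open-files cap that ulimit -n reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
assert soft != 0
```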
[jira] [Assigned] (SPARK-18368) Regular expression replace throws NullPointerException when serialized
[ https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18368: Assignee: (was: Apache Spark) > Regular expression replace throws NullPointerException when serialized > -- > > Key: SPARK-18368 > URL: https://issues.apache.org/jira/browse/SPARK-18368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Ryan Blue > > This query fails with a [NullPointerException on line > 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: > {code} > SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS > (rank, col0) FROM table; > {code} > The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized > after it is instantiated. The null value is a transient StringBuffer that > should hold the result. The fix is to make the result value lazy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18368) Regular expression replace throws NullPointerException when serialized
[ https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18368: Assignee: Apache Spark > Regular expression replace throws NullPointerException when serialized > -- > > Key: SPARK-18368 > URL: https://issues.apache.org/jira/browse/SPARK-18368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Ryan Blue >Assignee: Apache Spark > > This query fails with a [NullPointerException on line > 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: > {code} > SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS > (rank, col0) FROM table; > {code} > The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized > after it is instantiated. The null value is a transient StringBuffer that > should hold the result. The fix is to make the result value lazy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18368) Regular expression replace throws NullPointerException when serialized
[ https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649104#comment-15649104 ] Apache Spark commented on SPARK-18368: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/15816 > Regular expression replace throws NullPointerException when serialized > -- > > Key: SPARK-18368 > URL: https://issues.apache.org/jira/browse/SPARK-18368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Ryan Blue > > This query fails with a [NullPointerException on line > 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: > {code} > SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS > (rank, col0) FROM table; > {code} > The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized > after it is instantiated. The null value is a transient StringBuffer that > should hold the result. The fix is to make the result value lazy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18368) Regular expression replace throws NullPointerException when serialized
[ https://issues.apache.org/jira/browse/SPARK-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-18368: -- Description: This query fails with a [NullPointerException on line 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: {code} SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS (rank, col0) FROM table; {code} The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized after it is instantiated. The null value is a transient StringBuffer that should hold the result. The fix is to make the result value lazy. was: This query fails with a [NullPointerException on line 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: {code} SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS (rank, col0) FROM table; {code} The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized after it is instantiated. The null value is a transient StringBuffer that should hold the result. The fix is to make the result value lazy. > Regular expression replace throws NullPointerException when serialized > -- > > Key: SPARK-18368 > URL: https://issues.apache.org/jira/browse/SPARK-18368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Ryan Blue > > This query fails with a [NullPointerException on line > 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: > {code} > SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS > (rank, col0) FROM table; > {code} > The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized > after it is instantiated. 
The null value is a transient StringBuffer that > should hold the result. The fix is to make the result value lazy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
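The bug and its fix translate naturally into a small Python analogue: exclude the cached object from serialization (like the transient StringBuffer) and rebuild it lazily on first use, so an instance serialized after instantiation no longer dereferences a null cache. This is an illustration of the pattern, not the Catalyst code.

```python
import pickle
import re

# Python analogue of the fix: the compiled pattern is excluded from
# serialization (like a transient field) and rebuilt lazily on first use
# after deserialization, instead of being assumed to exist.

class RegexpReplace:
    def __init__(self, pattern: str):
        self.pattern = pattern
        self._compiled = None            # transient cache, not serialized

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_compiled"] = None        # drop the cache when pickling
        return state

    def replace(self, text: str, repl: str) -> str:
        if self._compiled is None:       # lazy re-initialization
            self._compiled = re.compile(self.pattern)
        return self._compiled.sub(repl, text)

expr = RegexpReplace(r"[\[ \]]")
clone = pickle.loads(pickle.dumps(expr))  # serialized AFTER instantiation
assert clone.replace("[1, 2, 3]", "") == "1,2,3"
```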
[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649090#comment-15649090 ] Taro L. Saito commented on SPARK-14540: --- I'm also hitting a similar problem in my dependency injection library for Scala: https://github.com/wvlet/airframe/pull/39. I feel ClosureCleaner-like functionality is necessary in Scala 2.12 itself. > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Josh Rosen > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown.
(ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info]- 
object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info]- field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. > (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
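The {{NotSerializableException: java.lang.Object}} in the second failure is reproducible without Spark. In the Java 8 / Scala 2.12 lambda encoding, the {{capturedArgs}} of the {{SerializedLambda}} are written out together with the lambda, so one non-serializable capture sinks the whole write — which is precisely the reference a ClosureCleaner would have to null out. A sketch (names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class LambdaCaptureDemo {
    // A serializable SAM type, playing the role of scala.Function1 in 2.12.
    interface SerFunction<A, B> extends Serializable {
        B apply(A a);
    }

    static void serialize(Object obj) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
        }
    }

    public static void main(String[] args) throws Exception {
        Object captured = new Object();  // java.lang.Object is not Serializable

        SerFunction<Integer, Integer> clean = x -> x + 1;                   // captures nothing
        SerFunction<Integer, Integer> dirty = x -> captured.hashCode() + x; // drags `captured` along

        serialize(clean);  // ok: SerializedLambda with empty capturedArgs
        try {
            serialize(dirty);  // capturedArgs = [captured] is written too
        } catch (NotSerializableException e) {
            // Same shape as the test failure above
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```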
[jira] [Created] (SPARK-18368) Regular expression replace throws NullPointerException when serialized
Ryan Blue created SPARK-18368: - Summary: Regular expression replace throws NullPointerException when serialized Key: SPARK-18368 URL: https://issues.apache.org/jira/browse/SPARK-18368 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1, 2.1.0 Reporter: Ryan Blue This query fails with a [NullPointerException on line 247|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L247]: {code} SELECT POSEXPLODE(SPLIT(REGEXP_REPLACE(ranks, '[\\[ \\]]', ''), ',')) AS (rank, col0) FROM table; {code} The problem is that POSEXPLODE is causing the REGEXP_REPLACE to be serialized after it is instantiated. The null value is a transient StringBuffer that should hold the result. The fix is to make the result value lazy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18367) limit() makes the lame walk again
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649063#comment-15649063 ] Nicholas Chammas commented on SPARK-18367: -- I'm not trying to write any files actually. In this case I'm just running show() to trigger the error. I've tried coalescing partitions right before the show(), down to 5 even, but that doesn't seem to help. > limit() makes the lame walk again > - > > Key: SPARK-18367 > URL: https://issues.apache.org/jira/browse/SPARK-18367 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas > Attachments: plan-with-limit.txt, plan-without-limit.txt > > > I have a complex DataFrame query that fails to run normally but succeeds if I > add a dummy {{limit()}} upstream in the query tree. > The failure presents itself like this: > {code} > ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial > writes to file > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > java.io.FileNotFoundException: > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > (Too many open files in system) > {code} > My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on > macOS. However, I don't think that's the issue, since if I add a dummy > {{limit()}} early on the query tree -- dummy as in it does not actually > reduce the number of rows queried -- then the same query works. > I've diffed the physical query plans to see what this {{limit()}} is actually > doing, and the diff is as follows: > {code} > diff plan-with-limit.txt plan-without-limit.txt > 24,28c24 > <: : : +- *GlobalLimit 100 > <: : :+- Exchange SinglePartition > <: : : +- *LocalLimit 100 > <: : : +- *Project [...] 
> <: : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > >: : : +- *Scan orc [...] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > 49,53c45 > < : : +- *GlobalLimit 100 > < : :+- Exchange SinglePartition > < : : +- *LocalLimit 100 > < : : +- *Project [...] > < : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > > : : +- *Scan orc [] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > {code} > Does this give any clues as to why this {{limit()}} is helping? Again, the > 100 limit you can see in the plan is much higher than the cardinality of > the dataset I'm reading, so there is no theoretical impact on the output. You > can see the full query plans attached to this ticket. > Unfortunately, I don't have a minimal reproduction for this issue, but I can > work towards one with some clues. > I'm seeing this behavior on 2.0.1 and on master at commit > {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649039#comment-15649039 ] Eric Liang commented on SPARK-17916: We're hitting this as a regression from 2.0 as well. Ideally, we don't want the empty string to be treated specially in any scenario. The only logic that converts it to nulls should be due to the nullValue option. > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When user configures {{nullValue}} in CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
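The contract Eric describes can be stated as a pure function over a raw CSV field: a field becomes null only when it equals the configured {{nullValue}}, and the empty string survives unless it is the configured value. A plain-Java sketch of that contract (illustrative, not Spark's parser internals — the reported bug is an extra rule mapping {{""}} to null regardless of the option):

```java
import java.util.Optional;

public class NullValueDemo {
    // Desired behavior: null out a field only when it matches nullValue.
    static Optional<String> parseField(String raw, String nullValue) {
        return raw.equals(nullValue) ? Optional.empty() : Optional.of(raw);
    }

    public static void main(String[] args) {
        // With nullValue = "-": the "-" field is nulled, the "" field is kept.
        System.out.println(parseField("-", "-").isPresent()); // false
        System.out.println(parseField("", "-").isPresent());  // true -- kept, not nulled
        // Empty string maps to null only when the user configures it:
        System.out.println(parseField("", "").isPresent());   // false
    }
}
```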
[jira] [Commented] (SPARK-18367) limit() makes the lame walk again
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649022#comment-15649022 ] Herman van Hovell commented on SPARK-18367: --- You might be trying to write a lot of partitions (files) here (given the error). The global limit will reduce the number of partitions to one. Could you try to coalesce partitions before writing? > limit() makes the lame walk again > - > > Key: SPARK-18367 > URL: https://issues.apache.org/jira/browse/SPARK-18367 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas > Attachments: plan-with-limit.txt, plan-without-limit.txt > > > I have a complex DataFrame query that fails to run normally but succeeds if I > add a dummy {{limit()}} upstream in the query tree. > The failure presents itself like this: > {code} > ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial > writes to file > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > java.io.FileNotFoundException: > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > (Too many open files in system) > {code} > My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on > macOS. However, I don't think that's the issue, since if I add a dummy > {{limit()}} early on the query tree -- dummy as in it does not actually > reduce the number of rows queried -- then the same query works. > I've diffed the physical query plans to see what this {{limit()}} is actually > doing, and the diff is as follows: > {code} > diff plan-with-limit.txt plan-without-limit.txt > 24,28c24 > <: : : +- *GlobalLimit 100 > <: : :+- Exchange SinglePartition > <: : : +- *LocalLimit 100 > <: : : +- *Project [...] 
> <: : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > >: : : +- *Scan orc [...] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > 49,53c45 > < : : +- *GlobalLimit 100 > < : :+- Exchange SinglePartition > < : : +- *LocalLimit 100 > < : : +- *Project [...] > < : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > > : : +- *Scan orc [] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > {code} > Does this give any clues as to why this {{limit()}} is helping? Again, the > 100 limit you can see in the plan is much higher than the cardinality of > the dataset I'm reading, so there is no theoretical impact on the output. You > can see the full query plans attached to this ticket. > Unfortunately, I don't have a minimal reproduction for this issue, but I can > work towards one with some clues. > I'm seeing this behavior on 2.0.1 and on master at commit > {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
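One plausible back-of-envelope model for why the dummy limit helps — an assumption on my part, not established in the ticket: concurrent shuffle writers can each hold on the order of one file handle per downstream partition, so the handle count scales with their product, and the {{Exchange SinglePartition}} the limit inserts collapses the downstream side to one partition:

```java
public class OpenFilesEstimate {
    // Rough model (assumption, not Spark's exact accounting): each
    // concurrently running shuffle-map task holds roughly one file handle
    // per reduce partition while writing/spilling.
    static int estimatedHandles(int concurrentTasks, int reducePartitions) {
        return concurrentTasks * reducePartitions;
    }

    public static void main(String[] args) {
        int ulimit = 10_000;  // the reporter's `ulimit -n`
        System.out.println(estimatedHandles(8, 200));    // 1600: under the limit
        System.out.println(estimatedHandles(64, 2000));  // 128000: blows past it
        // Exchange SinglePartition makes the downstream side one partition:
        System.out.println(estimatedHandles(64, 1) <= ulimit);  // true
    }
}
```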
[jira] [Commented] (SPARK-18226) SparkR displaying vector columns in incorrect way
[ https://issues.apache.org/jira/browse/SPARK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649026#comment-15649026 ] Felix Cheung commented on SPARK-18226: -- We discussed this as a part of the GBT PR, from here https://github.com/apache/spark/pull/15746#issuecomment-259217671 " The best sparse vector support in R comes from the Matrix package - But its a big package and I dont think we should add that as a dependency. We could try to do a wrapper where if the user already has the package installed we return it using Matrix ? " I think this is a good idea to do in a general sense for any SparseVector in the SerDe code. > SparkR displaying vector columns in incorrect way > - > > Key: SPARK-18226 > URL: https://issues.apache.org/jira/browse/SPARK-18226 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Grzegorz Chilkiewicz >Priority: Trivial > > I have encountered a problem with SparkR presenting Spark vectors from > org.apache.spark.mllib.linalg package > * `head(df)` shows in vector column: "" > * cast to string does not work as expected, it shows: > "[1,null,null,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@79f50a91]" > * `showDF(df)` work correctly > to reproduce, start SparkR and paste following code (example taken from > https://spark.apache.org/docs/latest/sparkr.html#naive-bayes-model) > {code} > # Fit a Bernoulli naive Bayes model with spark.naiveBayes > titanic <- as.data.frame(Titanic) > titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5]) > nbDF <- titanicDF > nbTestDF <- titanicDF > nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age) > # Model summary > summary(nbModel) > # Prediction > nbPredictions <- predict(nbModel, nbTestDF) > # > # My modification to expose the problem # > nbPredictions$rawPrediction_str <- cast(nbPredictions$rawPrediction, "string") > head(nbPredictions) > showDF(nbPredictions) > {code} -- This message was sent by Atlassian 
JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18367) limit() makes the lame walk again
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18367: - Description: I have a complex DataFrame query that fails to run normally but succeeds if I add a dummy {{limit()}} upstream in the query tree. The failure presents itself like this: {code} ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc java.io.FileNotFoundException: /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc (Too many open files in system) {code} My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on macOS. However, I don't think that's the issue, since if I add a dummy {{limit()}} early on the query tree -- dummy as in it does not actually reduce the number of rows queried -- then the same query works. I've diffed the physical query plans to see what this {{limit()}} is actually doing, and the diff is as follows: {code} diff plan-with-limit.txt plan-without-limit.txt 24,28c24 <: : : +- *GlobalLimit 100 <: : :+- Exchange SinglePartition <: : : +- *LocalLimit 100 <: : : +- *Project [...] <: : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<... --- >: : : +- *Scan orc [...] Format: ORC, > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<... 49,53c45 < : : +- *GlobalLimit 100 < : :+- Exchange SinglePartition < : : +- *LocalLimit 100 < : : +- *Project [...] < : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<... 
--- > : : +- *Scan orc [] Format: ORC, > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<... {code} Does this give any clues as to why this {{limit()}} is helping? Again, the 100 limit you can see in the plan is much higher than the cardinality of the dataset I'm reading, so there is no theoretical impact on the output. You can see the full query plans attached to this ticket. Unfortunately, I don't have a minimal reproduction for this issue, but I can work towards one with some clues. I'm seeing this behavior on 2.0.1 and on master at commit {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}. was: I have a complex DataFrame query that fails to run normally but succeeds if I add a dummy {{limit()}} upstream in the query tree. The failure presents itself like this: {code} ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc java.io.FileNotFoundException: /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc (Too many open files in system) {code} My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on macOS. However, I don't think that's the issue, since if I add a dummy {{limit()}} early on the query tree -- dummy as in it does not actually reduce the number of rows queried -- then the same query works. I've diffed the physical query plans to see what this {{limit()}} is actually doing, and the diff is as follows: {code} diff plan-with-limit.txt plan-without-limit.txt 24,28c24 <: : : +- *GlobalLimit 100 <: : :+- Exchange SinglePartition <: : : +- *LocalLimit 100 <: : : +- *Project [...] <: : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<... 
--- >: : : +- *Scan orc [...] Format: ORC, > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<... 49,53c45 < : : +- *GlobalLimit 100 <
[jira] [Updated] (SPARK-18367) limit() makes the lame walk again
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18367: - Attachment: plan-without-limit.txt plan-with-limit.txt > limit() makes the lame walk again > - > > Key: SPARK-18367 > URL: https://issues.apache.org/jira/browse/SPARK-18367 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas > Attachments: plan-with-limit.txt, plan-without-limit.txt > > > I have a complex DataFrame query that fails to run normally but succeeds if I > add a dummy {{limit()}} upstream in the query tree. > The failure presents itself like this: > {code} > ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial > writes to file > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > java.io.FileNotFoundException: > /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc > (Too many open files in system) > {code} > My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on > macOS. However, I don't think that's the issue, since if I add a dummy > {{limit()}} early on the query tree -- dummy as in it does not actually > reduce the number of rows queried -- then the same query works. > I've diffed the physical query plans to see what this {{limit()}} is actually > doing, and the diff is as follows: > {code} > diff plan-with-limit.txt plan-without-limit.txt > 24,28c24 > <: : : +- *GlobalLimit 100 > <: : :+- Exchange SinglePartition > <: : : +- *LocalLimit 100 > <: : : +- *Project [...] > <: : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > >: : : +- *Scan orc [...] 
Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > 49,53c45 > < : : +- *GlobalLimit 100 > < : :+- Exchange SinglePartition > < : : +- *LocalLimit 100 > < : : +- *Project [...] > < : : +- *Scan orc [...] > Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct<... > --- > > : : +- *Scan orc [] Format: ORC, > > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > > struct<... > {code} > Does this give any clues as to why this {{limit()}} is helping? Again, the > 100 limit you can see in the plan is much higher than the cardinality of > the dataset I'm reading, so there is no theoretical impact on the output. You > can see the full query plans attached to this ticket. > Unfortunately, I don't have a minimal reproduction for this issue, but I can > work towards one with some clues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18367) limit() makes the lame walk again
Nicholas Chammas created SPARK-18367: Summary: limit() makes the lame walk again Key: SPARK-18367 URL: https://issues.apache.org/jira/browse/SPARK-18367 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1, 2.1.0 Environment: Python 3.5, Java 8 Reporter: Nicholas Chammas I have a complex DataFrame query that fails to run normally but succeeds if I add a dummy {{limit()}} upstream in the query tree. The failure presents itself like this: {code} ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc java.io.FileNotFoundException: /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc (Too many open files in system) {code} My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on macOS. However, I don't think that's the issue, since if I add a dummy {{limit()}} early on the query tree -- dummy as in it does not actually reduce the number of rows queried -- then the same query works. I've diffed the physical query plans to see what this {{limit()}} is actually doing, and the diff is as follows: {code} diff plan-with-limit.txt plan-without-limit.txt 24,28c24 <: : : +- *GlobalLimit 100 <: : :+- Exchange SinglePartition <: : : +- *LocalLimit 100 <: : : +- *Project [...] <: : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<... --- >: : : +- *Scan orc [...] Format: ORC, > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<... 49,53c45 < : : +- *GlobalLimit 100 < : :+- Exchange SinglePartition < : : +- *LocalLimit 100 < : : +- *Project [...] < : : +- *Scan orc [...] 
Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<... --- > : : +- *Scan orc [] Format: ORC, > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<... {code} Does this give any clues as to why this {{limit()}} is helping? Again, the 100 limit you can see in the plan is much higher than the cardinality of the dataset I'm reading, so there is no theoretical impact on the output. You can see the full query plans attached to this ticket. Unfortunately, I don't have a minimal reproduction for this issue, but I can work towards one with some clues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-18366: - Component/s: PySpark ML > Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer > --- > > Key: SPARK-18366 > URL: https://issues.apache.org/jira/browse/SPARK-18366 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > We should add the new {{handleInvalid}} param for these transformers to > Python to maintain API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
Seth Hendrickson created SPARK-18366: Summary: Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer Key: SPARK-18366 URL: https://issues.apache.org/jira/browse/SPARK-18366 Project: Spark Issue Type: New Feature Reporter: Seth Hendrickson Priority: Minor We should add the new {{handleInvalid}} param for these transformers to Python to maintain API parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18339: - Labels: (was: correctness) > Don't push down current_timestamp for filters in StructuredStreaming > > > Key: SPARK-18339 > URL: https://issues.apache.org/jira/browse/SPARK-18339 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz > > For the following workflow: > 1. I have a column called time which is at minute level precision in a > Streaming DataFrame > 2. I want to perform groupBy time, count > 3. Then I want my MemorySink to only have the last 30 minutes of counts and I > perform this by > {code} > .where('time >= current_timestamp().cast("long") - 30 * 60) > {code} > what happens is that the `filter` gets pushed down before the aggregation, > and the filter happens on the source data for the aggregation instead of the > result of the aggregation (where I actually want to filter). > I guess the main issue here is that `current_timestamp` is non-deterministic > in the streaming context and shouldn't be pushed down the filter. > Does this require us to store the `current_timestamp` for each trigger of the > streaming job, that is something to discuss. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
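A sketch of the hazard (my framing, not from the ticket): a predicate that re-reads the clock at every evaluation can judge each row against a different cutoff, so it cannot be freely moved past an aggregation; capturing the clock once into a plain value makes the predicate deterministic, after which placement no longer matters. In plain Java:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

public class DeterministicFilterDemo {
    static class Row {
        final long time; final long count;
        Row(long time, long count) { this.time = time; this.count = count; }
    }

    // Non-deterministic: the clock is re-read per evaluation, so where the
    // optimizer places this filter changes which rows survive.
    static List<Row> keepRecentNondet(List<Row> rows, LongSupplier now) {
        List<Row> kept = new ArrayList<>();
        for (Row r : rows) {
            if (r.time >= now.getAsLong() - 30 * 60) kept.add(r);
        }
        return kept;
    }

    // Workaround: evaluate the clock exactly once into a literal cutoff.
    static List<Row> keepRecent(List<Row> rows, long cutoff) {
        List<Row> kept = new ArrayList<>();
        for (Row r : rows) {
            if (r.time >= cutoff) kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        long[] tick = {10_000L};
        LongSupplier movingClock = () -> (tick[0] += 600);  // advances per call

        List<Row> rows = List.of(new Row(5_000, 1), new Row(9_000, 2), new Row(10_600, 3));

        // Each row judged against a different cutoff:
        System.out.println(keepRecentNondet(rows, movingClock).size());  // 1
        // One cutoff for the whole query, wherever the filter runs:
        System.out.println(keepRecent(rows, 10_600 - 30 * 60).size());   // 2
    }
}
```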
[jira] [Commented] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
[ https://issues.apache.org/jira/browse/SPARK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648912#comment-15648912 ] Egor Pahomov commented on SPARK-18336: -- [~srowen], I've read everything in the documentation about the new memory management and tried the following config: {code} .set("spark.executor.memory", "28g") .set("spark.executor.instances", "50") .set("spark.dynamicAllocation.enabled", "false") .set("spark.yarn.executor.memoryOverhead", "3000") .set("spark.executor.cores", "6") .set("spark.driver.memory", "25g") .set("spark.driver.cores", "5") .set("spark.yarn.am.memory", "20g") .set("spark.shuffle.io.numConnectionsPerPeer", "5") .set("spark.sql.autoBroadcastJoinThreshold", "30485760") .set("spark.network.timeout", "4000s") .set("spark.driver.maxResultSize", "5g") .set("spark.sql.parquet.compression.codec", "gzip") .set("spark.kryoserializer.buffer.max", "1200m") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.yarn.driver.memoryOverhead", "1000") .set("spark.memory.storageFraction", "0.2") .set("spark.memory.fraction", "0.8") .set("spark.sql.parquet.cacheMetadata", "false") .set("spark.scheduler.mode", "FIFO") .set("spark.sql.broadcastTimeout", "2") .set("spark.akka.frameSize", "200") .set("spark.sql.shuffle.partitions", partitions) .set("spark.network.timeout", "1000s") {code} Is there a manual for "I'm happy with how it was working before, can I configure my job to be exactly like in the old times"? Can I read somewhere what happened to the Tungsten project, because it seemed like a big deal and now it is turned off by default? > SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2 > > > Key: SPARK-18336 > URL: https://issues.apache.org/jira/browse/SPARK-18336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > I had several(~100) queries, which were run one after another in a single spark > context. I can provide the code of the runner - it's very simple. 
It worked fine on > 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 > on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I > separate my 100 quires to 2 sets and run set after set it works fine. I would > suspect problems with memory on driver, but nothing points to that. > My conf: > {code} > lazy val sparkConfTemplate = new SparkConf() > .setMaster("yarn-client") > .setAppName(appName) > .set("spark.executor.memory", "25g") > .set("spark.executor.instances", "40") > .set("spark.dynamicAllocation.enabled", "false") > .set("spark.yarn.executor.memoryOverhead", "3000") > .set("spark.executor.cores", "6") > .set("spark.driver.memory", "25g") > .set("spark.driver.cores", "5") > .set("spark.yarn.am.memory", "20g") > .set("spark.shuffle.io.numConnectionsPerPeer", "5") > .set("spark.sql.autoBroadcastJoinThreshold", "10") > .set("spark.network.timeout", "4000s") > .set("spark.driver.maxResultSize", "5g") > .set("spark.sql.parquet.compression.codec", "gzip") > .set("spark.kryoserializer.buffer.max", "1200m") > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.yarn.driver.memoryOverhead", "1000") > .set("spark.scheduler.mode", "FIFO") > .set("spark.sql.broadcastTimeout", "2") > .set("spark.akka.frameSize", "200") > .set("spark.sql.shuffle.partitions", partitions) > .set("spark.network.timeout", "1000s") > > .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath())) > {code} > Errors, which started to happen: > {code} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > # Problematic frame: > # V [libjvm.so+0x64bea8] > 
InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88 > # > # Failed to write core dump. Core dumps have been disabled. To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # An error report file with more information is saved as: > # /home/egor/hs_err_pid17479.log > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > # > {code} > {code} > Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap > space > at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44) > at
[jira] [Commented] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
[ https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648900#comment-15648900 ] Luke Miner commented on SPARK-18343: I ran jstack on an executor and on the driver and have attached it. Not sure how to read it myself. > FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write > -- > > Key: SPARK-18343 > URL: https://issues.apache.org/jira/browse/SPARK-18343 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: Spark 2.0.1 > Hadoop 2.7.1 > Mesos 1.0.1 > Ubuntu 14.04 >Reporter: Luke Miner > > I have a driver program where I read data in from Cassandra using > Spark, perform some operations, and then write out to JSON on S3. The program > runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1. > However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and > spark-cassandra-connector 2.0.0-M3, the program completes in the sense that > all the expected files are written to S3, but the program never terminates. > I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. > In both cases I use the default output committer. > From the thread dump (included below) it seems like it could be waiting on: > `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner` > Code snippet: > {code} > // get MongoDB oplog operations > val operations = sc.cassandraTable[JsonOperation](keyspace, namespace) > .where("ts >= ? 
AND ts < ?", minTimestamp, maxTimestamp) > > // replay oplog operations into documents > val documents = operations > .spanBy(op => op.id) > .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) } > .filter { case (id, result) => result.isInstanceOf[Document] } > .map { case (id, document) => MergedDocument(id = id, document = > document > .asInstanceOf[Document]) > } > > // write documents to json on s3 > documents > .map(document => document.toJson) > .coalesce(partitions) > .saveAsTextFile(path, classOf[GzipCodec]) > sc.stop() > {code} > Thread dump on the driver: > {code} > 60 context-cleaner-periodic-gc TIMED_WAITING > 46 dag-scheduler-event-loopWAITING > 4389DestroyJavaVM RUNNABLE > 12 dispatcher-event-loop-0 WAITING > 13 dispatcher-event-loop-1 WAITING > 14 dispatcher-event-loop-2 WAITING > 15 dispatcher-event-loop-3 WAITING > 47 driver-revive-threadTIMED_WAITING > 3 Finalizer WAITING > 82 ForkJoinPool-1-worker-17WAITING > 43 heartbeat-receiver-event-loop-threadTIMED_WAITING > 93 java-sdk-http-connection-reaper TIMED_WAITING > 4387java-sdk-progress-listener-callback-thread WAITING > 25 map-output-dispatcher-0 WAITING > 26 map-output-dispatcher-1 WAITING > 27 map-output-dispatcher-2 WAITING > 28 map-output-dispatcher-3 WAITING > 29 map-output-dispatcher-4 WAITING > 30 map-output-dispatcher-5 WAITING > 31 map-output-dispatcher-6 WAITING > 32 map-output-dispatcher-7 WAITING > 48 MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE > 44 netty-rpc-env-timeout TIMED_WAITING > 92 > org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner > WAITING > 62 pool-19-thread-1TIMED_WAITING > 2 Reference Handler WAITING > 61 Scheduler-1112394071TIMED_WAITING > 20 shuffle-server-0RUNNABLE > 55 shuffle-server-0RUNNABLE > 21 shuffle-server-1RUNNABLE > 56 shuffle-server-1RUNNABLE > 22 shuffle-server-2RUNNABLE > 57 shuffle-server-2RUNNABLE > 23 shuffle-server-3RUNNABLE > 58 shuffle-server-3RUNNABLE > 4 Signal Dispatcher RUNNABLE > 59 Spark Context 
Cleaner TIMED_WAITING > 9 SparkListenerBusWAITING > 35 SparkUI-35-selector-ServerConnectorManager@651d3734/0 RUNNABLE > 36 > SparkUI-36-acceptor-0@467924cb-ServerConnector@3b5eaf92{HTTP/1.1}{0.0.0.0:4040} > RUNNABLE > 37 SparkUI-37-selector-ServerConnectorManager@651d3734/1 RUNNABLE > 38 SparkUI-38 TIMED_WAITING > 39 SparkUI-39 TIMED_WAITING > 40 SparkUI-40 TIMED_WAITING > 41 SparkUI-41 RUNNABLE > 42 SparkUI-42 TIMED_WAITING > 438 task-result-getter-0WAITING > 450 task-result-getter-1WAITING > 489 task-result-getter-2WAITING > 492 task-result-getter-3WAITING > 75 threadDeathWatcher-2-1 TIMED_WAITING > 45 Timer-0 WAITING > {code} > Thread dump on the executors. It's the same on all of them: > {code} > 24 dispatcher-event-loop-0 WAITING > 25 dispatcher-event-loop-1 WAITING > 26
[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements
[ https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1564#comment-1564 ] Michael Armbrust commented on SPARK-17691: -- +1 > Add aggregate function to collect list with maximum number of elements > -- > > Key: SPARK-17691 > URL: https://issues.apache.org/jira/browse/SPARK-17691 > Project: Spark > Issue Type: New Feature >Reporter: Assaf Mendelson >Priority: Minor > > One of the aggregate functions we have today is the collect_list function. > This is a useful tool to do a "catch all" aggregation which doesn't really > fit anywhere else. > The problem with collect_list is that it is unbounded. I would like to see a > means to do a collect_list where we limit the maximum number of elements. > The inputs for this would be the maximum number of elements > to keep and the method of choosing (pick whatever, pick the top N, pick the > bottom N)
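The semantics being requested here, a collect_list capped at N elements per group, can be sketched with plain Scala collections. This is a stand-in illustrating the "pick whatever" strategy, not actual Spark API:

```scala
// Sketch of a bounded collect_list: group values by key and keep at most
// `maxElems` values per key, in encounter order ("pick whatever" strategy).
// Plain Scala collections stand in for the proposed Spark aggregate function.
def collectListBounded[K, V](rows: Seq[(K, V)], maxElems: Int): Map[K, Seq[V]] =
  rows.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).take(maxElems) }

val capped = collectListBounded(Seq("a" -> 1, "a" -> 2, "a" -> 3, "b" -> 4), 2)
// capped("a") == List(1, 2); capped("b") == List(4)
```

The "top N" and "bottom N" strategies would sort each group's values before taking `maxElems`; the key benefit in all three cases is that per-group state stays bounded instead of growing with group size.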
[jira] [Updated] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
[ https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Miner updated SPARK-18343: --- Description: I have a driver program where I read data in from Cassandra using Spark, perform some operations, and then write out to JSON on S3. The program runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1. However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and spark-cassandra-connector 2.0.0-M3, the program completes in the sense that all the expected files are written to S3, but the program never terminates. I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. In both cases I use the default output committer. From the thread dump (included below) it seems like it could be waiting on: `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner` Code snippet: {code} // get MongoDB oplog operations val operations = sc.cassandraTable[JsonOperation](keyspace, namespace) .where("ts >= ? 
AND ts < ?", minTimestamp, maxTimestamp) // replay oplog operations into documents val documents = operations .spanBy(op => op.id) .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) } .filter { case (id, result) => result.isInstanceOf[Document] } .map { case (id, document) => MergedDocument(id = id, document = document .asInstanceOf[Document]) } // write documents to json on s3 documents .map(document => document.toJson) .coalesce(partitions) .saveAsTextFile(path, classOf[GzipCodec]) sc.stop() {code} Thread dump on the driver: {code} 60 context-cleaner-periodic-gc TIMED_WAITING 46 dag-scheduler-event-loopWAITING 4389DestroyJavaVM RUNNABLE 12 dispatcher-event-loop-0 WAITING 13 dispatcher-event-loop-1 WAITING 14 dispatcher-event-loop-2 WAITING 15 dispatcher-event-loop-3 WAITING 47 driver-revive-threadTIMED_WAITING 3 Finalizer WAITING 82 ForkJoinPool-1-worker-17WAITING 43 heartbeat-receiver-event-loop-threadTIMED_WAITING 93 java-sdk-http-connection-reaper TIMED_WAITING 4387java-sdk-progress-listener-callback-thread WAITING 25 map-output-dispatcher-0 WAITING 26 map-output-dispatcher-1 WAITING 27 map-output-dispatcher-2 WAITING 28 map-output-dispatcher-3 WAITING 29 map-output-dispatcher-4 WAITING 30 map-output-dispatcher-5 WAITING 31 map-output-dispatcher-6 WAITING 32 map-output-dispatcher-7 WAITING 48 MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE 44 netty-rpc-env-timeout TIMED_WAITING 92 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner WAITING 62 pool-19-thread-1TIMED_WAITING 2 Reference Handler WAITING 61 Scheduler-1112394071TIMED_WAITING 20 shuffle-server-0RUNNABLE 55 shuffle-server-0RUNNABLE 21 shuffle-server-1RUNNABLE 56 shuffle-server-1RUNNABLE 22 shuffle-server-2RUNNABLE 57 shuffle-server-2RUNNABLE 23 shuffle-server-3RUNNABLE 58 shuffle-server-3RUNNABLE 4 Signal Dispatcher RUNNABLE 59 Spark Context Cleaner TIMED_WAITING 9 SparkListenerBusWAITING 35 SparkUI-35-selector-ServerConnectorManager@651d3734/0 RUNNABLE 36 
SparkUI-36-acceptor-0@467924cb-ServerConnector@3b5eaf92{HTTP/1.1}{0.0.0.0:4040} RUNNABLE 37 SparkUI-37-selector-ServerConnectorManager@651d3734/1 RUNNABLE 38 SparkUI-38 TIMED_WAITING 39 SparkUI-39 TIMED_WAITING 40 SparkUI-40 TIMED_WAITING 41 SparkUI-41 RUNNABLE 42 SparkUI-42 TIMED_WAITING 438 task-result-getter-0WAITING 450 task-result-getter-1WAITING 489 task-result-getter-2WAITING 492 task-result-getter-3WAITING 75 threadDeathWatcher-2-1 TIMED_WAITING 45 Timer-0 WAITING {code} Thread dump on the executors. It's the same on all of them: {code} 24 dispatcher-event-loop-0 WAITING 25 dispatcher-event-loop-1 WAITING 26 dispatcher-event-loop-2 RUNNABLE 27 dispatcher-event-loop-3 WAITING 39 driver-heartbeater TIMED_WAITING 3 Finalizer WAITING 58 java-sdk-http-connection-reaper TIMED_WAITING 75 java-sdk-progress-listener-callback-thread WAITING 1 mainTIMED_WAITING 33 netty-rpc-env-timeout TIMED_WAITING 55 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner WAITING 59 pool-17-thread-1TIMED_WAITING 2 Reference Handler WAITING 28 shuffle-client-0RUNNABLE 35 shuffle-client-0RUNNABLE 41 shuffle-client-0RUNNABLE 37 shuffle-server-0RUNNABLE 5 Signal Dispatcher RUNNABLE 23 threadDeathWatcher-2-1 TIMED_WAITING {code} Jstack of an executor: {code} ubuntu@ip-10-0-230-88:~$
[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Method
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18365: -- Summary: Improve Documentation for Sample Method (was: Documentation for Sampling is Incorrect) > Improve Documentation for Sample Method > --- > > Key: SPARK-18365 > URL: https://issues.apache.org/jira/browse/SPARK-18365 > Project: Spark > Issue Type: Bug >Reporter: Bill Chambers > > The parameter documentation is switched. > PR coming shortly.
[jira] [Assigned] (SPARK-18365) Documentation for Sampling is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18365: Assignee: Apache Spark > Documentation for Sampling is Incorrect > --- > > Key: SPARK-18365 > URL: https://issues.apache.org/jira/browse/SPARK-18365 > Project: Spark > Issue Type: Bug >Reporter: Bill Chambers >Assignee: Apache Spark > > The parameter documentation is switched. > PR coming shortly.
[jira] [Assigned] (SPARK-18365) Documentation for Sampling is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18365: Assignee: (was: Apache Spark) > Documentation for Sampling is Incorrect > --- > > Key: SPARK-18365 > URL: https://issues.apache.org/jira/browse/SPARK-18365 > Project: Spark > Issue Type: Bug >Reporter: Bill Chambers > > The parameter documentation is switched. > PR coming shortly.
[jira] [Commented] (SPARK-18365) Documentation for Sampling is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648845#comment-15648845 ] Apache Spark commented on SPARK-18365: -- User 'anabranch' has created a pull request for this issue: https://github.com/apache/spark/pull/15815 > Documentation for Sampling is Incorrect > --- > > Key: SPARK-18365 > URL: https://issues.apache.org/jira/browse/SPARK-18365 > Project: Spark > Issue Type: Bug >Reporter: Bill Chambers > > The parameter documentation is switched. > PR coming shortly.
[jira] [Created] (SPARK-18365) Documentation for Sampling is Incorrect
Bill Chambers created SPARK-18365: - Summary: Documentation for Sampling is Incorrect Key: SPARK-18365 URL: https://issues.apache.org/jira/browse/SPARK-18365 Project: Spark Issue Type: Bug Reporter: Bill Chambers The parameter documentation is switched. PR coming shortly.
[jira] [Resolved] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`
[ https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18280. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.1.0 2.0.3 > Potential deadlock in `StandaloneSchedulerBackend.dead` > --- > > Key: SPARK-18280 > URL: https://issues.apache.org/jira/browse/SPARK-18280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.3, 2.1.0 > > > "StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not > call "SparkContext.stop" in the same thread. "SparkContext.stop" will block > until all RPC threads exit; if it is called from inside an RPC thread, it will > deadlock.
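The hazard this issue describes, a task blocking on the shutdown of the very thread pool it runs in, can be reproduced with a plain JVM executor. This is an illustration of the general pattern, not Spark's actual code:

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// A task running on a single-thread pool tries to await that same pool's
// termination. Termination requires this very task to finish first, so the
// wait can never succeed; awaitTermination simply times out and returns false.
val pool = Executors.newSingleThreadExecutor()
val sawTermination = pool.submit(new Callable[Boolean] {
  def call(): Boolean = {
    pool.shutdown()
    pool.awaitTermination(100, TimeUnit.MILLISECONDS)
  }
})
println(sawTermination.get()) // false: the pool cannot terminate while this task runs
```

The general remedy, which the issue's fix also follows, is to perform the blocking stop from a separately spawned thread rather than from one of the threads being waited on.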
[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService
[ https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648785#comment-15648785 ] Steven Rand commented on SPARK-18364: - I can implement this if people think it makes sense. > expose metrics for YarnShuffleService > - > > Key: SPARK-18364 > URL: https://issues.apache.org/jira/browse/SPARK-18364 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.0.1 >Reporter: Steven Rand >Priority: Minor > Original Estimate: 336h > Remaining Estimate: 336h > > ExternalShuffleService exposes metrics as of SPARK-16405. However, > YarnShuffleService does not. > The work of instrumenting ExternalShuffleBlockHandler was already done in > SPARK-1645, so this JIRA is for creating a MetricsSystem in > YarnShuffleService similarly to how ExternalShuffleService already does it.
[jira] [Created] (SPARK-18364) expose metrics for YarnShuffleService
Steven Rand created SPARK-18364: --- Summary: expose metrics for YarnShuffleService Key: SPARK-18364 URL: https://issues.apache.org/jira/browse/SPARK-18364 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.0.1 Reporter: Steven Rand Priority: Minor ExternalShuffleService exposes metrics as of SPARK-16405. However, YarnShuffleService does not. The work of instrumenting ExternalShuffleBlockHandler was already done in SPARK-1645, so this JIRA is for creating a MetricsSystem in YarnShuffleService similarly to how ExternalShuffleService already does it.
[jira] [Resolved] (SPARK-16215) Reduce runtime overhead of a program that writes a primitive array in Dataframe/Dataset
[ https://issues.apache.org/jira/browse/SPARK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16215. --- Resolution: Duplicate > Reduce runtime overhead of a program that writes a primitive array in > Dataframe/Dataset > > > Key: SPARK-16215 > URL: https://issues.apache.org/jira/browse/SPARK-16215 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a program that writes a primitive array in Dataframe/Dataset, the > generated code for projection performs null checks and element-based data > copies. We can alleviate the overhead of these operations by generating better > code. > A target program: > {code} > val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF > df.selectExpr("Array(value + 1.1d, value + 2.2d)").collect > {code}