[jira] [Created] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming
Eyal Farago created SPARK-22579: --- Summary: BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming Key: SPARK-22579 URL: https://issues.apache.org/jira/browse/SPARK-22579 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 2.1.0 Reporter: Eyal Farago When an RDD partition is cached on one executor but the task requiring it is running on another executor (process locality ANY), the cached partition is fetched via BlockManager.getRemoteValues, which delegates to BlockManager.getRemoteBytes; both calls are blocking. In my use case I had a 700GB RDD spread over 1000 partitions on a six-node cluster, cached to disk. Rough math shows an average partition size of 700MB. Looking at the Spark UI it was obvious that tasks running with process locality 'ANY' were much slower than local tasks (~40 seconds locally vs. 8-10 minutes remote). I was able to capture thread dumps of executors executing remote tasks and got this stack trace: {quote}Thread ID: 1521, Thread Name: Executor task launch worker-1000, Thread State: WAITING, Thread Locks: Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978) sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) 
scala.concurrent.Await$.result(package.scala:190) org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104) org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582) org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550) org.apache.spark.storage.BlockManager.get(BlockManager.scala:638) org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690) org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) org.apache.spark.rdd.RDD.iterator(RDD.scala:285) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) org.apache.spark.rdd.RDD.iterator(RDD.scala:287) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote} Digging into the code showed that the block manager first fetches all bytes (getRemoteBytes) and then wraps them with a deserialization stream. This has several drawbacks: 1. Blocking: the requesting executor is blocked while the remote executor is serving the block. 2. Potentially large memory footprint on the requesting executor: in my use case, 700MB of raw bytes stored in a ChunkedByteBuffer. 3. Inefficient: the requesting side usually doesn't need all values at once, as it consumes the values via an iterator. 4. Potentially large memory footprint on the serving executor: if the block is cached in deserialized form, the serving executor has to serialize it into a ChunkedByteBuffer (BlockManager.doGetLocalBytes). This is both memory and CPU intensive; the memory footprint can be reduced by using a limited buffer for serialization, 'spilling' to the response stream. 
I suggest improving this by implementing either a full streaming mechanism or some kind of pagination mechanism. In addition, the requesting executor should be able to make progress with the data it already has, blocking only when the local buffer is exhausted and the remote side hasn't yet delivered the next chunk of the stream (or page, in the case of pagination). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
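The proposed streaming approach can be sketched as a lazy iterator over a deserialization stream: the requesting side consumes values one at a time as bytes arrive, instead of first materializing the whole serialized block in memory. The sketch below is a minimal, self-contained illustration using plain java.io serialization and hypothetical class names; Spark's actual serializers and network transport are considerably more involved.

```java
import java.io.*;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Lazily deserializes values from a stream: memory stays bounded by one
// value at a time instead of the whole serialized block.
class StreamingBlockIterator implements Iterator<Object> {
    private final ObjectInputStream in;
    private final int count;   // number of values in the block
    private int read = 0;

    StreamingBlockIterator(InputStream raw, int count) throws IOException {
        this.in = new ObjectInputStream(raw);
        this.count = count;
    }

    @Override public boolean hasNext() { return read < count; }

    @Override public Object next() {
        if (!hasNext()) throw new NoSuchElementException();
        try {
            read++;
            // Blocks only until the bytes of the NEXT value arrive,
            // not until the entire block has been transferred.
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}

public class StreamingFetchDemo {
    public static void main(String[] args) throws IOException {
        // Simulate a remote block containing three serialized strings.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject("a"); out.writeObject("b"); out.writeObject("c");
        }
        Iterator<Object> it =
            new StreamingBlockIterator(new ByteArrayInputStream(buf.toByteArray()), 3);
        StringBuilder sb = new StringBuilder();
        while (it.hasNext()) sb.append(it.next());
        System.out.println(sb);   // abc
    }
}
```

With a bounded buffer feeding the InputStream, the consumer naturally blocks only when the buffer is empty and the remote side hasn't delivered the next chunk, which is exactly the behavior requested above.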
[jira] [Updated] (SPARK-22560) Must create spark session directly to connect to hive
[ https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Mingxuan updated SPARK-22560: - Affects Version/s: 2.2.0 > Must create spark session directly to connect to hive > - > > Key: SPARK-22560 > URL: https://issues.apache.org/jira/browse/SPARK-22560 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Ran Mingxuan > Original Estimate: 168h > Remaining Estimate: 168h > > In a java project I have to use both JavaSparkContext and SparkSession. I > find the order to create them affect hive connection. > I have built a spark job like below: > {code:java} > // wrong code > public void main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SparkSession spark = > SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate(); > spark.sql("show databases").show(); > } > {code} > and with this code spark job will not be able to find hive meta-store even if > it can discover correct warehouse. > I have to use code like below to make things work: > {code:java} > // correct code > public String main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > SparkSession spark = > SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate(); > SparkContext sparkContext = spark.sparkContext(); > JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext); > spark.sql("show databases").show(); > } > {code}
[jira] [Comment Edited] (SPARK-22560) Must create spark session directly to connect to hive
[ https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262093#comment-16262093 ] Ran Mingxuan edited comment on SPARK-22560 at 11/22/17 7:47 AM: I think the problem is introduced by org.apache.spark.sql.internal.SessionState which will take the option of sparkcontext. That means the `enableHiveSupport()` won't work in this case if spark session is built from spark context. {code:java} // enableHiveSupport() will not be able to be got here @InterfaceStability.Unstable @transient lazy val sessionState: SessionState = { parentSessionState .map(_.clone(this)) .getOrElse { val state = SparkSession.instantiateSessionState( SparkSession.sessionStateClassName(sparkContext.conf), self) initialSessionOptions.foreach { case (k, v) => state.conf.setConfString(k, v) } state } } {code} was (Author: ranmx): I think the problem is introduced by org.apache.spark.sql.internal.SessionState which will take the option of sparkcontext. That means the `enableHiveSupport()` won't work in this case if spark session is built from spark context. > Must create spark session directly to connect to hive > - > > Key: SPARK-22560 > URL: https://issues.apache.org/jira/browse/SPARK-22560 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.1.0 >Reporter: Ran Mingxuan > Original Estimate: 168h > Remaining Estimate: 168h > > In a java project I have to use both JavaSparkContext and SparkSession. I > find the order to create them affect hive connection. 
> I have built a spark job like below: > {code:java} > // wrong code > public void main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SparkSession spark = > SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate(); > spark.sql("show databases").show(); > } > {code} > and with this code spark job will not be able to find hive meta-store even if > it can discover correct warehouse. > I have to use code like below to make things work: > {code:java} > // correct code > public String main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > SparkSession spark = > SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate(); > SparkContext sparkContext = spark.sparkContext(); > JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext); > spark.sql("show databases").show(); > } > {code}
[jira] [Commented] (SPARK-22560) Must create spark session directly to connect to hive
[ https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262093#comment-16262093 ] Ran Mingxuan commented on SPARK-22560: -- I think the problem is introduced by org.apache.spark.sql.internal.SessionState which will take the option of sparkcontext. That means the `enableHiveSupport()` won't work in this case if spark session is built from spark context. > Must create spark session directly to connect to hive > - > > Key: SPARK-22560 > URL: https://issues.apache.org/jira/browse/SPARK-22560 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.1.0 >Reporter: Ran Mingxuan > Original Estimate: 168h > Remaining Estimate: 168h > > In a java project I have to use both JavaSparkContext and SparkSession. I > find the order to create them affect hive connection. > I have built a spark job like below: > {code:java} > // wrong code > public void main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SparkSession spark = > SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate(); > spark.sql("show databases").show(); > } > {code} > and with this code spark job will not be able to find hive meta-store even if > it can discover correct warehouse. 
> I have to use code like below to make things work: > {code:java} > // correct code > public String main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > SparkSession spark = > SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate(); > SparkContext sparkContext = spark.sparkContext(); > JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext); > spark.sql("show databases").show(); > } > {code}
[jira] [Created] (SPARK-22578) CSV with quoted line breaks not correctly parsed
Carlos Barahona created SPARK-22578: --- Summary: CSV with quoted line breaks not correctly parsed Key: SPARK-22578 URL: https://issues.apache.org/jira/browse/SPARK-22578 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Carlos Barahona I believe the behavior addressed in SPARK-19610 still exists. Using Spark 2.2.0, when attempting to read in a CSV file containing a quoted newline, the resulting dataset contains two separate rows split along the quoted newline. Example text (note the line break inside the quoted field): {code:java}
4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,"Hello
World", 2,3,4,5
{code} {code}
scala> val csvFile = spark.read.csv("file:///path")
csvFile: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 7 more fields]
scala> csvFile.first()
res2: org.apache.spark.sql.Row = [4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello]
scala> csvFile.count()
res3: Long = 2
{code}
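For reference, RFC 4180-style CSV treats a newline inside a double-quoted field as part of the field, not as a record separator, so the file above should parse as a single record. A minimal hand-rolled record splitter (illustrative only, not Spark's parser) shows the expected behavior:

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRecords {
    // Splits CSV text into records, keeping newlines that occur inside
    // double-quoted fields as part of the record rather than splitting on them.
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : text.toCharArray()) {
            if (c == '"') { inQuotes = !inQuotes; cur.append(c); }
            else if (c == '\n' && !inQuotes) { records.add(cur.toString()); cur.setLength(0); }
            else cur.append(c);
        }
        if (cur.length() > 0) records.add(cur.toString());
        return records;
    }

    public static void main(String[] args) {
        String csv = "a,\"Hello\nWorld\",b\nx,y,z";
        List<String> recs = splitRecords(csv);
        System.out.println(recs.size());   // 2: the quoted newline did not split the first record
        System.out.println(recs.get(0));
    }
}
```

Note that in Spark 2.2 the built-in CSV reader handles embedded newlines only when the multiLine option (added by SPARK-19610) is enabled, e.g. `spark.read.option("multiLine", true).csv(...)`; with the default setting, splitting along the quoted newline as reported here is the expected (if surprising) behavior.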
[jira] [Updated] (SPARK-22543) fix java 64kb compile error for deeply nested expressions
[ https://issues.apache.org/jira/browse/SPARK-22543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22543: Target Version/s: 2.3.0 (was: 2.2.1, 2.3.0) > fix java 64kb compile error for deeply nested expressions > - > > Key: SPARK-22543 > URL: https://issues.apache.org/jira/browse/SPARK-22543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Resolved] (SPARK-22548) Incorrect nested AND expression pushed down to JDBC data source
[ https://issues.apache.org/jira/browse/SPARK-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22548. - Resolution: Fixed Assignee: Jia Li Fix Version/s: 2.3.0 2.2.1 2.1.3 > Incorrect nested AND expression pushed down to JDBC data source > --- > > Key: SPARK-22548 > URL: https://issues.apache.org/jira/browse/SPARK-22548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jia Li >Assignee: Jia Li > Fix For: 2.1.3, 2.2.1, 2.3.0 > > > Let’s say I have a JDBC data source table ‘foobar’ with 3 rows: > NAME THEID > == > fred 1 > mary 2 > joe 'foo' "bar"3 > This query returns incorrect result. > SELECT * FROM foobar WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = > 'fred') > It’s supposed to return: > fred 1 > mary 2 > But it returns > fred 1 > mary 2 > joe 'foo' "bar"3 > This is because one leg of the nested AND predicate, TRIM(NAME) = 'mary’, can > not be pushed down but is lost during JDBC push down filter translation. The > same translation method is also called by Data Source V2. I have a fix for > this issue and will open a PR.
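The bug pattern above is general to filter translation: for the surrounding OR to stay correct, an AND whose legs cannot all be translated must be rejected wholesale; silently dropping the untranslatable leg widens the pushed-down predicate and returns extra rows. A toy translator over hypothetical filter types (not Spark's internal API) illustrates the rule:

```java
import java.util.Optional;

public class PushDown {
    interface Filter {}
    static class And implements Filter { final Filter l, r; And(Filter l, Filter r) { this.l = l; this.r = r; } }
    static class Supported implements Filter { final String sql; Supported(String sql) { this.sql = sql; } }
    static class Unsupported implements Filter {}

    // Correct translation: an AND is pushed down only if BOTH legs translate.
    // Returning empty (rather than the one translatable leg) is what keeps an
    // enclosing OR from matching rows the original predicate excluded.
    static Optional<String> translate(Filter f) {
        if (f instanceof Supported) return Optional.of(((Supported) f).sql);
        if (f instanceof And) {
            And a = (And) f;
            Optional<String> l = translate(a.l), r = translate(a.r);
            if (l.isPresent() && r.isPresent())
                return Optional.of("(" + l.get() + " AND " + r.get() + ")");
            return Optional.empty();  // the fix: never push a partial AND
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // TRIM(NAME) = 'mary' is not translatable, so the whole AND must stay behind.
        Filter bad = new And(new Supported("THEID > 0"), new Unsupported());
        System.out.println(translate(bad).isPresent());  // false
    }
}
```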
[jira] [Updated] (SPARK-22560) Must create spark session directly to connect to hive
[ https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Mingxuan updated SPARK-22560: - Description: In a java project I have to use both JavaSparkContext and SparkSession. I find the order to create them affect hive connection. I have built a spark job like below: {code:java} // wrong code public void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testApp"); JavaSparkContext sc = new JavaSparkContext(sparkConf); SparkSession spark = SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate(); spark.sql("show databases").show(); } {code} and with this code spark job will not be able to find hive meta-store even if it can discover correct warehouse. I have to use code like below to make things work: {code:java} // correct code public String main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testApp"); SparkSession spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate(); SparkContext sparkContext = spark.sparkContext(); JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext); spark.sql("show databases").show(); } {code} was: In a java project I have to use both JavaSparkContext and SparkSession. I find the order to create them affect hive connection. I have built a spark job like below: {code:java} // wrong code public void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testApp"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false); sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", false);// SparkSession spark = SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate(); spark.sql("show databases").show(); } {code} and with this code spark job will not be able to find hive meta-store even if it can discover correct warehouse. 
I have to use code like below to make things work: {code:java} // correct code public String main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testApp"); SparkSession spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate(); SparkContext sparkContext = spark.sparkContext(); JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext); sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false); sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", false); spark.sql("show databases").show(); } {code} > Must create spark session directly to connect to hive > - > > Key: SPARK-22560 > URL: https://issues.apache.org/jira/browse/SPARK-22560 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.1.0 >Reporter: Ran Mingxuan > Original Estimate: 168h > Remaining Estimate: 168h > > In a java project I have to use both JavaSparkContext and SparkSession. I > find the order to create them affect hive connection. > I have built a spark job like below: > {code:java} > // wrong code > public void main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SparkSession spark = > SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate(); > spark.sql("show databases").show(); > } > {code} > and with this code spark job will not be able to find hive meta-store even if > it can discover correct warehouse. 
> I have to use code like below to make things work: > {code:java} > // correct code > public String main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > SparkSession spark = > SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate(); > SparkContext sparkContext = spark.sparkContext(); > JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext); > spark.sql("show databases").show(); > } > {code}
[jira] [Commented] (SPARK-22560) Must create spark session directly to connect to hive
[ https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261887#comment-16261887 ] Ran Mingxuan commented on SPARK-22560: -- [~srowen] Thank you for your reply. Maybe I should declare that the hadoop configurations have no effect on the result and move them away from the code. > Must create spark session directly to connect to hive > - > > Key: SPARK-22560 > URL: https://issues.apache.org/jira/browse/SPARK-22560 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.1.0 >Reporter: Ran Mingxuan > Original Estimate: 168h > Remaining Estimate: 168h > > In a java project I have to use both JavaSparkContext and SparkSession. I > find the order to create them affect hive connection. > I have built a spark job like below: > {code:java} > // wrong code > public void main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > > sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", > false); > sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", > false);// > SparkSession spark = > SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate(); > spark.sql("show databases").show(); > } > {code} > and with this code spark job will not be able to find hive meta-store even if > it can discover correct warehouse. 
> I have to use code like below to make things work: > {code:java} > // correct code > public String main(String[] args) > { > SparkConf sparkConf = new SparkConf().setAppName("testApp"); > SparkSession spark = > SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate(); > SparkContext sparkContext = spark.sparkContext(); > JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext); > > sc.hadoopConfiguration().setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", > false); > sc.hadoopConfiguration().setBoolean("parquet.enable.summary-metadata", > false); > spark.sql("show databases").show(); > } > {code}
[jira] [Commented] (SPARK-22374) STS ran into OOM in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-22374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261875#comment-16261875 ] Dongjoon Hyun commented on SPARK-22374: --- Thank you for advice. I'll consider that. > STS ran into OOM in a secure cluster > > > Key: SPARK-22374 > URL: https://issues.apache.org/jira/browse/SPARK-22374 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > Attachments: 1.png, 2.png, 3.png > > > In a secure cluster, FileSystem.CACHE grows indefinitely. > *ENVIRONMENT* > 1. `spark.yarn.principal` and `spark.yarn.keytab` is used. > 2. Spark Thrift Server run with `doAs` false. > {code} > > hive.server2.enable.doAs > false > > {code} > With 6GB (-Xmx6144m) options, `HiveConf` consumes 4GB inside FileSystem.CACHE. > {code} > 20,030 instances of "org.apache.hadoop.hive.conf.HiveConf", loaded by > "sun.misc.Launcher$AppClassLoader @ 0x64001c160" occupy 4,418,101,352 > (73.42%) bytes. These instances are referenced from one instance of > "java.util.HashMap$Node[]", loaded by "" > {code} > Please see the attached images.
[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or constant pool entry limits
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-22510: - Description: Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant pool entry limit. was: Codegen can throw an exception due to the 64KB JVM bytecode limit. > Exceptions caused by 64KB JVM bytecode or constant pool entry limits > - > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant > pool entry limit.
[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-22510: - Summary: Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit (was: Exceptions caused by 64KB JVM bytecode or constant pool entry limits ) > Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit > > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant > pool entry limit.
[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or constant pool entry limits
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-22510: - Summary: Exceptions caused by 64KB JVM bytecode or constant pool entry limits (was: Exceptions caused by 64KB JVM bytecode limit ) > Exceptions caused by 64KB JVM bytecode or constant pool entry limits > - > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > Codegen can throw an exception due to the 64KB JVM bytecode limit.
[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache
[ https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261808#comment-16261808 ] Oz Ben-Ami commented on SPARK-22575: We are not explicitly caching any tables, only running SQL queries. Is there a way to manually remove tables or partitions from cache? > Making Spark Thrift Server clean up its cache > - > > Key: SPARK-22575 > URL: https://issues.apache.org/jira/browse/SPARK-22575 > Project: Spark > Issue Type: Improvement > Components: Block Manager, SQL >Affects Versions: 2.2.0 >Reporter: Oz Ben-Ami >Priority: Minor > Labels: cache, dataproc, thrift, yarn > > Currently, Spark Thrift Server accumulates data in its appcache, even for old > queries. This fills up the disk (using over 100GB per worker node) within > days, and the only way to clear it is to restart the Thrift Server > application. Even deleting the files directly isn't a solution, as Spark then > complains about FileNotFound. > I asked about this on [Stack > Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache] > a few weeks ago, but it does not seem to be currently doable by > configuration. > Am I missing some configuration option, or some other factor here? > Otherwise, can anyone point me to the code that handles this, so maybe I can > try my hand at a fix? > Thanks!
[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache
[ https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261557#comment-16261557 ] Marco Gaido commented on SPARK-22575: - does it happen because you are caching some tables and never uncaching them? > Making Spark Thrift Server clean up its cache > - > > Key: SPARK-22575 > URL: https://issues.apache.org/jira/browse/SPARK-22575 > Project: Spark > Issue Type: Improvement > Components: Block Manager, SQL >Affects Versions: 2.2.0 >Reporter: Oz Ben-Ami >Priority: Minor > Labels: cache, dataproc, thrift, yarn > > Currently, Spark Thrift Server accumulates data in its appcache, even for old > queries. This fills up the disk (using over 100GB per worker node) within > days, and the only way to clear it is to restart the Thrift Server > application. Even deleting the files directly isn't a solution, as Spark then > complains about FileNotFound. > I asked about this on [Stack > Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache] > a few weeks ago, but it does not seem to be currently doable by > configuration. > Am I missing some configuration option, or some other factor here? > Otherwise, can anyone point me to the code that handles this, so maybe I can > try my hand at a fix? > Thanks!
[jira] [Resolved] (SPARK-22500) 64KB JVM bytecode limit problem with cast
[ https://issues.apache.org/jira/browse/SPARK-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22500. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 19730 [https://github.com/apache/spark/pull/19730] > 64KB JVM bytecode limit problem with cast > - > > Key: SPARK-22500 > URL: https://issues.apache.org/jira/browse/SPARK-22500 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{Cast}} can throw an exception due to the 64KB JVM bytecode limit when they > use with a lot of structure fields
[jira] [Assigned] (SPARK-22500) 64KB JVM bytecode limit problem with cast
[ https://issues.apache.org/jira/browse/SPARK-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22500: --- Assignee: Kazuaki Ishizaki > 64KB JVM bytecode limit problem with cast > - > > Key: SPARK-22500 > URL: https://issues.apache.org/jira/browse/SPARK-22500 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{Cast}} can throw an exception due to the 64KB JVM bytecode limit when they > use with a lot of structure fields
[jira] [Commented] (SPARK-22374) STS ran into OOM in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-22374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261444#comment-16261444 ] Steve Loughran commented on SPARK-22374: We need to do something about this, it is dangerously recurrent. Even though the fix is {{closeAllForUGI}}, we should be able to track it better and start warning early and meaningfully. For example # record timestamps of creation # track total cache size as an instrumented value # start warning when it gets big > STS ran into OOM in a secure cluster > > > Key: SPARK-22374 > URL: https://issues.apache.org/jira/browse/SPARK-22374 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > Attachments: 1.png, 2.png, 3.png > > > In a secure cluster, FileSystem.CACHE grows indefinitely. > *ENVIRONMENT* > 1. `spark.yarn.principal` and `spark.yarn.keytab` is used. > 2. Spark Thrift Server run with `doAs` false. > {code} > > hive.server2.enable.doAs > false > > {code} > With 6GB (-Xmx6144m) options, `HiveConf` consumes 4GB inside FileSystem.CACHE. > {code} > 20,030 instances of "org.apache.hadoop.hive.conf.HiveConf", loaded by > "sun.misc.Launcher$AppClassLoader @ 0x64001c160" occupy 4,418,101,352 > (73.42%) bytes. These instances are referenced from one instance of > "java.util.HashMap$Node[]", loaded by "" > {code} > Please see the attached images.
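The early-warning idea sketched in the comment above could be a thin instrumentation layer around the cache: record a creation timestamp per entry, expose the size as a metric, and log once a threshold is crossed. A generic sketch with hypothetical names (not Hadoop's actual FileSystem.CACHE code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks entry creation times and warns when the cache grows past a threshold,
// so unbounded growth is visible long before an OOM.
class InstrumentedCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Map<K, Long> createdAt = new ConcurrentHashMap<>();
    private final int warnThreshold;

    InstrumentedCache(int warnThreshold) { this.warnThreshold = warnThreshold; }

    void put(K key, V value) {
        entries.put(key, value);
        createdAt.putIfAbsent(key, System.currentTimeMillis());  // timestamp of creation
        if (entries.size() > warnThreshold) {
            // In a real system this would go to the logger / metrics sink.
            System.err.println("WARN: cache size " + entries.size()
                + " exceeds threshold " + warnThreshold
                + "; oldest entries may indicate a leak");
        }
    }

    int size() { return entries.size(); }               // instrumented value
    Long creationTime(K key) { return createdAt.get(key); }
}
```

The per-entry timestamps make it possible to report not just "the cache is big" but how old the oldest entries are, which is what distinguishes a leak from legitimate churn.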
[jira] [Assigned] (SPARK-22475) show histogram in DESC COLUMN command
[ https://issues.apache.org/jira/browse/SPARK-22475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22475: --- Assignee: Marco Gaido > show histogram in DESC COLUMN command > - > > Key: SPARK-22475 > URL: https://issues.apache.org/jira/browse/SPARK-22475 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Marco Gaido > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22475) show histogram in DESC COLUMN command
[ https://issues.apache.org/jira/browse/SPARK-22475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22475. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19774 [https://github.com/apache/spark/pull/19774] > show histogram in DESC COLUMN command > - > > Key: SPARK-22475 > URL: https://issues.apache.org/jira/browse/SPARK-22475 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative
[ https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22576. --- Resolution: Not A Problem Yes, there's nothing about HiveQL (which Spark follows) or Spark SQL that suggests -1 is a valid start position. It is 1-based, even. I don't even see that engines like MySQL support it. https://www.w3schools.com/sql/func_mysql_locate.asp > Spark SQL locate returns incorrect value when the start position is negative > > > Key: SPARK-22576 > URL: https://issues.apache.org/jira/browse/SPARK-22576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yuxin Cao > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22165) Type conflicts between dates, timestamps and date in partition column
[ https://issues.apache.org/jira/browse/SPARK-22165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22165. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19389 [https://github.com/apache/spark/pull/19389] > Type conflicts between dates, timestamps and date in partition column > - > > Key: SPARK-22165 > URL: https://issues.apache.org/jira/browse/SPARK-22165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > It looks we have some bugs when resolving type conflicts in partition column. > I found few corner cases as below: > Case 1: timestamp should be inferred but date type is inferred. > {code} > val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts") > df.write.format("parquet").partitionBy("ts").save("/tmp/foo") > spark.read.load("/tmp/foo").printSchema() > {code} > {code} > root > |-- i: integer (nullable = true) > |-- ts: date (nullable = true) > {code} > Case 2: decimal should be inferred but integer is inferred. > {code} > val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal") > df.write.format("parquet").partitionBy("decimal").save("/tmp/bar") > spark.read.load("/tmp/bar").printSchema() > {code} > {code} > root > |-- i: integer (nullable = true) > |-- decimal: integer (nullable = true) > {code} > Looks we should de-duplicate type resolution logic if possible rather than > separate numeric precedence-like comparison alone. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
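The de-duplication suggested above amounts to routing every pair of observed partition-value types through one widening function. A minimal sketch of the *desired* behavior for the two cases (simplified type names; this is an illustration, not Spark's actual inference code):

```python
# Widening chains: within a family, a later type can represent every value
# of an earlier one; across families, fall back to string.
NUMERIC = ["int", "long", "decimal", "double"]
DATETIME = ["date", "timestamp"]

def widen(t1, t2):
    """Resolve a partition-column type conflict to the wider type."""
    if t1 == t2:
        return t1
    for chain in (NUMERIC, DATETIME):
        if t1 in chain and t2 in chain:
            return max(t1, t2, key=chain.index)
    return "string"   # incompatible families fall back to string

# Case 1 above: a date-only value plus a full timestamp should widen to
# timestamp; Case 2: an int-range value plus a 30-digit value to decimal.
```

Under this scheme both reported cases resolve correctly, and there is only one precedence comparison to maintain.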
[jira] [Assigned] (SPARK-22165) Type conflicts between dates, timestamps and date in partition column
[ https://issues.apache.org/jira/browse/SPARK-22165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22165: --- Assignee: Hyukjin Kwon > Type conflicts between dates, timestamps and date in partition column > - > > Key: SPARK-22165 > URL: https://issues.apache.org/jira/browse/SPARK-22165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > It looks we have some bugs when resolving type conflicts in partition column. > I found few corner cases as below: > Case 1: timestamp should be inferred but date type is inferred. > {code} > val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts") > df.write.format("parquet").partitionBy("ts").save("/tmp/foo") > spark.read.load("/tmp/foo").printSchema() > {code} > {code} > root > |-- i: integer (nullable = true) > |-- ts: date (nullable = true) > {code} > Case 2: decimal should be inferred but integer is inferred. > {code} > val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal") > df.write.format("parquet").partitionBy("decimal").save("/tmp/bar") > spark.read.load("/tmp/bar").printSchema() > {code} > {code} > root > |-- i: integer (nullable = true) > |-- decimal: integer (nullable = true) > {code} > Looks we should de-duplicate type resolution logic if possible rather than > separate numeric precedence-like comparison alone. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261393#comment-16261393 ] Bago Amirbekian commented on SPARK-20586: - Also a follow-up question: are there performance implications to using the `deterministic` flag that we should try to avoid by restructuring the ml UDFs so they don't raise within the UDF? > Add deterministic to ScalaUDF > - > > Key: SPARK-20586 > URL: https://issues.apache.org/jira/browse/SPARK-20586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html > Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF > too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we > only add the following two flags. > - deterministic: Certain optimizations should not be applied if UDF is not > deterministic. Deterministic UDF returns same result each time it is invoked > with a particular input. This determinism just needs to hold within the > context of a query. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261377#comment-16261377 ] Bago Amirbekian commented on SPARK-20586: - Is there some documentation somewhere about the right way to use this `deterministic` flag? I bring it up because in `spark.ml` we sometimes raise errors in a udf when a dataFrame contains invalid data. This can cause bad behavior if the optimizer re-orders the udf with other operations, so we’re marking these udfs as `nonDeterministic` (https://github.com/apache/spark/pull/19662), but that somehow seems wrong. The issue isn’t that the udfs are non-deterministic; they are deterministic and always raise on the same inputs. I guess my questions are: 1) Is this correct usage of the non-deterministic flag, or are we simply abusing it when we should come up with a more specific solution? 2) If this is correct usage, could we rename the flag or document this type of usage somewhere? > Add deterministic to ScalaUDF > - > > Key: SPARK-20586 > URL: https://issues.apache.org/jira/browse/SPARK-20586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html > Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF > too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we > only add the following two flags. > - deterministic: Certain optimizations should not be applied if UDF is not > deterministic. Deterministic UDF returns same result each time it is invoked > with a particular input. This determinism just needs to hold within the > context of a query. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
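The hazard described in the comment above — the deterministic flag acting as a reorder barrier for a UDF that raises on bad input — can be simulated with a toy optimizer. All names here are hypothetical; this is not Spark's optimizer, just an illustration of why the flag matters:

```python
def optimize(plan):
    """Toy rule: swap an adjacent (filter, udf) pair so the udf runs first
    (e.g. to collapse projections) -- unless the udf is non-deterministic."""
    plan = list(plan)
    for i in range(len(plan) - 1):
        a, b = plan[i], plan[i + 1]
        if a["kind"] == "filter" and b["kind"] == "udf" and b["deterministic"]:
            plan[i], plan[i + 1] = b, a
    return plan

def run(plan, rows):
    for op in plan:
        if op["kind"] == "filter":
            rows = [r for r in rows if op["fn"](r)]
        else:
            rows = [op["fn"](r) for r in rows]
    return rows

def strict_udf(x):
    if x < 0:
        raise ValueError("invalid input")   # ml-style raise on bad data
    return x * 2

def plan(deterministic):
    return [
        {"kind": "filter", "fn": lambda x: x >= 0},  # user filters bad rows
        {"kind": "udf", "fn": strict_udf, "deterministic": deterministic},
    ]

safe = run(optimize(plan(deterministic=False)), [3, -1, 5])
try:
    run(optimize(plan(deterministic=True)), [3, -1, 5])
    reordered_crash = False
except ValueError:
    reordered_crash = True
```

With the flag left as deterministic, the toy optimizer moves the udf ahead of the user's filter and the job crashes on the invalid row; marked non-deterministic, the filter stays in front. This is exactly the behavior-preservation argument, even though the udf itself is a pure function of its input.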
[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative
[ https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261347#comment-16261347 ] Marco Gaido commented on SPARK-22576: - I see, but this is SAP Sysbase. Why do you think Spark should behave like that? Also Hive behaves like Spark, thus I don't see why Spark would be supposed to work differently. > Spark SQL locate returns incorrect value when the start position is negative > > > Key: SPARK-22576 > URL: https://issues.apache.org/jira/browse/SPARK-22576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yuxin Cao > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative
[ https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261313#comment-16261313 ] Yuxin Cao commented on SPARK-22576: --- http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1600/doc/html/san1278453073929.html The character position at which to begin the search in the string. The first character is position 1. If the starting offset is negative, LOCATE returns the last matching string offset, rather than the first. A negative offset indicates how much of the end of the string to exclude from the search. The number of bytes excluded is calculated as ( -1 * offset ) - 1. > Spark SQL locate returns incorrect value when the start position is negative > > > Key: SPARK-22576 > URL: https://issues.apache.org/jira/browse/SPARK-22576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yuxin Cao > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
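A literal reading of the Sybase documentation quoted above can be sketched as follows. This describes the Sybase semantics only, not what Spark or Hive implement:

```python
def sybase_style_locate(substr, s, start=1):
    """1-based LOCATE per the Sybase doc quoted above.

    Positive start: first match at or after position `start`.
    Negative start: exclude (-1 * start) - 1 trailing characters from the
    search and return the LAST match. Returns 0 when there is no match.
    """
    if start >= 0:
        return s.find(substr, max(start, 1) - 1) + 1
    excluded = (-1 * start) - 1              # chars cut off the end
    searchable = s[:len(s) - excluded] if excluded else s
    return searchable.rfind(substr) + 1

# locate('a', 'banana', 1)  -> 2  (first match)
# locate('a', 'banana', -1) -> 6  (last match, nothing excluded)
# locate('a', 'banana', -3) -> 4  (last match within 'bana')
```

Whether Spark should adopt these semantics is exactly the question debated in the comments above; the sketch just makes the proposed behavior concrete.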
[jira] [Created] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting
Thomas Graves created SPARK-22577: - Summary: executor page blacklist status should update with TaskSet level blacklisting Key: SPARK-22577 URL: https://issues.apache.org/jira/browse/SPARK-22577 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.1 Reporter: Thomas Graves right now the executor blacklist status only updates with the BlacklistTracker after a task set has finished and propagated the blacklisting to the application level. We should change that to show at the taskset level as well. Without this it can be very confusing to the user why things aren't running. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3
[ https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261292#comment-16261292 ] Steve Loughran commented on SPARK-22526: S3a uses the AWS S3 client, which uses httpclient inside. S3AInputStream absolutely closes that stream on close(); it does a lot of work to use abort when it's considered more expensive to read to the end of the GET. I wouldn't jump to blame the httpclient. However, the AWS S3 client does HTTP connection pooling and recycling; the pool size is set in {{fs.s3a.connection.maximum}}, default 15. There might be some blocking waiting for free connections, so try a larger value. Now, what's needed to track things down? You get to do it yourself, at least for now, as you are the only person reporting this. I'd go for * getting the thread dump as things block & see what the threads are up to * use "netstat -p tcp" to see what network connections are live * logging {{org.apache.hadoop.fs.s3a}} and seeing what it says. Before doing any of that: move off 2.7 to using the Hadoop 2.8 binaries; Hadoop 2.8 has a lot of performance and functionality improvements which will never be backported to 2.7.x, including updated AWS SDK libraries. If you do find a problem in the s3a client/AWS SDK, the response to a HADOOP JIRA will be "does it go away if you upgrade?". Save time by doing that before anything else. > Spark hangs while reading binary files from S3 > -- > > Key: SPARK-22526 > URL: https://issues.apache.org/jira/browse/SPARK-22526 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: mohamed imran > Original Estimate: 168h > Remaining Estimate: 168h > > Hi, > I am using Spark 2.2.0(recent version) to read binary files from S3. I use > sc.binaryfiles to read the files.
> It works fine for about the first 100 file reads, but then it hangs > indefinitely, from 5 up to 40 minutes, like the Avro file read issue (which was fixed in > later releases). > I tried setting fs.s3a.connection.maximum to some larger values, but that > didn't help. > Finally I ended up enabling the Spark speculation settings, which > again didn't help much. > One thing I observed is that it is not closing the connection after > every read of binary files from S3. > example :- sc.binaryFiles("s3a://test/test123.zip") > Please look into this major issue! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
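The pool-exhaustion hypothesis from the comment above can be reproduced in miniature: with a bounded pool the size of the {{fs.s3a.connection.maximum}} default and streams that are never closed, the 16th open blocks. The class below is a hypothetical stand-in (the blocking wait is surfaced as a short timeout so the demo terminates):

```python
import threading

POOL_SIZE = 15                       # fs.s3a.connection.maximum default
_pool = threading.BoundedSemaphore(POOL_SIZE)

class PooledStream:
    """Toy stand-in for an object stream backed by a pooled HTTP connection."""

    def __init__(self):
        if not _pool.acquire(timeout=0.1):
            raise TimeoutError("no free connection in pool")
        self._closed = False

    def close(self):
        if not self._closed:
            self._closed = True
            _pool.release()          # returning the connection is essential

# A reader that never closes its streams exhausts the pool ...
leaked = [PooledStream() for _ in range(POOL_SIZE)]
try:
    PooledStream()                   # 16th open: would hang in real life
    exhausted = False
except TimeoutError:
    exhausted = True

# ... while closing each stream keeps the pool healthy.
for s in leaked:
    s.close()
PooledStream().close()
```

This matches the reporter's observation that connections are not being closed after each binary-file read: nothing fails until the pool is drained, and then everything appears to hang.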
[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative
[ https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261293#comment-16261293 ] Marco Gaido commented on SPARK-22576: - why do you expect locate to work like this and not as it is working now? > Spark SQL locate returns incorrect value when the start position is negative > > > Key: SPARK-22576 > URL: https://issues.apache.org/jira/browse/SPARK-22576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yuxin Cao > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API
[ https://issues.apache.org/jira/browse/SPARK-22521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-22521: --- Assignee: Weichen Xu > VectorIndexerModel support handle unseen categories via handleInvalid: Python > API > - > > Key: SPARK-22521 > URL: https://issues.apache.org/jira/browse/SPARK-22521 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu > Fix For: 2.3.0 > > > VectorIndexerModel support handle unseen categories via handleInvalid: Python > API -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22521) VectorIndexerModel support handle unseen categories via handleInvalid: Python API
[ https://issues.apache.org/jira/browse/SPARK-22521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-22521. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19753 [https://github.com/apache/spark/pull/19753] > VectorIndexerModel support handle unseen categories via handleInvalid: Python > API > - > > Key: SPARK-22521 > URL: https://issues.apache.org/jira/browse/SPARK-22521 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu > Fix For: 2.3.0 > > > VectorIndexerModel support handle unseen categories via handleInvalid: Python > API -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative
[ https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuxin Cao updated SPARK-22576: -- Summary: Spark SQL locate returns incorrect value when the start position is negative (was: Spark SQL locate returns incorrect value when position is negative) > Spark SQL locate returns incorrect value when the start position is negative > > > Key: SPARK-22576 > URL: https://issues.apache.org/jira/browse/SPARK-22576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yuxin Cao > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22576) Spark SQL locate returns incorrect value when position is negative
Yuxin Cao created SPARK-22576: - Summary: Spark SQL locate returns incorrect value when position is negative Key: SPARK-22576 URL: https://issues.apache.org/jira/browse/SPARK-22576 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Yuxin Cao -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22575) Making Spark Thrift Server clean up its cache
Oz Ben-Ami created SPARK-22575: -- Summary: Making Spark Thrift Server clean up its cache Key: SPARK-22575 URL: https://issues.apache.org/jira/browse/SPARK-22575 Project: Spark Issue Type: Improvement Components: Block Manager, SQL Affects Versions: 2.2.0 Reporter: Oz Ben-Ami Priority: Minor Currently, Spark Thrift Server accumulates data in its appcache, even for old queries. This fills up the disk (using over 100GB per worker node) within days, and the only way to clear it is to restart the Thrift Server application. Even deleting the files directly isn't a solution, as Spark then complains about FileNotFound. I asked about this on [Stack Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache] a few weeks ago, but it does not seem to be currently doable by configuration. Am I missing some configuration option, or some other factor here? Otherwise, can anyone point me to the code that handles this, so maybe I can try my hand at a fix? Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
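For reference, the stopgap most deployments reach for is an external TTL-based sweep of the cache directory — with the caveat, noted in the report itself, that deleting files under a live application can trigger FileNotFound errors. So this is only a sketch of the idea against a scratch directory, not a recommendation:

```python
import os
import pathlib
import tempfile
import time

def sweep_old_files(root, max_age_seconds):
    """Delete regular files under `root` whose mtime is older than the TTL.
    Returns the removed paths."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for p in pathlib.Path(root).rglob("*"):
        if p.is_file() and p.stat().st_mtime < cutoff:
            p.unlink()
            removed.append(str(p))
    return removed

# demo against a scratch directory standing in for the appcache
root = tempfile.mkdtemp()
old = pathlib.Path(root, "old-query.dat")
old.write_bytes(b"stale cached block")
os.utime(old, (time.time() - 3600,) * 2)    # backdate one hour
new = pathlib.Path(root, "new-query.dat")
new.write_bytes(b"recent cached block")
removed = sweep_old_files(root, max_age_seconds=600)
```

A real fix would have to come from inside the Thrift Server, which knows which cached blocks are still referenced; an external sweep cannot.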
[jira] [Commented] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260954#comment-16260954 ] Apache Spark commented on SPARK-22574: -- User 'Gschiavon' has created a pull request for this issue: https://github.com/apache/spark/pull/19793 > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 2.0.0, 2.1.0, 2.2.0 > > > When submitting a wrong _CreateSubmissionRequest_ to Spark Dispatcher is > causing a bad state of Dispatcher and making it inactive as a mesos framework. > The class CreateSubmissionRequest initialise its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > There are some checks of this variable but not in all of them, for example in > appArgs and environmentVariables. > If you don't set _appArgs_ it will cause the following error: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > Because it's trying to access to it without checking whether is null or not. > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22574: Assignee: Apache Spark > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0, 2.1.0, 2.2.0 > > > When submitting a wrong _CreateSubmissionRequest_ to Spark Dispatcher is > causing a bad state of Dispatcher and making it inactive as a mesos framework. > The class CreateSubmissionRequest initialise its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > There are some checks of this variable but not in all of them, for example in > appArgs and environmentVariables. > If you don't set _appArgs_ it will cause the following error: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > Because it's trying to access to it without checking whether is null or not. > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
[ https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22574: Assignee: (was: Apache Spark) > Wrong request causing Spark Dispatcher going inactive > - > > Key: SPARK-22574 > URL: https://issues.apache.org/jira/browse/SPARK-22574 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Submit >Affects Versions: 2.2.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 2.0.0, 2.1.0, 2.2.0 > > > When submitting a wrong _CreateSubmissionRequest_ to Spark Dispatcher is > causing a bad state of Dispatcher and making it inactive as a mesos framework. > The class CreateSubmissionRequest initialise its arguments to null as follows: > {code:title=CreateSubmissionRequest.scala|borderStyle=solid} > var appResource: String = null > var mainClass: String = null > var appArgs: Array[String] = null > var sparkProperties: Map[String, String] = null > var environmentVariables: Map[String, String] = null > {code} > There are some checks of this variable but not in all of them, for example in > appArgs and environmentVariables. > If you don't set _appArgs_ it will cause the following error: > {code:title=error|borderStyle=solid} > 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers. 
> Exception in thread "Thread-22" java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) > {code} > Because it's trying to access to it without checking whether is null or not. > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive
German Schiavon Matteo created SPARK-22574: -- Summary: Wrong request causing Spark Dispatcher going inactive Key: SPARK-22574 URL: https://issues.apache.org/jira/browse/SPARK-22574 Project: Spark Issue Type: Bug Components: Mesos, Spark Submit Affects Versions: 2.2.0 Reporter: German Schiavon Matteo Priority: Minor Fix For: 2.2.0, 2.1.0, 2.0.0 Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the Dispatcher in a bad state, making it inactive as a Mesos framework. The class CreateSubmissionRequest initialises its arguments to null as follows: {code:title=CreateSubmissionRequest.scala|borderStyle=solid} var appResource: String = null var mainClass: String = null var appArgs: Array[String] = null var sparkProperties: Map[String, String] = null var environmentVariables: Map[String, String] = null {code} Some of these variables are null-checked, but not all of them — appArgs and environmentVariables, for example, are not. If _appArgs_ is not set, it causes the following error: {code:title=error|borderStyle=solid} 17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
Exception in thread "Thread-22" java.lang.NullPointerException at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) {code} This happens because the scheduler accesses the field without checking whether it is null. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
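The fix direction the report implies — reject a request with null fields before it ever reaches the scheduler — can be sketched generically. Field names are taken from the snippet above; the validation helper itself is hypothetical, not Spark code:

```python
REQUIRED_FIELDS = (
    "appResource", "mainClass", "appArgs",
    "sparkProperties", "environmentVariables",
)

def validate_submission(request):
    """Fail fast on a CreateSubmissionRequest-like dict with null fields,
    instead of letting them surface later as an NPE inside the scheduler."""
    missing = [f for f in REQUIRED_FIELDS if request.get(f) is None]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return request

ok = validate_submission({
    "appResource": "hdfs:///apps/app.jar", "mainClass": "Main",
    "appArgs": [], "sparkProperties": {}, "environmentVariables": {},
})
try:
    validate_submission({"appResource": "hdfs:///apps/app.jar"})
    rejected = False
except ValueError:
    rejected = True
```

Rejecting the submission at the REST boundary returns an error to the client that sent the bad request, instead of killing a scheduler thread shared by every framework user.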
[jira] [Commented] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260924#comment-16260924 ] Li Jin commented on SPARK-22323: I believe this is done. (At least my original intention for this). The design doc will be tracked in the link [~hyukjin.kwon] posted. If we want to make this into official pyspark doc, let's revisit. > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22323) Design doc for different types of pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin resolved SPARK-22323. Resolution: Fixed > Design doc for different types of pandas_udf > > > Key: SPARK-22323 > URL: https://issues.apache.org/jira/browse/SPARK-22323 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260904#comment-16260904 ] Marco Gaido commented on SPARK-22516: - [~crkumaresh24] I can't reproduce the issue with the new file you have uploaded. I am running on OS X; maybe it depends on the OS:
{code}
scala> val a = spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "c").option("parserLib", "univocity").csv("/Users/mgaido/Downloads/test_file_without_eof_char.csv")
a: org.apache.spark.sql.DataFrame = [abc: string, def: string]

scala> a.show
+---+---+
|abc|def|
+---+---+
|ghi|jkl|
+---+---+
{code}
> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as > last line's first character > - > > Key: SPARK-22516 > URL: https://issues.apache.org/jira/browse/SPARK-22516 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Kumaresh C R >Priority: Minor > Labels: csvparser > Attachments: testCommentChar.csv, test_file_without_eof_char.csv > > > Try to read attached CSV file with following parse properties, > scala> *val csvFile = > spark.read.option("header","true").option("inferSchema", > "true").option("parserLib", "univocity").option("comment", > "c").csv("hdfs://localhost:8020/test > CommentChar.csv"); * > > > csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string] > > > > > > scala> csvFile.show > > > +---+---+ > > > | a| b| > > > +---+---+ > > > +---+---+ > {color:#8eb021}*Noticed that it works fine.*{color} > If we add an option "multiLine" = "true", it fails with below exception.
This > happens only if we pass "comment" == input dataset's last line's first > character > scala> val csvFile = > *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema", > "true").option("parserLib", "univocity").option("comment", > "c").csv("hdfs://localhost:8020/testCommentChar.csv");* > 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8) > com.univocity.parsers.common.TextParsingException: > java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End > of input reached > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Empty value=null > Escape unquoted values=false > Header extraction enabled=null > Headers=null > Ignore leading whitespaces=false > Ignore trailing whitespaces=false > Input buffer size=128 > Input reading on separate thread=false > Keep escape sequences=false > Keep quotes=false > Length of content displayed on error=-1 > Line separator detection enabled=false > Maximum number of characters per column=-1 > Maximum number of columns=20480 > Normalize escaped line separators=true > Null value= > Number of records to read=all > Processor=none > Restricting data in exceptions=false > RowProcessor error handler=null > Selected fields=none > Skip empty lines=true > Unescaped quote handling=STOP_AT_DELIMITERFormat configuration: > CsvFormat: > Comment character=c > Field delimiter=, > Line separator (normalized)=\n > Line separator sequence=\r\n > Quote character=" >
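For readers unfamiliar with the `comment` option being exercised here: any line whose first character equals the comment character is skipped before CSV parsing. The semantics can be modeled outside Spark with plain Python (this is a simplified sketch, not Univocity's implementation; the file contents are illustrative):

```python
import csv
import io


def read_csv_skipping_comments(text: str, comment: str = "c"):
    # Drop every line whose first character is the comment marker,
    # then parse the remainder as ordinary CSV with a header row.
    kept = [ln for ln in text.splitlines() if not ln.startswith(comment)]
    return list(csv.DictReader(io.StringIO("\n".join(kept))))


# The last line begins with the comment character and has no trailing
# end-of-line character -- the exact case this bug report exercises.
data = "abc,def\nghi,jkl\ncomment on the final line"
rows = read_csv_skipping_comments(data)
```

A correct parser skips the trailing comment line rather than attempting to skip past end of input, which is what the `Unable to skip 1 lines from line 2. End of input reached` error above suggests Univocity did in multiLine mode.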
[jira] [Assigned] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion
[ https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22566: Assignee: (was: Apache Spark) > Better error message for `_merge_type` in Pandas to Spark DF conversion > --- > > Key: SPARK-22566 > URL: https://issues.apache.org/jira/browse/SPARK-22566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Guilherme Berger >Priority: Minor > > When creating a Spark DF from a Pandas DF without specifying a schema, schema > inference is used. This inference can fail when a column contains values of > two different types; this is ok. The problem is the error message does not > tell us in which column this happened. > When this happens, it is painful to debug since the error message is too > vague. > I plan on submitting a PR which fixes this, providing a better error message > for such cases, containing the column name (and possibly the problematic > values too). > >>> spark_session.createDataFrame(pandas_df) > File "redacted/pyspark/sql/session.py", line 541, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal > struct = self._inferSchemaFromList(data) > File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList > schema = reduce(_merge_type, map(_infer_schema, data)) > File "redacted/pyspark/sql/types.py", line 1124, in _merge_type > for f in a.fields] > File "redacted/pyspark/sql/types.py", line 1118, in _merge_type > raise TypeError("Can not merge type %s and %s" % (type(a), type(b))) > TypeError: Can not merge type and 'pyspark.sql.types.StringType'> > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion
[ https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22566: Assignee: Apache Spark > Better error message for `_merge_type` in Pandas to Spark DF conversion > --- > > Key: SPARK-22566 > URL: https://issues.apache.org/jira/browse/SPARK-22566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Guilherme Berger >Assignee: Apache Spark >Priority: Minor > > When creating a Spark DF from a Pandas DF without specifying a schema, schema > inference is used. This inference can fail when a column contains values of > two different types; this is ok. The problem is the error message does not > tell us in which column this happened. > When this happens, it is painful to debug since the error message is too > vague. > I plan on submitting a PR which fixes this, providing a better error message > for such cases, containing the column name (and possibly the problematic > values too). > >>> spark_session.createDataFrame(pandas_df) > File "redacted/pyspark/sql/session.py", line 541, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal > struct = self._inferSchemaFromList(data) > File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList > schema = reduce(_merge_type, map(_infer_schema, data)) > File "redacted/pyspark/sql/types.py", line 1124, in _merge_type > for f in a.fields] > File "redacted/pyspark/sql/types.py", line 1118, in _merge_type > raise TypeError("Can not merge type %s and %s" % (type(a), type(b))) > TypeError: Can not merge type and 'pyspark.sql.types.StringType'> > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion
[ https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260889#comment-16260889 ] Apache Spark commented on SPARK-22566: -- User 'gberger' has created a pull request for this issue: https://github.com/apache/spark/pull/19792 > Better error message for `_merge_type` in Pandas to Spark DF conversion > --- > > Key: SPARK-22566 > URL: https://issues.apache.org/jira/browse/SPARK-22566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Guilherme Berger >Priority: Minor > > When creating a Spark DF from a Pandas DF without specifying a schema, schema > inference is used. This inference can fail when a column contains values of > two different types; this is ok. The problem is the error message does not > tell us in which column this happened. > When this happens, it is painful to debug since the error message is too > vague. > I plan on submitting a PR which fixes this, providing a better error message > for such cases, containing the column name (and possibly the problematic > values too). 
> >>> spark_session.createDataFrame(pandas_df) > File "redacted/pyspark/sql/session.py", line 541, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal > struct = self._inferSchemaFromList(data) > File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList > schema = reduce(_merge_type, map(_infer_schema, data)) > File "redacted/pyspark/sql/types.py", line 1124, in _merge_type > for f in a.fields] > File "redacted/pyspark/sql/types.py", line 1118, in _merge_type > raise TypeError("Can not merge type %s and %s" % (type(a), type(b))) > TypeError: Can not merge type and 'pyspark.sql.types.StringType'> > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
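The proposed improvement can be sketched without Spark: thread a field path through the recursive merge so the TypeError names the offending column. This is a simplified model of pyspark's `_merge_type` (types are plain Python types and structs are dicts), not the actual patch:

```python
def merge_type(a, b, name="root"):
    """Simplified stand-in for pyspark.sql.types._merge_type.

    Assumes both structs share the same field names; the real
    implementation also handles arrays, maps, and missing fields.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        # Recurse into struct fields, extending the path as we go.
        return {f: merge_type(a[f], b[f], name="%s.%s" % (name, f))
                for f in a}
    if a is not b:
        # Including the field path tells the user which column is bad,
        # instead of only which two types conflicted.
        raise TypeError("Can not merge type %s and %s in field %s"
                        % (a.__name__, b.__name__, name))
    return a
```

With this change, the traceback in the report would end in something like `Can not merge type float and str in field root.price`, pointing straight at the problematic column.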
[jira] [Assigned] (SPARK-22552) Cannot Union multiple kafka streams
[ https://issues.apache.org/jira/browse/SPARK-22552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-22552: - Assignee: (was: Shixiong Zhu) Priority: Minor (was: Major) Fix Version/s: (was: 2.2.2) (was: 2.3.0) > Cannot Union multiple kafka streams > --- > > Key: SPARK-22552 > URL: https://issues.apache.org/jira/browse/SPARK-22552 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sachin malhotra >Priority: Minor > > When unioning multiple kafka streams I learned that the resulting dataframe > only contains the data that exists in the dataframe that initiated the union > i.e. if df1.union(df2) (or a chaining of unions) the result will only contain > the rows that exist in df1. > Now to be more specific this occurs when data comes in during the same > micro-batch for all three streams. If you wait for each single row to be > processed for each stream the union does return the right results. > For example, if you have 3 kafka streams and you: > send message 1 to stream 1, WAIT for batch to finish, send message 2 to > stream 2, wait for batch to finish, send message 3 to stream 3, wait for > batch to finish. Union will return the right data. > But if you, > send message 1,2,3, WAIT for batch to finish, you only receive data in the > first stream when unioning all three dataframes -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22572) spark-shell does not re-initialize on :replay
[ https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260759#comment-16260759 ] Mark Petruska commented on SPARK-22572: --- [~srowen]: Can I ask you to look at https://github.com/apache/spark/pull/19791, please? > spark-shell does not re-initialize on :replay > - > > Key: SPARK-22572 > URL: https://issues.apache.org/jira/browse/SPARK-22572 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Mark Petruska >Priority: Minor > > Spark-shell does not run the re-initialization script when a `:replay` > command is issued: > {code} > $ ./bin/spark-shell > 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://192.168.1.3:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1511262066013). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sc > res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f > scala> :replay > Replaying: sc > :12: error: not found: value sc >sc >^ > scala> sc > :12: error: not found: value sc >sc >^ > scala> > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22572) spark-shell does not re-initialize on :replay
[ https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22572: Assignee: Apache Spark > spark-shell does not re-initialize on :replay > - > > Key: SPARK-22572 > URL: https://issues.apache.org/jira/browse/SPARK-22572 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Mark Petruska >Assignee: Apache Spark >Priority: Minor > > Spark-shell does not run the re-initialization script when a `:replay` > command is issued: > {code} > $ ./bin/spark-shell > 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://192.168.1.3:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1511262066013). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sc > res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f > scala> :replay > Replaying: sc > :12: error: not found: value sc >sc >^ > scala> sc > :12: error: not found: value sc >sc >^ > scala> > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22572) spark-shell does not re-initialize on :replay
[ https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22572: Assignee: (was: Apache Spark) > spark-shell does not re-initialize on :replay > - > > Key: SPARK-22572 > URL: https://issues.apache.org/jira/browse/SPARK-22572 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Mark Petruska >Priority: Minor > > Spark-shell does not run the re-initialization script when a `:replay` > command is issued: > {code} > $ ./bin/spark-shell > 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://192.168.1.3:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1511262066013). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sc > res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f > scala> :replay > Replaying: sc > :12: error: not found: value sc >sc >^ > scala> sc > :12: error: not found: value sc >sc >^ > scala> > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22572) spark-shell does not re-initialize on :replay
[ https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260756#comment-16260756 ] Apache Spark commented on SPARK-22572: -- User 'mpetruska' has created a pull request for this issue: https://github.com/apache/spark/pull/19791 > spark-shell does not re-initialize on :replay > - > > Key: SPARK-22572 > URL: https://issues.apache.org/jira/browse/SPARK-22572 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Mark Petruska >Priority: Minor > > Spark-shell does not run the re-initialization script when a `:replay` > command is issued: > {code} > $ ./bin/spark-shell > 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://192.168.1.3:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1511262066013). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sc > res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f > scala> :replay > Replaying: sc > :12: error: not found: value sc >sc >^ > scala> sc > :12: error: not found: value sc >sc >^ > scala> > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20133. --- Resolution: Not A Problem > User guide for spark.ml.stat.ChiSquareTest > -- > > Key: SPARK-20133 > URL: https://issues.apache.org/jira/browse/SPARK-20133 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Add new user guide section for spark.ml.stat, and document ChiSquareTest. > This may involve adding new example scripts. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22572) spark-shell does not re-initialize on :replay
[ https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260736#comment-16260736 ] Sean Owen commented on SPARK-22572: --- Yes, I think this is just how it is. The way Spark is initialized by injecting some code into the shell init means it's not available for replay. Have a look at the integration and see if you have bright ideas, but I recall concluding it wouldn't work. > spark-shell does not re-initialize on :replay > - > > Key: SPARK-22572 > URL: https://issues.apache.org/jira/browse/SPARK-22572 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Mark Petruska >Priority: Minor > > Spark-shell does not run the re-initialization script when a `:replay` > command is issued: > {code} > $ ./bin/spark-shell > 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://192.168.1.3:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1511262066013). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> sc > res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f > scala> :replay > Replaying: sc > :12: error: not found: value sc >sc >^ > scala> sc > :12: error: not found: value sc >sc >^ > scala> > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22528) History service and non-HDFS filesystems
[ https://issues.apache.org/jira/browse/SPARK-22528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260727#comment-16260727 ] paul mackles commented on SPARK-22528: -- In case anyone else bumps into this, I received some feedback from the data-lake team at MSFT: This is expected behavior since Hadoop supports Kerberos based identity whereas data lake supports OAuth2 – Azure active directory (AAD). The bridge/mapping between Kerberos and AAD OAuth2 is supported only in Azure HDInsight cluster today. OAuth2 support in Hadoop is non-trivial task and is in progress - https://issues.apache.org/jira/browse/HADOOP-11744 Workaround for the limitation is (Specific to data lake) core-site.xml {code} adl.debug.override.localuserasfileowner true {code} What does this configuration do ? FileStatus contains the user/group information which is associated with object id from AAD. Hadoop driver would replace object id with local Hadoop user under the context of Hadoop process. Actual file information in data lake remains unchanged though, only shadowed behind the local Hadoop user. > History service and non-HDFS filesystems > > > Key: SPARK-22528 > URL: https://issues.apache.org/jira/browse/SPARK-22528 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: paul mackles >Priority: Minor > > We are using Azure Data Lake (ADL) to store our event logs. This worked fine > in 2.1.x but in 2.2.0, the event logs are no longer visible to the history > server. I tracked it down to the call to: > {code} > SparkHadoopUtil.get.checkAccessPermission() > {code} > which was added to "FSHistoryProvider" in 2.2.0. > I was able to workaround it by: > * setting the files on ADL to world readable > * setting HADOOP_PROXY to the Azure objectId of the service principal that > owns file > Neither of those workaround are particularly desirable in our environment. 
> That said, I am not sure how this should be addressed: > * Is this an issue with the Azure/Hadoop bindings not setting up the user > context correctly so that the "checkAccessPermission()" call succeeds w/out > having to use the username under which the process is running? > * Is this an issue with "checkAccessPermission()" not really accounting for > all of the possible FileSystem implementations? If so, I would imagine that > there are similar issues when using S3. > In spite of this check, I know the files are accessible through the > underlying FileSystem object so it feels like the latter but I don't think > that the FileSystem object alone could be used to implement this check. > Any thoughts [~jerryshao]? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22568) Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey
[ https://issues.apache.org/jira/browse/SPARK-22568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260701#comment-16260701 ] Sean Owen commented on SPARK-22568: --- To filter individually you would have to collect distinct values first. But the other approaches here don't require that. If you don't need all values in memory at once, try repartitionAndSortWithinPartitions to encounter all values for a key in a row. > Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey > > > Key: SPARK-22568 > URL: https://issues.apache.org/jira/browse/SPARK-22568 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Éderson Cássio > Labels: features, performance, usability > > Sorry for any mistakes on filling this big form... it's my first issue here :) > Recently, I have the need to separate a RDD by some categorization. I was > able to accomplish that by some ways. > First, the obvious: mapping each element to a pair, with the key being the > category of the element. Then, using the good ol' {{groupByKey}}. > Listening to advices to avoid {{groupByKey}}, I failed to find another way > that was more efficient. I ended up (a) obtaining the distinct list of > element categories, (b) {{collect}} ing them and (c) making a call to > {{filter}} for each category. Of course, before all I {{cache}} d my initial > RDD. > So, I started to speculate: maybe it would be possible to make a number of > RDDs from an initial pair RDD _without the need to shuffle the data_. It > could be made by a kind of _local repartition_: first each partition is > splitted into various by key; then the master group the partitions with the > same key into a new RDD. The operation returns a List or array containing the > new RDDs. > It's just a conjecture, I don't know if it would be feasible in current Spark > Core architecture. But it would be great if it could be done. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
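The two alternatives discussed above — collecting distinct keys and filtering once per key, versus sorting so that all values for a key arrive contiguously (locally, what `repartitionAndSortWithinPartitions` gives you per partition) — can be modeled on plain Python pairs:

```python
from itertools import groupby
from operator import itemgetter

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Option 1: collect the distinct keys, then run one filter pass per key.
# On an RDD this costs one job per key over the cached data.
keys = {k for k, _ in pairs}
by_filter = {k: [v for kk, v in pairs if kk == k] for k in keys}

# Option 2: sort by key so every key's values are adjacent, then stream
# through them in one pass -- the local analogue of iterating a partition
# produced by repartitionAndSortWithinPartitions, with no per-key job
# and no requirement to hold all groups in memory at once.
by_sort = {k: [v for _, v in grp]
           for k, grp in groupby(sorted(pairs, key=itemgetter(0)),
                                 key=itemgetter(0))}

assert by_filter == by_sort  # same grouping, very different cost profile
```

The sketch only illustrates the access pattern; neither option materializes new RDDs the way the proposed "local repartition" would.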
[jira] [Resolved] (SPARK-22571) How to connect to secure(Kerberos) kafka broker using native KafkaConsumer api in Spark Streamming application
[ https://issues.apache.org/jira/browse/SPARK-22571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22571. --- Resolution: Invalid This should go to the mailing list. > How to connect to secure(Kerberos) kafka broker using native KafkaConsumer > api in Spark Streamming application > -- > > Key: SPARK-22571 > URL: https://issues.apache.org/jira/browse/SPARK-22571 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.6.2 > Environment: Hortonworks cluster(2.5x) with Kafka 0.10 and Spark 1.6.2 >Reporter: Ujjal Satpathy > > I have been trying to create a KafkaConsumer in SparkStreaming application in > order to fetch kafka partition details from a secured(Kerberos) kafka > cluster.So basically there are total 2 consumers(Spark consumer and > KafkaConsumer ) are running parallelly in yarn cluster mode.I am providing > below option in run command to connect to kafka brokers but the application > is not able to create native KafkaConsumer and throwing Jass configuration > not found.Spark consumer is working fine and successfully creating the > connection > Below are the configuration provided in spark-submit command => > --conf "spark.executor.extraJavaOptions > -Dlog4j.configuration=./log4j.properties > -Djava.security.auth.login.config=./consumer_jaas.conf > -Djava.security.krb5.conf=./consumer_krb5.conf" \ > --conf "spark.driver.extraJavaOptions > -Dlog4j.configuration=./log4j.properties > -Djava.security.auth.login.config=./consumer_jaas.conf > -Djava.security.krb5.conf=./consumer_krb5.conf" \ > I am running this application in yarn-cluster mode in Hortonworks cluster -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22569) Clean up caller of splitExpressions and addMutableState
[ https://issues.apache.org/jira/browse/SPARK-22569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22569. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19790 [https://github.com/apache/spark/pull/19790] > Clean up caller of splitExpressions and addMutableState > --- > > Key: SPARK-22569 > URL: https://issues.apache.org/jira/browse/SPARK-22569 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Clean the usage of `addMutableState` and `splitExpressions ` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22538) SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame
[ https://issues.apache.org/jira/browse/SPARK-22538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22538: Fix Version/s: (was: 2.2.2) 2.2.1 > SQLTransformer.transform(inputDataFrame) uncaches inputDataFrame > > > Key: SPARK-22538 > URL: https://issues.apache.org/jira/browse/SPARK-22538 > Project: Spark > Issue Type: Bug > Components: ML, PySpark, SQL, Web UI >Affects Versions: 2.2.0 >Reporter: MBA Learns to Code >Assignee: Liang-Chi Hsieh > Fix For: 2.2.1, 2.3.0 > > > When running the below code on PySpark v2.2.0, the cached input DataFrame df > disappears from SparkUI after SQLTransformer.transform(...) is called on it. > I don't yet know whether this is only a SparkUI bug, or the input DataFrame > df is indeed unpersisted from memory. If the latter is true, this can be a > serious bug because any new computation using new_df would have to re-do all > the work leading up to df. > {code} > import pandas > import pyspark > from pyspark.ml.feature import SQLTransformer > spark = pyspark.sql.SparkSession.builder.getOrCreate() > df = spark.createDataFrame(pandas.DataFrame(dict(x=[-1, 0, 1]))) > # after below step, SparkUI Storage shows 1 cached RDD > df.cache(); df.count() > # after below step, cached RDD disappears from SparkUI Storage > new_df = SQLTransformer(statement='SELECT * FROM __THIS__').transform(df) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22498) 64KB JVM bytecode limit problem with concat
[ https://issues.apache.org/jira/browse/SPARK-22498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22498: Fix Version/s: (was: 2.2.2) 2.2.1 > 64KB JVM bytecode limit problem with concat > --- > > Key: SPARK-22498 > URL: https://issues.apache.org/jira/browse/SPARK-22498 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{concat}} can throw an exception due to the 64KB JVM bytecode limit when > used with many arguments -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws
[ https://issues.apache.org/jira/browse/SPARK-22549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22549: Fix Version/s: (was: 2.2.2) 2.2.1 > 64KB JVM bytecode limit problem with concat_ws > -- > > Key: SPARK-22549 > URL: https://issues.apache.org/jira/browse/SPARK-22549 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when > used with many arguments -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22494) Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception
[ https://issues.apache.org/jira/browse/SPARK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22494: Fix Version/s: (was: 2.2.2) 2.2.1 > Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception > - > > Key: SPARK-22494 > URL: https://issues.apache.org/jira/browse/SPARK-22494 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido > Fix For: 2.2.1, 2.3.0 > > > Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception > when used with a lot of arguments and/or complex expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22501) 64KB JVM bytecode limit problem with in
[ https://issues.apache.org/jira/browse/SPARK-22501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22501: Fix Version/s: (was: 2.2.2) 2.2.1 > 64KB JVM bytecode limit problem with in > --- > > Key: SPARK-22501 > URL: https://issues.apache.org/jira/browse/SPARK-22501 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{In}} can throw an exception due to the 64KB JVM bytecode limit when > used with many arguments -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22550) 64KB JVM bytecode limit problem with elt
[ https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22550: Fix Version/s: (was: 2.2.2) 2.2.1 > 64KB JVM bytecode limit problem with elt > > > Key: SPARK-22550 > URL: https://issues.apache.org/jira/browse/SPARK-22550 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{elt}} can throw an exception due to the 64KB JVM bytecode limit when > used with many arguments -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
[ https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22508: Fix Version/s: (was: 2.2.2) 2.2.1 > 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() > - > > Key: SPARK-22508 > URL: https://issues.apache.org/jira/browse/SPARK-22508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB > JVM bytecode limit when used with a schema that has many fields -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22499) 64KB JVM bytecode limit problem with least and greatest
[ https://issues.apache.org/jira/browse/SPARK-22499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22499: Fix Version/s: (was: 2.2.2) 2.2.1 > 64KB JVM bytecode limit problem with least and greatest > --- > > Key: SPARK-22499 > URL: https://issues.apache.org/jira/browse/SPARK-22499 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > Both {{least}} and {{greatest}} can throw an exception due to the 64KB JVM > bytecode limit when used with many arguments -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22550) 64KB JVM bytecode limit problem with elt
[ https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22550: --- Assignee: Kazuaki Ishizaki > 64KB JVM bytecode limit problem with elt > > > Key: SPARK-22550 > URL: https://issues.apache.org/jira/browse/SPARK-22550 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.1, 2.3.0 > > > {{elt}} can throw an exception due to the 64KB JVM bytecode limit when > used with many arguments -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22550) 64KB JVM bytecode limit problem with elt
[ https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22550. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.2 Issue resolved by pull request 19778 [https://github.com/apache/spark/pull/19778] > 64KB JVM bytecode limit problem with elt > > > Key: SPARK-22550 > URL: https://issues.apache.org/jira/browse/SPARK-22550 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > Fix For: 2.2.2, 2.3.0 > > > {{elt}} can throw an exception due to the 64KB JVM bytecode limit when > used with many arguments -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22568) Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey
[ https://issues.apache.org/jira/browse/SPARK-22568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260616#comment-16260616 ] Éderson Cássio commented on SPARK-22568: Thanks for your response. How can I filter by each distinct key without knowing them beforehand? It's common to determine the keys by some processing routine. The only way I could filter by all of them was by doing a {{collect}}; before that, I don't even know how many there are. > Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey > > > Key: SPARK-22568 > URL: https://issues.apache.org/jira/browse/SPARK-22568 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Éderson Cássio > Labels: features, performance, usability > > Sorry for any mistakes on filling this big form... it's my first issue here :) > Recently, I needed to separate an RDD by some categorization. I was > able to accomplish that in a few ways. > First, the obvious: mapping each element to a pair, with the key being the > category of the element. Then, using the good ol' {{groupByKey}}. > Following advice to avoid {{groupByKey}}, I failed to find another way > that was more efficient. I ended up (a) obtaining the distinct list of > element categories, (b) {{collect}}ing them and (c) making a call to > {{filter}} for each category. Of course, before all this I {{cache}}d my initial > RDD. > So, I started to speculate: maybe it would be possible to make a number of > RDDs from an initial pair RDD _without the need to shuffle the data_. It > could be made by a kind of _local repartition_: first each partition is > split into several by key; then the master groups the partitions with the > same key into a new RDD. The operation returns a List or array containing the > new RDDs. > It's just a conjecture; I don't know if it would be feasible in the current Spark > Core architecture. But it would be great if it could be done. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
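The collect-distinct-keys-then-filter pattern discussed in this thread, and the proposed single-pass "local repartition", can be sketched in plain Python. This is an illustration only, with made-up sample data — real Spark code would use RDD APIs such as `rdd.keys().distinct().collect()` and `rdd.filter`:

```python
from collections import defaultdict

# Toy stand-in for a pair RDD: a list of (key, value) records.
pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Steps (a)+(b): discover the distinct keys first (in Spark this is the
# extra job that keys().distinct().collect() forces).
keys = sorted({k for k, _ in pairs})

# Step (c): one filter pass per key -- the pattern the reporter used.
split = {k: [(kk, v) for kk, v in pairs if kk == k] for k in keys}

# The proposed "local repartition" would instead split in a single pass:
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append((k, v))

# Both approaches yield the same per-key groups.
assert split == dict(grouped)
```

The single-pass version touches each record once, while the filter-per-key version scans the whole dataset once per distinct key — which is why the reporter cached the initial RDD first.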
[jira] [Updated] (SPARK-22573) SQL Planner is including unnecessary columns in the projection
[ https://issues.apache.org/jira/browse/SPARK-22573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajkishore Hembram updated SPARK-22573: --- Description: While I was running TPC-H query 18 for benchmarking, I observed that the query plan for Apache Spark 2.2.0 is inefficient than other versions of Apache Spark. I noticed that the other versions of Apache Spark (2.0.2 and 2.1.2) are only including the required columns in the projections. But the query planner of Apache Spark 2.2.0 is including unnecessary columns into the projection for some of the queries and hence unnecessarily increasing the I/O. And because of that the Apache Spark 2.2.0 is taking more time. [Spark 2.1.2 TPC-H Query 18 Plan|https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view] [Spark 2.2.0 TPC-H Query 18 Plan|https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view] TPC-H Query 18 {code:java} select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE desc,O_ORDERDATE {code} was: While I was running TPC-H query 18 for benchmarking, I observed that the query plan for Apache Spark 2.2.0 is inefficient than other versions of Apache Spark. I noticed that the other versions of Apache Spark (2.0.2 and 2.1.2) are only including the required columns in the projections. But the query planner of Apache Spark 2.2.0 is including unnecessary columns into the projection for some of the queries and hence unnecessarily increasing the I/O. And because of that the Apache Spark 2.2.0 is taking more time. 
[https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view] [https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view] TPC-H Query 18 {code:java} select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE desc,O_ORDERDATE {code} > SQL Planner is including unnecessary columns in the projection > -- > > Key: SPARK-22573 > URL: https://issues.apache.org/jira/browse/SPARK-22573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Rajkishore Hembram > > While I was running TPC-H query 18 for benchmarking, I observed that the > query plan for Apache Spark 2.2.0 is inefficient than other versions of > Apache Spark. I noticed that the other versions of Apache Spark (2.0.2 and > 2.1.2) are only including the required columns in the projections. But the > query planner of Apache Spark 2.2.0 is including unnecessary columns into the > projection for some of the queries and hence unnecessarily increasing the > I/O. And because of that the Apache Spark 2.2.0 is taking more time. 
> [Spark 2.1.2 TPC-H Query 18 > Plan|https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view] > [Spark 2.2.0 TPC-H Query 18 > Plan|https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view] > TPC-H Query 18 > {code:java} > select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) > from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from > LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = > O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by > C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE > desc,O_ORDERDATE > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22573) SQL Planner is including unnecessary columns in the projection
[ https://issues.apache.org/jira/browse/SPARK-22573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajkishore Hembram updated SPARK-22573: --- Description: While I was running TPC-H query 18 for benchmarking, I observed that the query plan for Apache Spark 2.2.0 is inefficient than other versions of Apache Spark. I noticed that the other versions of Apache Spark (2.0.2 and 2.1.2) are only including the required columns in the projections. But the query planner of Apache Spark 2.2.0 is including unnecessary columns into the projection for some of the queries and hence unnecessarily increasing the I/O. And because of that the Apache Spark 2.2.0 is taking more time. [https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view] [https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view] TPC-H Query 18 {code:java} select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE desc,O_ORDERDATE {code} was:While I was running TPC-H query 18 for benchmarking, I observed that the query plan for Apache Spark 2.2.0 is inefficient than other versions of Apache Spark. I noticed that the other versions of Apache Spark (2.0.2 and 2.1.2) are only including the required columns in the projections. But the query planner of Apache Spark 2.2.0 is including unnecessary columns into the projection for some of the queries and hence unnecessarily increasing the I/O. And because of that the Apache Spark 2.2.0 is taking more time. 
> SQL Planner is including unnecessary columns in the projection > -- > > Key: SPARK-22573 > URL: https://issues.apache.org/jira/browse/SPARK-22573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Rajkishore Hembram > > While I was running TPC-H query 18 for benchmarking, I observed that the > query plan for Apache Spark 2.2.0 is inefficient than other versions of > Apache Spark. I noticed that the other versions of Apache Spark (2.0.2 and > 2.1.2) are only including the required columns in the projections. But the > query planner of Apache Spark 2.2.0 is including unnecessary columns into the > projection for some of the queries and hence unnecessarily increasing the > I/O. And because of that the Apache Spark 2.2.0 is taking more time. > [https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view] > [https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view] > TPC-H Query 18 > {code:java} > select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) > from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from > LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = > O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by > C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE > desc,O_ORDERDATE > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22573) SQL Planner is including unnecessary columns in the projection
Rajkishore Hembram created SPARK-22573: -- Summary: SQL Planner is including unnecessary columns in the projection Key: SPARK-22573 URL: https://issues.apache.org/jira/browse/SPARK-22573 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Rajkishore Hembram While I was running TPC-H query 18 for benchmarking, I observed that the query plan from Apache Spark 2.2.0 is less efficient than those of other versions of Apache Spark. I noticed that the other versions of Apache Spark (2.0.2 and 2.1.2) include only the required columns in the projections. But the query planner of Apache Spark 2.2.0 is including unnecessary columns in the projection for some of the queries, unnecessarily increasing the I/O. Because of that, Apache Spark 2.2.0 takes more time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
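The I/O effect the reporter describes can be modeled in plain Python. This is an illustration only, with made-up row and column sizes, not Spark code: carrying an unneeded wide column through the plan inflates the bytes every downstream operator must process, which is why early projection pruning matters:

```python
# Each "row" carries a wide payload column the query never needs.
rows = [{"id": i, "name": f"cust{i}", "payload": "x" * 1000}
        for i in range(100)]

# A plan that prunes early projects only the required columns...
pruned = [{"id": r["id"], "name": r["name"]} for r in rows]

# ...while an unpruned plan drags the payload along, inflating I/O.
unpruned_bytes = sum(len(r["name"]) + len(r["payload"]) for r in rows)
pruned_bytes = sum(len(r["name"]) for r in pruned)
assert pruned_bytes < unpruned_bytes
```

In the linked query plans, the extra columns play the role of `payload` here: every exchange and join carries them even though the final SELECT never reads them.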
[jira] [Assigned] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
[ https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22508: --- Assignee: Kazuaki Ishizaki > 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() > - > > Key: SPARK-22508 > URL: https://issues.apache.org/jira/browse/SPARK-22508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.3.0, 2.2.2 > > > {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB > JVM bytecode limit when used with a schema that has many fields -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
[ https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22508. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.2 Issue resolved by pull request 19737 [https://github.com/apache/spark/pull/19737] > 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() > - > > Key: SPARK-22508 > URL: https://issues.apache.org/jira/browse/SPARK-22508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > Fix For: 2.2.2, 2.3.0 > > > {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB > JVM bytecode limit when used with a schema that has many fields -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22572) spark-shell does not re-initialize on :replay
Mark Petruska created SPARK-22572: - Summary: spark-shell does not re-initialize on :replay Key: SPARK-22572 URL: https://issues.apache.org/jira/browse/SPARK-22572 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.3.0 Reporter: Mark Petruska Priority: Minor Spark-shell does not run the re-initialization script when a `:replay` command is issued: {code} $ ./bin/spark-shell 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://192.168.1.3:4040 Spark context available as 'sc' (master = local[*], app id = local-1511262066013). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) Type in expressions to have them evaluated. Type :help for more information. scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f scala> :replay Replaying: sc :12: error: not found: value sc sc ^ scala> sc :12: error: not found: value sc sc ^ scala> {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE
[ https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260481#comment-16260481 ] Anurag Agarwal commented on SPARK-18859: Another workaround for Postgres databases is to create a dblink and query via dblink, because you then have to name the target schema explicitly. This avoids confusion for the Postgres driver and helps it identify nullability properly from a query. I tried the approach by [~mentegy] and it works too, but I did not have CREATE VIEW permission on the source database, which made me look for another approach. > Catalyst codegen does not mark column as nullable when it should. Causes NPE > > > Key: SPARK-18859 > URL: https://issues.apache.org/jira/browse/SPARK-18859 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.0.2 >Reporter: Mykhailo Osypov > > When joining two tables via LEFT JOIN, columns in the right table may be NULL, > but Catalyst codegen cannot recognize it. 
> Example: > {code:title=schema.sql|borderStyle=solid} > create table masterdata.testtable( > id int not null, > age int > ); > create table masterdata.jointable( > id int not null, > name text not null > ); > {code} > {code:title=query_to_select.sql|borderStyle=solid} > (select t.id, t.age, j.name from masterdata.testtable t left join > masterdata.jointable j on t.id = j.id) as testtable; > {code} > {code:title=master code|borderStyle=solid} > val df = sqlContext > .read > .format("jdbc") > .option("dbTable", "query to select") > > .load > //df generated schema > /* > root > |-- id: integer (nullable = false) > |-- age: integer (nullable = true) > |-- name: string (nullable = false) > */ > {code} > {code:title=Codegen|borderStyle=solid} > /* 038 */ scan_rowWriter.write(0, scan_value); > /* 039 */ > /* 040 */ if (scan_isNull1) { > /* 041 */ scan_rowWriter.setNullAt(1); > /* 042 */ } else { > /* 043 */ scan_rowWriter.write(1, scan_value1); > /* 044 */ } > /* 045 */ > /* 046 */ scan_rowWriter.write(2, scan_value2); > {code} > Since *j.name* is from right table of *left join* query, it may be null. 
> However generated schema doesn't think so (probably because it defined as > *name text not null*) > {code:title=StackTrace|borderStyle=solid} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
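The nullability point in this report can be illustrated with a toy left join in plain Python (made-up data, not the reporter's tables): even though the right table declares `name` NOT NULL, left rows with no match yield NULL for it, so the joined schema must treat the column as nullable:

```python
# Left table: (id, age); right table maps id -> name, declared NOT NULL
# in its own schema.
testtable = [(1, 30), (2, None), (3, 25)]
jointable = {1: "alice", 3: "carol"}

# A left join keeps every left row; dict.get returns None when the
# right side has no match -- exactly the NULL the codegen mishandled.
joined = [(i, age, jointable.get(i)) for i, age in testtable]
assert joined[1] == (2, None, None)  # id 2 has no match: name is None
```

The bug is that Catalyst trusted the source column's NOT NULL constraint and emitted a non-null write (`scan_rowWriter.write(2, scan_value2)` with no null check), hitting the NPE when the value was in fact NULL.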
[jira] [Resolved] (SPARK-22541) Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators
[ https://issues.apache.org/jira/browse/SPARK-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22541. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19787 [https://github.com/apache/spark/pull/19787] > Dataframes: applying multiple filters one after another using udfs and > accumulators results in faulty accumulators > -- > > Key: SPARK-22541 > URL: https://issues.apache.org/jira/browse/SPARK-22541 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 > Environment: pyspark 2.2.0, ubuntu >Reporter: Janne K. Olesen > Fix For: 2.3.0 > > > I'm using udf filters and accumulators to keep track of filtered rows in > dataframes. > If I'm applying multiple filters one after the other, they seem to be > executed in parallel, not in sequence, which messes with the accumulators i'm > using to keep track of filtered data. > {code:title=example.py|borderStyle=solid} > from pyspark.sql.functions import udf, col > from pyspark.sql.types import BooleanType > from pyspark.sql import SparkSession > spark = SparkSession.builder.getOrCreate() > sc = spark.sparkContext > df = spark.createDataFrame([("a", 1, 1), ("b", 2, 2), ("c", 3, 3)], ["key", > "val1", "val2"]) > def __myfilter(val, acc): > if val < 2: > return True > else: > acc.add(1) > return False > acc1 = sc.accumulator(0) > acc2 = sc.accumulator(0) > def myfilter1(val): > return __myfilter(val, acc1) > def myfilter2(val): > return __myfilter(val, acc2) > my_udf1 = udf(myfilter1, BooleanType()) > my_udf2 = udf(myfilter2, BooleanType()) > df.show() > # +---+++ > # |key|val1|val2| > # +---+++ > # | a| 1| 1| > # | b| 2| 2| > # | c| 3| 3| > # +---+++ > df = df.filter(my_udf1(col("val1"))) > # df.show() > # +---+++ > # |key|val1|val2| > # +---+++ > # | a| 1| 1| > # +---+++ > # expected acc1: 2 > # expected acc2: 0 > df = df.filter(my_udf2(col("val2"))) > # df.show() > # +---+++ > # |key|val1|val2| > # +---+++ > # | a| 
1| 1| > # +---+++ > # expected acc1: 2 > # expected acc2: 0 > df.show() > print("acc1: %s" % acc1.value) # expected 2, is 2 OK > print("acc2: %s" % acc2.value) # expected 0, is 2 !!! > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
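One facet of the behavior in this ticket can be modeled in plain Python. This is an illustration of lazy re-execution only, not of Spark's exact filter reordering: because an uncached DataFrame recomputes its whole lineage on every action, a side-effecting filter UDF bumps its accumulator once per action rather than once overall:

```python
# Plain-Python model of the accumulator surprise (not Spark code).
acc1 = {"n": 0}

def myfilter1(val):
    if val < 2:
        return True
    acc1["n"] += 1   # side effect, like acc.add(1) in the report
    return False

data = [1, 2, 3]

def action():
    # Each "action" recomputes the whole chain from the source data,
    # the way an uncached DataFrame does on every show()/count().
    return [v for v in data if myfilter1(v)]

action()                 # e.g. a first df.show()
action()                 # a second action re-runs the filter
assert acc1["n"] == 4    # 2 rows rejected, counted twice -- not 2
```

In the actual report there is the additional wrinkle that the second filter appears to run against unfiltered data, but the underlying lesson is the same: accumulator updates inside filter UDFs are not executed exactly once per logical row.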
[jira] [Assigned] (SPARK-22541) Dataframes: applying multiple filters one after another using udfs and accumulators results in faulty accumulators
[ https://issues.apache.org/jira/browse/SPARK-22541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22541: --- Assignee: Liang-Chi Hsieh > Dataframes: applying multiple filters one after another using udfs and > accumulators results in faulty accumulators > -- > > Key: SPARK-22541 > URL: https://issues.apache.org/jira/browse/SPARK-22541 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 > Environment: pyspark 2.2.0, ubuntu >Reporter: Janne K. Olesen >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > I'm using udf filters and accumulators to keep track of filtered rows in > dataframes. > If I'm applying multiple filters one after the other, they seem to be > executed in parallel, not in sequence, which messes with the accumulators i'm > using to keep track of filtered data. > {code:title=example.py|borderStyle=solid} > from pyspark.sql.functions import udf, col > from pyspark.sql.types import BooleanType > from pyspark.sql import SparkSession > spark = SparkSession.builder.getOrCreate() > sc = spark.sparkContext > df = spark.createDataFrame([("a", 1, 1), ("b", 2, 2), ("c", 3, 3)], ["key", > "val1", "val2"]) > def __myfilter(val, acc): > if val < 2: > return True > else: > acc.add(1) > return False > acc1 = sc.accumulator(0) > acc2 = sc.accumulator(0) > def myfilter1(val): > return __myfilter(val, acc1) > def myfilter2(val): > return __myfilter(val, acc2) > my_udf1 = udf(myfilter1, BooleanType()) > my_udf2 = udf(myfilter2, BooleanType()) > df.show() > # +---+++ > # |key|val1|val2| > # +---+++ > # | a| 1| 1| > # | b| 2| 2| > # | c| 3| 3| > # +---+++ > df = df.filter(my_udf1(col("val1"))) > # df.show() > # +---+++ > # |key|val1|val2| > # +---+++ > # | a| 1| 1| > # +---+++ > # expected acc1: 2 > # expected acc2: 0 > df = df.filter(my_udf2(col("val2"))) > # df.show() > # +---+++ > # |key|val1|val2| > # +---+++ > # | a| 1| 1| > # +---+++ > # expected acc1: 2 > # expected acc2: 0 > 
df.show() > print("acc1: %s" % acc1.value) # expected 2, is 2 OK > print("acc2: %s" % acc2.value) # expected 0, is 2 !!! > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22564) csv reader no longer logs errors
[ https://issues.apache.org/jira/browse/SPARK-22564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260428#comment-16260428 ] Adrian Bridgett commented on SPARK-22564: - I guess this should be closed as won't fix. Maybe a better note on https://spark.apache.org/releases/spark-release-2-2-0.html would have been nice :D > csv reader no longer logs errors > > > Key: SPARK-22564 > URL: https://issues.apache.org/jira/browse/SPARK-22564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrian Bridgett >Priority: Minor > > Since upgrading from 2.0.2 to 2.2.0 we no longer see any malformed CSV > warnings in the executor logs. It looks like this may be related to > https://issues.apache.org/jira/browse/SPARK-19949 (where > maxMalformedLogPerPartition was removed), as it still seems to be used but, > AFAICT, is not set anywhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22564) csv reader no longer logs errors
[ https://issues.apache.org/jira/browse/SPARK-22564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260424#comment-16260424 ] Adrian Bridgett commented on SPARK-22564: - Ah, I had an old copy of CSVRelation.scala from before the refactoring left in that directory, which caused my confusion. Yep - it appears to be gone, I'll look at using columnNameOfCorruptRecord instead. Thanks > csv reader no longer logs errors > > > Key: SPARK-22564 > URL: https://issues.apache.org/jira/browse/SPARK-22564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrian Bridgett >Priority: Minor > > Since upgrading from 2.0.2 to 2.2.0 we no longer see any malformed CSV > warnings in the executor logs. It looks like this may be related to > https://issues.apache.org/jira/browse/SPARK-19949 (where > maxMalformedLogPerPartition was removed), as it still seems to be used but, > AFAICT, is not set anywhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org