[jira] [Created] (SPARK-22047) HiveExternalCatalogVersionsSuite is Flaky on Jenkins
Armin Braun created SPARK-22047:
-----------------------------------

             Summary: HiveExternalCatalogVersionsSuite is Flaky on Jenkins
                 Key: SPARK-22047
                 URL: https://issues.apache.org/jira/browse/SPARK-22047
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Armin Braun

HiveExternalCatalogVersionsSuite has been failing quite a bit lately, e.g.:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/3490/testReport/junit/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/

{code}
Error Message

org.scalatest.exceptions.TestFailedException: spark-submit returned with exit code 1.
Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py'
2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class org.apache.spark.launcher.Main

Stacktrace

sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: spark-submit returned with exit code 1.
Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py'
2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class org.apache.spark.launcher.Main
	at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$class.fail(Assertions.scala:1089)
	at org.scalatest.FunSuite.fail(FunSuite.scala:1560)
	at org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:81)
	at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:38)
	at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:120)
	at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:105)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:105)
	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations
Armin Braun created SPARK-21970:
-----------------------------------

             Summary: Do a Project Wide Sweep for Redundant Throws Declarations
                 Key: SPARK-21970
                 URL: https://issues.apache.org/jira/browse/SPARK-21970
             Project: Spark
          Issue Type: Bug
          Components: Examples, Spark Core, SQL
    Affects Versions: 2.3.0
            Reporter: Armin Braun
            Priority: Trivial

Unfortunately, redundant throws declarations are not caught by Checkstyle, and there are quite a few of them in the current Java codebase. In at least one case, `ShuffleExternalSorter#closeAndGetSpills`, the redundant declaration also hides some dead code.

I think it's worthwhile to do a sweep for these and remove them.
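To make the problem concrete, here is a hypothetical illustration (not the actual `ShuffleExternalSorter` code): a `throws` clause that the method body can never trigger forces callers into unreachable catch blocks, which is exactly the kind of dead code the sweep would remove.

```java
import java.io.IOException;

public class RedundantThrows {
    // The 'throws IOException' below is redundant: nothing in the body can
    // throw it. (Hypothetical example, not taken from the Spark codebase.)
    static int size(int[] data) throws IOException {
        return data.length;
    }

    // Callers must still handle the exception that can never occur,
    // producing dead catch blocks like this one.
    static int sizeChecked(int[] data) {
        try {
            return size(data);
        } catch (IOException e) {
            // Dead code: unreachable in practice, required by the compiler
            // only because of the redundant throws declaration.
            throw new RuntimeException(e);
        }
    }
}
```

Removing the redundant `throws` lets the caller drop the try/catch entirely.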
[jira] [Created] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
Armin Braun created SPARK-21967:
-----------------------------------

             Summary: org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance
                 Key: SPARK-21967
                 URL: https://issues.apache.org/jira/browse/SPARK-21967
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Armin Braun
            Priority: Minor

org.apache.spark.unsafe.types.UTF8String#compareTo contains the following TODO:

{code}
int len = Math.min(numBytes, other.numBytes);
// TODO: compare 8 bytes as unsigned long
for (int i = 0; i < len; i ++) {
  // In UTF-8, the byte should be unsigned, so we should compare them as unsigned int.
{code}

The TODO should be resolved by comparing as many full 64-bit words as possible in this method before falling back to unsigned-int comparison.
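A minimal sketch of the idea (this is not the actual `UTF8String` implementation, which reads words directly from on-/off-heap memory): reading 8 bytes as a big-endian word and comparing with `Long.compareUnsigned` agrees with unsigned lexicographic byte order, so only the final few bytes need the byte-wise loop.

```java
public class WordCompare {
    // Lexicographic comparison of unsigned bytes, 8 bytes per step.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        int i = 0;
        // Compare as many full 64-bit words as possible. Big-endian word
        // order makes unsigned long comparison equivalent to comparing
        // the bytes one by one as unsigned values.
        for (; i + 8 <= len; i += 8) {
            long wa = bigEndianWord(a, i);
            long wb = bigEndianWord(b, i);
            if (wa != wb) {
                return Long.compareUnsigned(wa, wb);
            }
        }
        // Tail: fewer than 8 bytes left, compare them as unsigned ints.
        for (; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    private static long bigEndianWord(byte[] bytes, int offset) {
        long word = 0;
        for (int k = 0; k < 8; k++) {
            word = (word << 8) | (bytes[offset + k] & 0xFFL);
        }
        return word;
    }
}
```

The real code would additionally have to handle unaligned reads and the platform's native byte order (e.g. byte-reversing words on little-endian hardware rather than assembling them a byte at a time).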
[jira] [Commented] (SPARK-20201) Flaky Test: org.apache.spark.sql.catalyst.expressions.OrderingSuite
[ https://issues.apache.org/jira/browse/SPARK-20201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154975#comment-16154975 ]

Armin Braun commented on SPARK-20201:
-------------------------------------

This was resolved by this commit in June:
https://github.com/original-brownbear/spark/commit/b32b2123ddca66e00acf4c9d956232e07f779f9f#diff-4fe0e85423909b24c2a56287468271f1R138

> Flaky Test: org.apache.spark.sql.catalyst.expressions.OrderingSuite
> --
>
> Key: SPARK-20201
> URL: https://issues.apache.org/jira/browse/SPARK-20201
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Takuya Ueshin
> Priority: Minor
> Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2856/testReport/junit/org.apache.spark.sql.catalyst.expressions/OrderingSuite/SPARK_16845__GeneratedClass$SpecificOrdering_grows_beyond_64_KB/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.catalyst.expressions.OrderingSuite&test_name=SPARK-16845%3A+GeneratedClass%24SpecificOrdering+grows+beyond+64+KB
> Error Message
> {code}
> java.lang.StackOverflowError
> {code}
> {code}
> com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
> 	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
> 	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
> 	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> 	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> 	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:903)
> 	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:188)
> 	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:43)
> 	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:887)
> 	at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply$mcV$sp(OrderingSuite.scala:138)
> 	at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
> 	at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
> 	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> 	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> 	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 	at org.scalatest.Transformer.apply(Transformer.scala:22)
> 	at org.scalatest.Transformer.apply(Transformer.scala:20)
> 	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> 	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
> 	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> 	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> 	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> 	at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> 	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> 	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> 	at scala.collection.immutable.List.foreach(List.scala:381)
> 	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> 	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> 	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> 	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> 	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> 	at org.scalatest.Suite$class.run(Suite.scala:1424)
> 	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> 	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> 	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> 	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> 	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> 	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(Spa
[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982669#comment-15982669 ]

Armin Braun commented on SPARK-20336:
-------------------------------------

[~priancho] my bad apparently in the above. I can't retrace the exact version I ran on (maybe I mistakenly ran an old revision, sorry about that). But I see the same with `master` revision `31345fde82` from today.

{code}
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/04/25 12:14:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/25 12:14:57 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://192.168.178.57:4040
Spark context available as 'sc' (master = yarn, app id = application_1493115274587_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.option("wholeFile", true).option("header", true).csv("file:///tmp/sample.csv").show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|            テキスト|
|   3|   c|              텍스트|
|   4|   d|text テキスト 텍스트|
|   5|   e|                last|
+----+----+--------------------+
{code}

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
> Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True,
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|    |
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.
[jira] [Created] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"
Armin Braun created SPARK-20455:
-----------------------------------

             Summary: Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"
                 Key: SPARK-20455
                 URL: https://issues.apache.org/jira/browse/SPARK-20455
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 2.1.0
            Reporter: Armin Braun
            Priority: Minor

The doc at http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites is missing the `test` goal in the second line of the Maven build description. It should be:

{code}
./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11
{code}

Adding a PR now.
[jira] [Updated] (SPARK-20436) NullPointerException when restart from checkpoint file
[ https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Armin Braun updated SPARK-20436:
--------------------------------

Description:

I have written a Spark Streaming application that has two DStreams. The code is:

{code}
object KafkaTwoInkfk {
  def main(args: Array[String]) {
    val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
    val ssc = StreamingContext.getOrCreate(checkPointDir, () => createContext(args))
    ssc.start()
    ssc.awaitTermination()
  }

  def createContext(args : Array[String]) : StreamingContext = {
    val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong))
    ssc.checkpoint(checkPointDir)
    val topicArr1 = topic1.split(",")
    val topicSet1 = topicArr1.toSet
    val topicArr2 = topic2.split(",")
    val topicSet2 = topicArr2.toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers
    )
    val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet1)
    val words1 = lines1.map(_._2).flatMap(_.split(" "))
    val wordCounts1 = words1.map(x => { (x, 1L)}).reduceByKey(_ + _)
    wordCounts1.print()
    val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet2)
    val words2 = lines1.map(_._2).flatMap(_.split(" "))
    val wordCounts2 = words2.map(x => { (x, 1L)}).reduceByKey(_ + _)
    wordCounts2.print()
    return ssc
  }
}
{code}

When restarting from the checkpoint file, it throws a NullPointerException:

{code}
java.lang.NullPointerException
	at org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126)
	at org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
	at org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
	at org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
	at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.defaultWriteO
[jira] [Commented] (SPARK-20155) CSV-files with quoted quotes can't be parsed, if delimiter follows quoted quote
[ https://issues.apache.org/jira/browse/SPARK-20155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981166#comment-15981166 ] Armin Braun commented on SPARK-20155: - [~RPCMoritz] take a look at what I just found: https://issues.apache.org/jira/browse/SPARK-19834?focusedCommentId=15925375&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15925375 :) It's on the radar apparently > CSV-files with quoted quotes can't be parsed, if delimiter follows quoted > quote > --- > > Key: SPARK-20155 > URL: https://issues.apache.org/jira/browse/SPARK-20155 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.0.0 >Reporter: Rick Moritz > > According to : > https://tools.ietf.org/html/rfc4180#section-2 > 7. If double-quotes are used to enclose fields, then a double-quote >appearing inside a field must be escaped by preceding it with >another double quote. For example: >"aaa","b""bb","ccc" > This currently works as is, but the following does not: > "aaa","b""b,b","ccc" > while "aaa","b\"b,b","ccc" does get parsed. > I assume, this happens because quotes are currently being parsed in pairs, > and that somehow ends up unquoting delimiter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20155) CSV-files with quoted quotes can't be parsed, if delimiter follows quoted quote
[ https://issues.apache.org/jira/browse/SPARK-20155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun resolved SPARK-20155. - Resolution: Won't Fix > CSV-files with quoted quotes can't be parsed, if delimiter follows quoted > quote > --- > > Key: SPARK-20155 > URL: https://issues.apache.org/jira/browse/SPARK-20155 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.0.0 >Reporter: Rick Moritz > > According to : > https://tools.ietf.org/html/rfc4180#section-2 > 7. If double-quotes are used to enclose fields, then a double-quote >appearing inside a field must be escaped by preceding it with >another double quote. For example: >"aaa","b""bb","ccc" > This currently works as is, but the following does not: > "aaa","b""b,b","ccc" > while "aaa","b\"b,b","ccc" does get parsed. > I assume, this happens because quotes are currently being parsed in pairs, > and that somehow ends up unquoting delimiter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20155) CSV-files with quoted quotes can't be parsed, if delimiter follows quoted quote
[ https://issues.apache.org/jira/browse/SPARK-20155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981009#comment-15981009 ]

Armin Braun commented on SPARK-20155:
-------------------------------------

[~RPCMoritz] sorry, I was under the wrong assumption that quote escaping was enabled by default. See the difference with your example:

{code}
scala> spark.read.csv("file:///tmp/tmp2.tmp").show()
+---+-----+---+---+
|_c0|  _c1|_c2|_c3|
+---+-----+---+---+
|aaa|"b""b| b"|ccc|
+---+-----+---+---+

scala> spark.read.option("escape", "\"").csv("file:///tmp/tmp2.tmp").show()
+---+-----+---+
|_c0|  _c1|_c2|
+---+-----+---+
|aaa|b"b,b|ccc|
+---+-----+---+
{code}

I think this can be closed; I don't think changing the default behavior is an option here.

> CSV-files with quoted quotes can't be parsed, if delimiter follows quoted
> quote
> --
>
> Key: SPARK-20155
> URL: https://issues.apache.org/jira/browse/SPARK-20155
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.0.0
> Reporter: Rick Moritz
>
> According to:
> https://tools.ietf.org/html/rfc4180#section-2
> 7. If double-quotes are used to enclose fields, then a double-quote
> appearing inside a field must be escaped by preceding it with
> another double quote. For example:
> "aaa","b""bb","ccc"
> This currently works as is, but the following does not:
> "aaa","b""b,b","ccc"
> while "aaa","b\"b,b","ccc" does get parsed.
> I assume, this happens because quotes are currently being parsed in pairs,
> and that somehow ends up unquoting delimiter.
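To make the RFC 4180 doubled-quote rule discussed above concrete, here is a minimal single-line parser sketch (a toy illustration, not Spark's or its underlying CSV library's actual parser): inside a quoted field, `""` decodes to one literal quote and commas are not separators, so `"b""b,b"` must come out as the single field `b"b,b`.

```java
import java.util.ArrayList;
import java.util.List;

public class Rfc4180 {
    // Parse one CSV record per RFC 4180 quoting rules:
    // - a field may be enclosed in double quotes,
    // - a doubled quote ("") inside a quoted field is a literal quote,
    // - commas inside a quoted field are not field separators.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"'); // escaped quote: "" -> "
                        i++;
                    } else {
                        inQuotes = false; // closing quote of the field
                    }
                } else {
                    cur.append(c); // commas included verbatim while quoted
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote of the field
            } else if (c == ',') {
                fields.add(cur.toString()); // unquoted comma ends the field
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }
}
```

Spark's default parser only applies such decoding when the `escape` character is set to `"` (as shown in the comment above); by default it treats `\` as the escape character instead.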
[jira] [Commented] (SPARK-20155) CSV-files with quoted quotes can't be parsed, if delimiter follows quoted quote
[ https://issues.apache.org/jira/browse/SPARK-20155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980974#comment-15980974 ]

Armin Braun commented on SPARK-20155:
-------------------------------------

I was able to reproduce this:

{code}
"aaa","b\"b,b","ccc"
{code}

gives us

{code}
scala> spark.read.option("wholeFile", true).csv("file:///tmp/tmp2.csv").show()
+---+-----+---+
|_c0|  _c1|_c2|
+---+-----+---+
|aaa|b"b,b|ccc|
+---+-----+---+
{code}

while

{code}
"aaa","b""b,b","ccc"
{code}

gives us:

{code}
scala> spark.read.option("wholeFile", true).csv("file:///tmp/tmp2.csv").show()
+---+-----+---+---+
|_c0|  _c1|_c2|_c3|
+---+-----+---+---+
|aaa|"b""b| b"|ccc|
{code}

Will try to fix :)

> CSV-files with quoted quotes can't be parsed, if delimiter follows quoted
> quote
> --
>
> Key: SPARK-20155
> URL: https://issues.apache.org/jira/browse/SPARK-20155
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.0.0
> Reporter: Rick Moritz
>
> According to:
> https://tools.ietf.org/html/rfc4180#section-2
> 7. If double-quotes are used to enclose fields, then a double-quote
> appearing inside a field must be escaped by preceding it with
> another double quote. For example:
> "aaa","b""bb","ccc"
> This currently works as is, but the following does not:
> "aaa","b""b,b","ccc"
> while "aaa","b\"b,b","ccc" does get parsed.
> I assume, this happens because quotes are currently being parsed in pairs,
> and that somehow ends up unquoting delimiter.
[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980962#comment-15980962 ]

Armin Braun commented on SPARK-20336:
-------------------------------------

I just tried this on the latest Spark and Hadoop/YARN `2.6.3` and it looks fine to me with your file:

{code}
$ bin/spark-shell --master yarn --deploy-mode client

scala> spark.read.option("wholeFile", true).option("header", true).csv("file:///tmp/temp.csv").show()
+--------+----+----+
|    col1|col2|col3|
+--------+----+----+
|       1|   a|text|
|       2|   b|テキスト|
|       3|   c| 텍스트|
|       4|   d|text|
|テキスト|null|null|
|  텍스트"|null|null|
|       5|   e|last|
+--------+----+----+
{code}

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
> Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True,
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|    |
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.
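The broken cells in the wholeFile output (blanks and `�`) look like the file's bytes being decoded with the wrong charset. A small sketch of that failure mode (an illustration of the symptom only, not a diagnosis of the actual Spark code path): decoding UTF-8 bytes with a single-byte charset turns each byte of a multi-byte character into a separate garbage character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // Encode a string as UTF-8 bytes, then decode those bytes with the given
    // charset -- simulating a reader that guesses the encoding incorrectly.
    static String roundTrip(String text, String decodeCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(decodeCharset));
    }

    public static void main(String[] args) {
        // Correct charset: the original text survives the round trip.
        System.out.println(roundTrip("テキスト", "UTF-8"));
        // Wrong single-byte charset: every UTF-8 byte becomes its own
        // character, producing mojibake like the broken cells above.
        System.out.println(roundTrip("テキスト", "ISO-8859-1"));
    }
}
```

This is consistent with the observation that passing `encoding="UTF-8"` makes no difference: if the multi-line code path ignores the requested charset, the mangling happens regardless of the option.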
[jira] [Resolved] (SPARK-17280) Flaky test: org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite and JavaDirectKafkaStreamSuite.testKafkaStream
[ https://issues.apache.org/jira/browse/SPARK-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Armin Braun resolved SPARK-17280.
---------------------------------

Resolution: Fixed

Closing this; I can't find any recent examples of this on Jenkins and haven't experienced it locally as of late either. I also tried reproducing it by running 1k+ loops of all the Kafka 0.10 / 2.11 tests with 3 forks in parallel, without issues.

> Flaky test: org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite and
> JavaDirectKafkaStreamSuite.testKafkaStream
>
> Key: SPARK-17280
> URL: https://issues.apache.org/jira/browse/SPARK-17280
> Project: Spark
> Issue Type: Bug
> Components: DStreams, Tests
> Reporter: Yin Huai
>
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.2/1793
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/1793/
> {code}
> org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream
> Error Message
> assertion failed: Partition [topic1, 0] metadata not propagated after timeout
> Stacktrace
> java.util.concurrent.TimeoutException: assertion failed: Partition [topic1,
> 0] metadata not propagated after timeout
> 	at org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.createTopicAndSendData(JavaDirectKafkaStreamSuite.java:176)
> 	at org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream(JavaDirectKafkaStreamSuite.java:74)
> {code}
> {code}
> org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD
> Error Message
> Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most
> recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost):
> java.lang.AssertionError: assertion failed: Failed to get records for
> spark-executor-java-test-consumer--363965267-1472280538438 topic2 0 0 after
> polling for 512
> 	at scala.Predef$.assert(Predef.scala:170)
> 	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
> 	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
> 	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> 	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
> 	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
> 	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> Stacktrace
> org.apache.spark.SparkException:
> Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most
> recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost):
> java.lang.AssertionError: assertion failed: Failed to get records for
> spark-executor-java-test-consumer--363965267-1472280538438 topic2 0 0 after
> polling for 512
> 	at scala.Predef$.assert(Predef.scala:170)
> 	at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
> 	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
> 	at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> 	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
> 	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
> 	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
[jira] [Closed] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun closed SPARK-19592. --- Resolution: Won't Fix > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests > Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments > Reporter: Armin Braun > Priority: Minor > > This configuration for Surefire and Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general, it can at least be > removed for all system properties that simply result in a SparkConf setting, I > think. > Instead of having lines like > {code} > <spark.ui.enabled>false</spark.ui.enabled> > {code} > twice in the pom.xml > and once in SBT as > {code} > javaOptions in Test += "-Dspark.ui.enabled=false", > {code} > it would be a lot cleaner to simply have a > {code} > var conf: SparkConf > {code} > field in > {code} > org.apache.spark.SparkFunSuite > {code} > that has a SparkConf set up with all the shared configuration that > `systemProperties` currently provide. Obviously this cannot be done straight > away, given that > many subclasses of the parent suite do this, so I think it would be best to > simply add a method to the parent that provides this configuration for now > and start refactoring away duplication in other suite setups from there, step > by step, until the sys properties can be removed from the pom and build.sbt. > This makes the build a lot easier to maintain and makes tests more readable > by making the environment setup more explicit in the code. 
> (also it would allow running more tests straight from the IDE which is always > a nice thing imo) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
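The shared-conf idea proposed in the description above can be illustrated without Spark on the classpath. The following is a minimal, hypothetical plain-Java stand-in for the suggested `SparkFunSuite` field (class and method names are illustrative, not the actual Spark test API): the build-injected system properties become explicit in-code defaults that an individual suite can override.

```java
// Hypothetical sketch: the settings currently injected via pom.xml and the
// SBT build become explicit in-code defaults. Plain Java stand-in for
// SparkConf; all names here are illustrative.
import java.util.HashMap;
import java.util.Map;

public class SharedTestConf {
    // Shared defaults currently duplicated in pom.xml and the SBT build.
    static Map<String, String> defaults() {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.ui.enabled", "false");
        conf.put("spark.master.rest.enabled", "false");
        return conf;
    }

    // A suite starts from the shared defaults and overrides explicitly,
    // so every deviation from the baseline is visible in the test code.
    static Map<String, String> withOverride(String key, String value) {
        Map<String, String> conf = defaults();
        conf.put(key, value);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(withOverride("spark.ui.enabled", "true").get("spark.ui.enabled"));
        System.out.println(defaults().get("spark.ui.enabled"));
    }
}
```

With this shape, the `-Dspark.ui.enabled=false` lines in the two build files would eventually become redundant, which is the incremental migration the ticket proposes.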
[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869985#comment-15869985 ] Armin Braun commented on SPARK-19592: - I see your point on these two: {quote} Isn't this going to mean changing every single test suite? That is to say I could kind of imagine a broader cleanup and refactoring of test state. A big change just to remove a few lines of config doesn't seem worth it. {quote} Yea it will obviously require wider changes (but see below). Looks to me like this would be a valid start for cleaning up tests state in general. {quote} Ideally that's cleaned up all in one go or not. {quote} I mean you could go testsuite by testsuite and eventually drop the properties being injected by the build system. Doing this all in one go would admittedly be a big change. Even an incremental approach (doing this step by step and having the test setup inside ScalaTest be redundant) would be worth it in my opinion though ... would already make the test env more readable (and less importantly but nice to have ... runnable from the IDE) wouldn't it? > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Minor > > This configuration for Surefire, Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general it can at least be > removed for all system properties that simply result in a SparkConf setting I > think. 
[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869933#comment-15869933 ] Armin Braun commented on SPARK-19592: - [~srowen] could I convince you, or would it be better to drop this one? :) > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests > Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments > Reporter: Armin Braun > Priority: Minor 
[jira] [Resolved] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun resolved SPARK-19275. - Resolution: Not A Problem > Spark Streaming, Kafka receiver, "Failed to get records for ... after polling > for 512" > -- > > Key: SPARK-19275 > URL: https://issues.apache.org/jira/browse/SPARK-19275 > Project: Spark > Issue Type: Bug > Components: DStreams > Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11 > Reporter: Dmitry Ochnev > > We have a Spark Streaming application reading records from Kafka 0.10. > Some tasks fail with the following error: > "java.lang.AssertionError: assertion failed: Failed to get records for (...) > after polling for 512" > The first attempt fails and the second attempt (retry) completes > successfully - this is the pattern we see for many tasks in our logs. > These failures and retries consume resources. > A similar case with a stack trace is described here: > https://www.mail-archive.com/user@spark.apache.org/msg56564.html > https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767 > Here is the line from the stack trace where the error is raised: > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > We tried several values for "spark.streaming.kafka.consumer.poll.ms" - 2, 5, > 10, 30 and 60 seconds - and the error appeared in all cases except the > last one. Moreover, increasing the threshold increased total Spark > stage duration. > In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to > fewer task failures but at the cost of longer stages, which is bad for > performance when processing data streams. > We suspect there is a bug in CachedKafkaConsumer (and/or other > related classes) which inhibits the reading process. 
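The tuning described in the SPARK-19275 report above boils down to a single conf key passed at submit time. A hypothetical invocation sketch, given as a config fragment (the application jar name is a placeholder; per the report, only the 60-second value avoided the assertion, at the cost of longer stages):

```shell
# Hypothetical sketch: raising the Kafka consumer poll timeout.
# 60000 ms (60 s) was the only value the reporter tried that avoided
# "Failed to get records ... after polling"; lower values (2-30 s) still
# failed. "your-streaming-app.jar" is a placeholder name.
./bin/spark-submit \
  --conf spark.streaming.kafka.consumer.poll.ms=60000 \
  your-streaming-app.jar
```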
[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866428#comment-15866428 ] Armin Braun commented on SPARK-19592: - Imo this also relates to the ability to handle https://issues.apache.org/jira/browse/SPARK-8985 in a clean way btw. > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Minor > > This configuration for Surefire, Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general it can at least be > removed for all system properties that simply result in a SparkConf setting I > think. > Instead of having lines like > {code} > false > {code} > twice in the pom.xml > and once in SBT as > {code} > javaOptions in Test += "-Dspark.ui.enabled=false", > {code} > it would be a lot cleaner to simply have a > {code} > var conf: SparkConf > {code} > field in > {code} > org.apache.spark.SparkFunSuite > {code} > that has SparkConf set up with all the shared configuration that > `systemProperties` currently provide. Obviously this cannot be done straight > away given that > many subclasses of the parent suit do this, so I think it would be best to > simply add a method to the parent that provides this configuration for now > and start refactoring away duplication in other suit setups from there step > by step until the sys properties can be removed from the pom and sbt.build. > This makes the build a lot easier to maintain and makes tests more readable > by making the environment setup more explicit in the code. 
[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866367#comment-15866367 ] Armin Braun commented on SPARK-19592: - [~srowen] {quote} What about tests that make their own conf or need to? {quote} Those tests in particular made me interested in this, for correctness/readability reasons. Maybe an example helps :) In _org.apache.spark.streaming.InputStreamsSuite_ the conf is set up in the parent suite via just {code} val conf = new SparkConf() .setMaster(master) .setAppName(framework) {code} Now, if you run that suite from the IDE, one of the tests fails with an apparent error in the logic: {code} The code passed to eventually never returned normally. Attempted 664 times over 10.01260721901 seconds. Last failure message: 10 did not equal 5. {code} You debug it and find out that a _StreamingListener_ is added to the context twice, because the test manually adds one that is already on the context. The reason is that it is also added by the UI when _spark.ui.enabled_ is left at its default of _true_. So you now have a seemingly redundant line of code in a bunch of tests: {code} ssc.addStreamingListener(ssc.progressListener) {code} ... that looks wrong given the configuration you see when you just read the code, and that requires you to also consider (and maintain) what Maven or SBT is injecting in terms of environment. --- So I think the tests that make their own config are the most troublesome, since they have non-standard defaults injected. In my opinion it would be a lot easier to work with if the defaults were simply the standard production environment defaults when I create a new instance of SparkConf, and all deviation from that were explicit in the code. I agree it's not a big pain, but it is still a quality issue worth fixing (imo). 
Reduces maintenance effort through DRYer test configs and makes tests easier to read, as in the example above. 
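The duplicate-listener symptom described in the comment above can be reduced to a few lines. The following is a hypothetical plain-Java stand-in (names are illustrative, not the actual StreamingListenerBus API): when the "UI" registers the progress listener implicitly and the test registers it again, every event is counted twice, which is the shape of the "10 did not equal 5" failure.

```java
// Plain-Java stand-in for the duplicate-listener effect: the bus does not
// deduplicate registrations, so the same listener added twice fires twice
// per event. All names are illustrative.
import java.util.ArrayList;
import java.util.List;

public class DuplicateListenerDemo {
    interface Listener { void onBatchCompleted(); }

    static class ListenerBus {
        private final List<Listener> listeners = new ArrayList<>();
        void add(Listener l) { listeners.add(l); } // no dedup on registration
        void post() { for (Listener l : listeners) l.onBatchCompleted(); }
    }

    public static void main(String[] args) {
        int[] received = {0};
        Listener progressListener = () -> received[0]++;
        // The default injected by the build, invisible in the test code itself.
        boolean uiEnabled = true;

        ListenerBus bus = new ListenerBus();
        if (uiEnabled) bus.add(progressListener); // registered implicitly by the "UI"
        bus.add(progressListener);                // registered again by the "test"
        bus.post();
        System.out.println(received[0]);          // each event counted twice
    }
}
```

With the shared-conf proposal, `uiEnabled` would be an explicit in-code default rather than a build-injected system property, so the double registration would be visible from the test source alone.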
[jira] [Updated] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun updated SPARK-19592: Affects Version/s: 2.2.0 > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Minor > > This configuration for Surefire, Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general it can at least be > removed for all system properties that simply result in a SparkConf setting I > think. > Instead of having lines like > {code} > false > {code} > twice in the pom.xml > and once in SBT as > {code} > javaOptions in Test += "-Dspark.ui.enabled=false", > {code} > it would be a lot cleaner to simply have a > {code} > var conf: SparkConf > {code} > field in > {code} > org.apache.spark.SparkFunSuite > {code} > that has SparkConf set up with all the shared configuration that > `systemProperties` currently provide. Obviously this cannot be done straight > away given that > many subclasses of the parent suit do this, so I think it would be best to > simply add a method to the parent that provides this configuration for now > and start refactoring away duplication in other suit setups from there step > by step until the sys properties can be removed from the pom and sbt.build. > This makes the build a lot easier to maintain and makes tests more readable > by making the environment setup more explicit in the code. 
[jira] [Updated] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun updated SPARK-19592: Description: This configuration for Surefire, Scalatest is duplicated in the parent POM as well as the SBT build. While this duplication cannot be removed in general it can at least be removed for all system properties that simply result in a SparkConf setting I think. Instead of having lines like {code} false {code} twice in the pom.xml and once in SBT as {code} javaOptions in Test += "-Dspark.ui.enabled=false", {code} it would be a lot cleaner to simply have a {code} var conf: SparkConf {code} field in {code} org.apache.spark.SparkFunSuite {code} that has SparkConf set up with all the shared configuration that `systemProperties` currently provide. Obviously this cannot be done straight away given that many subclasses of the parent suit do this, so I think it would be best to simply add a method to the parent that provides this configuration for now and start refactoring away duplication in other suit setups from there step by step until the sys properties can be removed from the pom and sbt.build. This makes the build a lot easier to maintain and makes tests more readable by making the environment setup more explicit in the code. (also it would allow running more tests straight from the IDE which is always a nice thing imo) was: This configuration for Surefire, Scalatest is duplicated in the parent POM as well as the SBT build. While this duplication cannot be removed in general it can at least be removed for all system properties that simply result in a SparkConf setting I think. 
Instead of having lines like {code} false {code} twice in the pom.xml and once in SBT as {code} javaOptions in Test += "-Dspark.ui.enabled=false", {code} it would be a lot cleaner to simply have a {code} var conf: SparkConf {code} field in {code} org.apache.spark.SparkFunSuite {code} that has SparkConf set up with all the shared configuration that `systemProperties` currently provide. This makes the build a lot easier to maintain and makes tests more readable by making the environment setup more explicit in the code. (also it would allow running more tests straight from the IDE which is always a nice thing imo) > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Minor > > This configuration for Surefire, Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general it can at least be > removed for all system properties that simply result in a SparkConf setting I > think. > Instead of having lines like > {code} > false > {code} > twice in the pom.xml > and once in SBT as > {code} > javaOptions in Test += "-Dspark.ui.enabled=false", > {code} > it would be a lot cleaner to simply have a > {code} > var conf: SparkConf > {code} > field in > {code} > org.apache.spark.SparkFunSuite > {code} > that has SparkConf set up with all the shared configuration that > `systemProperties` currently provide. 
[jira] [Updated] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun updated SPARK-19592: Description: This configuration for Surefire, Scalatest is duplicated in the parent POM as well as the SBT build. While this duplication cannot be removed in general it can at least be removed for all system properties that simply result in a SparkConf setting I think. Instead of having lines like {code} false {code} twice in the pom.xml and once in SBT as {code} javaOptions in Test += "-Dspark.ui.enabled=false", {code} it would be a lot cleaner to simply have a {code} var conf: SparkConf {code} field in {code} org.apache.spark.SparkFunSuite {code} that has SparkConf set up with all the shared configuration that `systemProperties` currently provide. This makes the build a lot easier to maintain and makes tests more readable by making the environment setup more explicit in the code. (also it would allow running more tests straight from the IDE which is always a nice thing imo) was: This configuration for Surefire, Scalatest is duplicated in the parent POM as well as the SBT build. While this duplication cannot be removed in general it can at least be removed for all system properties that simply result in a SparkConf setting I think. Instead of having lines like {code} false {code} twice in the pom.xml and once in SBT as {code} javaOptions in Test += "-Dspark.ui.enabled=false", {code} it would be a lot cleaner to simply have a {code} var conf: SparkConf {code} field in {code} org.apache.spark.SparkFunSuite {code} that has SparkConf set up with all the shared configuration that `systemProperties` currently provide. This makes the build a lot easier to maintain and makes tests more readable by making the environment setup more explicit in the code. 
[jira] [Updated] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun updated SPARK-19592: Description: This configuration for Surefire, Scalatest is duplicated in the parent POM as well as the SBT build. While this duplication cannot be removed in general it can at least be removed for all system properties that simply result in a SparkConf setting I think. Instead of having lines like {code} false {code} twice in the pom.xml and once in SBT as {code} javaOptions in Test += "-Dspark.ui.enabled=false", {code} it would be a lot cleaner to simply have a {code} var conf: SparkConf {code} field in {code} org.apache.spark.SparkFunSuite {code} that has SparkConf set up with all the shared configuration that `systemProperties` currently provide. This makes the build a lot easier to maintain and makes tests more readable by making the environment setup more explicit in the code. was: This configuration for Surefire, Scalatest is duplicated in the parent POM as well as the SBT build. While this duplication cannot be removed in general it can at least be removed for all system properties that simply result in a SparkConf setting I think. Instead of having lines like {code} false {code} twice in the pom.xml and once in SBT as {code} javaOptions in Test += "-Dspark.ui.enabled=false", {code} it would be a lot cleaner to simply have a `conf` field in `org.apache.spark.SparkFunSuite` that has SparkConf set up with all the shared configuration that `systemProperties` currently provide. This makes the build a lot easier to maintain and makes tests more readable by making the environment setup more explicit in the code. 
[jira] [Created] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
Armin Braun created SPARK-19592: --- Summary: Duplication in Test Configuration Relating to SparkConf Settings Should be Removed Key: SPARK-19592 URL: https://issues.apache.org/jira/browse/SPARK-19592 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.1.0 Environment: Applies to all Environments Reporter: Armin Braun Priority: Minor This configuration for Surefire and Scalatest is duplicated in the parent POM as well as the SBT build. While this duplication cannot be removed in general, it can at least be removed for all system properties that simply result in a SparkConf setting, I think. Instead of having lines like {code} <spark.ui.enabled>false</spark.ui.enabled> {code} twice in the pom.xml and once in SBT as {code} javaOptions in Test += "-Dspark.ui.enabled=false", {code} it would be a lot cleaner to simply have a `conf` field in `org.apache.spark.SparkFunSuite` that has a SparkConf set up with all the shared configuration that `systemProperties` currently provide. This makes the build a lot easier to maintain and makes tests more readable by making the environment setup more explicit in the code.
[jira] [Commented] (SPARK-19562) Gitignore Misses Folder dev/pr-deps
[ https://issues.apache.org/jira/browse/SPARK-19562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862954#comment-15862954 ] Armin Braun commented on SPARK-19562: - PR added https://github.com/apache/spark/pull/16904 > Gitignore Misses Folder dev/pr-deps > --- > > Key: SPARK-19562 > URL: https://issues.apache.org/jira/browse/SPARK-19562 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Trivial > > It's basically in the title. > Running the build and tests as instructed by the Readme creates the folder > `dev/pr-deps` that is not covered by the gitignore leaving us with this: > {code:none} > ➜ spark git:(master) ✗ git status > > On branch master > Your branch is up-to-date with 'origin/master'. > Untracked files: > (use "git add ..." to include in what will be committed) > dev/pr-deps/ > {code} > I think that folder should be added to the gitignore. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19562) Gitignore Misses Folder dev/pr-deps
[ https://issues.apache.org/jira/browse/SPARK-19562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Armin Braun updated SPARK-19562:

Description:
It's basically in the title. Running the build and tests as instructed by the README creates the folder `dev/pr-deps`, which is not covered by the gitignore, leaving us with this:
{code:none}
➜ spark git:(master) ✗ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
(use "git add ..." to include in what will be committed)
dev/pr-deps/
{code}
I think that folder should be added to the gitignore.

was:
It's basically in the title. Running the build and tests as instructed by the README creates the folder `dev/pr-deps`, which is not covered by the gitignore, leaving us with this:
{code:bash}
➜ spark git:(master) ✗ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
(use "git add ..." to include in what will be committed)
dev/pr-deps/
{code}
I think that folder should be added to the gitignore.

> Gitignore Misses Folder dev/pr-deps
> ---
>
> Key: SPARK-19562
> URL: https://issues.apache.org/jira/browse/SPARK-19562
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0
> Environment: Applies to all Environments
> Reporter: Armin Braun
> Priority: Trivial
>
> It's basically in the title.
> Running the build and tests as instructed by the README creates the folder
> `dev/pr-deps`, which is not covered by the gitignore, leaving us with this:
> {code:none}
> ➜ spark git:(master) ✗ git status
> On branch master
> Your branch is up-to-date with 'origin/master'.
> Untracked files:
> (use "git add ..." to include in what will be committed)
> dev/pr-deps/
> {code}
> I think that folder should be added to the gitignore.
[jira] [Created] (SPARK-19562) Gitignore Misses Folder dev/pr-deps
Armin Braun created SPARK-19562:
---
Summary: Gitignore Misses Folder dev/pr-deps
Key: SPARK-19562
URL: https://issues.apache.org/jira/browse/SPARK-19562
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.1.0
Environment: Applies to all Environments
Reporter: Armin Braun
Priority: Trivial

It's basically in the title. Running the build and tests as instructed by the README creates the folder `dev/pr-deps`, which is not covered by the gitignore, leaving us with this:
{code:bash}
➜ spark git:(master) ✗ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
(use "git add ..." to include in what will be committed)
dev/pr-deps/
{code}
I think that folder should be added to the gitignore.
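The fix proposed in this issue amounts to one extra line in the repository's .gitignore. A minimal sketch, assuming the working directory is the Spark repository root:

```shell
# Append the untracked build output folder to .gitignore, but only if it is
# not already listed, so re-running the command never adds a duplicate entry.
grep -qxF 'dev/pr-deps/' .gitignore || echo 'dev/pr-deps/' >> .gitignore
```

After this, `git status` no longer reports `dev/pr-deps/` as untracked, which is exactly what the description asks for.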