[jira] [Commented] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option
[ https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501026#comment-17501026 ] William Shen commented on SPARK-36163: -- Any interest in backporting this to 3.1.x and 3.2.x? > Propagate correct JDBC properties in JDBC connector provider and add > "connectionProvider" option > > > Key: SPARK-36163 > URL: https://issues.apache.org/jira/browse/SPARK-36163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.1.1, 3.1.2 >Reporter: Ivan Sadikov >Assignee: Ivan >Priority: Major > Fix For: 3.3.0 > > > There are a couple of issues with JDBC connection providers. The first is a > bug caused by > [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653] > where we would pass all properties, including JDBC data source keys, to the > JDBC driver which results in errors like {{java.sql.SQLException: > Unrecognized connection property 'url'}}. > Connection properties are supposed to only include vendor properties, url > config is a JDBC option and should be excluded. > The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with > {{jdbcOptions.asConnectionProperties.asScala.foreach}} which is > java.sql.Driver friendly. > > I also investigated the problem with multiple providers and I think there are > a couple of oversights in {{ConnectionProvider}} implementation. I think it > is missing two things: > * Any {{JdbcConnectionProvider}} should take precedence over > {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be > selected if there was no match found when inferring providers that can handle > JDBC url. > * There is currently no way to select a specific provider that you want, > similar to how you can select a JDBC driver. The use case is, for example, > having connection providers for two databases that handle the same URL but > have slightly different semantics and you want to select one in one case and > the other one in others. > ** I think the first point could be discarded when the second one is > addressed. > You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to > exclude ones that don’t need to be included, but I am not quite sure why it > was done that way - it is much simpler to allow users to enforce the provider > they want. > This ticket fixes it by adding a {{connectionProvider}} option to the JDBC > data source that allows users to select a particular provider when the > ambiguity arises. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6238) Support shuffle where individual blocks might be > 2G
[ https://issues.apache.org/jira/browse/SPARK-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479645#comment-16479645 ] William Shen commented on SPARK-6238: - [~irashid], you marked this as a duplicate. Can you also mark which ticket this duplicates? I see that you marked this as duplicated by SPARK-5928. Did you mean that this duplicates SPARK-5928? (or if this duplicates SPARK-19659 per your comment). Thanks in advance for the clarification! > Support shuffle where individual blocks might be > 2G > - > > Key: SPARK-6238 > URL: https://issues.apache.org/jira/browse/SPARK-6238 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: jin xing >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464553#comment-16464553 ] William Shen commented on SPARK-5928: - [~UZiVcbfPXaNrMtT], if you increase parallelism (having more partitions) over your massive data size, would that be able to reduce the size for each partition to work around this issue? > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid >Priority: Major > Attachments: image-2018-03-29-11-52-32-075.png > > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode, it only happens on > remote fetches. I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame > length exceeds 2147483647: 3021252889 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedK
[jira] [Commented] (SPARK-17888) Memory leak in streaming driver when use SparkSQL in Streaming
[ https://issues.apache.org/jira/browse/SPARK-17888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323260#comment-16323260 ] William Shen commented on SPARK-17888: -- [~FireThief], did you ever find out what was wrong? We are encountering driver heap increase in 1.6 streaming, with a very similar heap histogram with lots of Long and LongSQLMetricValue {code} num #instances #bytes class name -- 1: 91155998 4375487904 scala.collection.immutable.HashMap$HashMap1 2: 64295181 4028440384 [C 3: 96126890 2307045360 java.lang.Long 4: 89875107 2157002568 org.apache.spark.sql.execution.metric.LongSQLMetricValue 5: 63941246 2046119872 java.lang.String 6: 30106557 1685967192 org.apache.spark.scheduler.AccumulableInfo 7: 49762250 1592392000 scala.collection.immutable.$colon$colon 8: 25223481 1529732184 [Lscala.collection.immutable.HashMap; 9: 2219699 1016786672 [B 10: 25222833 807130656 scala.collection.immutable.HashMap$HashTrieMap 11: 5624636 794226288 [Ljava.lang.Object; 12: 31443622 754646928 scala.Some 13: 10954338 525808224 java.util.Hashtable$Entry 14: 10915201 523929648 java.util.concurrent.ConcurrentHashMap$Node 15:292754 461068288 [I 16: 26986 440039088 [Ljava.util.concurrent.ConcurrentHashMap$Node; 17: 35656 174934384 [Ljava.util.Hashtable$Entry; 18: 3227318 103274176 org.apache.spark.sql.catalyst.trees.Origin 19: 2669050 85409600 scala.collection.mutable.ArrayBuffer {code} > Memory leak in streaming driver when use SparkSQL in Streaming > -- > > Key: SPARK-17888 > URL: https://issues.apache.org/jira/browse/SPARK-17888 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.2 > Environment: scala 2.10.4 > java 1.7.0_71 >Reporter: weilin.chen > Labels: leak, memory > > Hi > I have a little program of spark 1.5, it receive data from a publisher in > spark streaming. It will process these received data with spark sql. But when > the time goes by I found the memory leak in driver, so i update to spark > 1.6.2. But, there is no change in the situation. > here is the code: > {quote} > val lines = ssc.receiverStream(new RReceiver("10.0.200.15", 6380, > "subresult")) > val jsonf = > lines.map(JSON.parseFull(_)).map(_.get.asInstanceOf[scala.collection.immutable.Map[String, > Any]]) > val logs = jsonf.map(data => LogStashV1(data("message").toString, > data("path").toString, data("host").toString, > data("lineno").toString.toDouble, data("timestamp").toString)) > logs.foreachRDD( rdd => { > import sqc.implicits._ > rdd.toDF.registerTempTable("logstash") > val sqlreport0 = sqc.sql("SELECT message, COUNT(message) AS host_c, > SUM(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND > lineno > 70 GROUP BY message ORDER BY host_c DESC LIMIT 100") > sqlreport0.map(t => AlertMsg(t(0).toString, t(1).toString.toInt, > t(2).toString.toDouble)).collect().foreach(println) > sqlreport0.map(t => AlertMsg(t(0).toString, t(1).toString.toInt, > t(2).toString.toDouble)).collect().foreach(println) > {quote} > jmap information: > {quote} > num #instances #bytes class name > -- >1: 34819 72711952 [B >2: 2297557 66010656 [C >3: 2296294 55111056 java.lang.String >4: 1063491 42539640 org.apache.spark.scheduler.AccumulableInfo >5: 1251001 40032032 > scala.collection.immutable.HashMap$HashMap1 >6: 1394364 33464736 java.lang.Long >7: 1102516 26460384 scala.collection.immutable.$colon$colon >8: 1058202 25396848 > org.apache.spark.sql.execution.metric.LongSQLMetricValue >9: 1266499 20263984 scala.Some > 10:124052 15889104 > 11:124052 15269568 > 12: 11350 12082432 > 13: 11350 11692880 > 14: 96682 10828384 org.apache.spark.executor.TaskMetrics > 15:2334819505896 [Lscala.collection.immutable.HashMap; > 16: 966826961104 org.apache.spark.scheduler.TaskInfo > 17: 95896433312 > 18:2330005592000 > scala.collection.immutable.HashMap$HashTrieMap > 19: 962005387200 > org.apache.spark.executor.ShuffleReadMetrics > 20:1133813628192 scala.collection.mutable.ListBuffer > 21: 72522891792 > 22:1
[jira] [Commented] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-22517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319496#comment-16319496 ] William Shen commented on SPARK-22517: -- I came across a similar stack trace when debugging our spark streaming application in 1.6. Similarly, I also do not have an easily reproducible snippet, and it occurs after the jobs ran fine for some time. [~asmaier], did you have any luck solving this issue? {code} 18/01/03 10:55:04 ERROR Executor: Exception in task 588.0 in stage 12.0 (TID 41821) java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:351) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:192) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83) at org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:76) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:380) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} > NullPointerException in ShuffleExternalSorter.spill() > - > > Key: SPARK-22517 > URL: https://issues.apache.org/jira/browse/SPARK-22517 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andreas Maier > > I see a NullPointerException during sorting with the following stacktrace: > {code} > 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID > 13497) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225335#comment-16225335 ] William Shen commented on SPARK-18886: -- [~imranr], I came across this issue because it is marked as duplicated by SPARK-11460. SPARK-11460 has affected versions of 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, and this issue has affected version of 2.1.0. Do we know if this is also an issue for 1.6.0? I am observing similar behavior for our application in 1.6.0. We are also able to achieve better utilization of the executors and performance through setting the wait to 0. Thank you > Delay scheduling should not delay some executors indefinitely if one task is > scheduled before delay timeout > --- > > Key: SPARK-18886 > URL: https://issues.apache.org/jira/browse/SPARK-18886 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > Delay scheduling can introduce an unbounded delay and underutilization of > cluster resources under the following circumstances: > 1. Tasks have locality preferences for a subset of available resources > 2. Tasks finish in less time than the delay scheduling. > Instead of having *one* delay to wait for resources with better locality, > spark waits indefinitely. > As an example, consider a cluster with 100 executors, and a taskset with 500 > tasks. Say all tasks have a preference for one executor, which is by itself > on one host. Given the default locality wait of 3s per level, we end up with > a 6s delay till we schedule on other hosts (process wait + host wait). > If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks > get scheduled on _only one_ executor. This means you're only using a 1% of > your cluster, and you get a ~100x slowdown. You'd actually be better off if > tasks took 7 seconds. > *WORKAROUNDS*: > (1) You can change the locality wait times so that it is shorter than the > task execution time. You need to take into account the sum of all wait times > to use all the resources on your cluster. For example, if you have resources > on different racks, this will include the sum of > "spark.locality.wait.process" + "spark.locality.wait.node" + > "spark.locality.wait.rack". Those each default to "3s". The simplest way to > be to set "spark.locality.wait.process" to your desired wait interval, and > set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". > For example, if your tasks take ~3 seconds on average, you might set > "spark.locality.wait.process" to "1s". *NOTE*: due to SPARK-18967, avoid > setting the {{spark.locality.wait=0}} -- instead, use > {{spark.locality.wait=1ms}}. > Note that this workaround isn't perfect --with less delay scheduling, you may > not get as good resource locality. After this issue is fixed, you'd most > likely want to undo these configuration changes. > (2) The worst case here will only happen if your tasks have extreme skew in > their locality preferences. Users may be able to modify their job to > controlling the distribution of the original input data. > (2a) A shuffle may end up with very skewed locality preferences, especially > if you do a repartition starting from a small number of partitions. (Shuffle > locality preference is assigned if any node has more than 20% of the shuffle > input data -- by chance, you may have one node just above that threshold, and > all other nodes just below it.) In this case, you can turn off locality > preference for shuffle data by setting > {{spark.shuffle.reduceLocality.enabled=false}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765558#comment-15765558 ] William Shen commented on SPARK-5632: - [~marmbrus], Found another weirdness with dot in field name, that might be of interest too. in 1.5.0, {code} import org.apache.spark.sql.functions._ val data = Seq(("test1","test2","test3")).toDF("col1", "col with . in it", "col3"); data.withColumn("col1", trim(data("col1"))) {code} fails with {code} org.apache.spark.sql.AnalysisException: cannot resolve 'col with . in it' given input columns col1, col with . in it, col3; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:126) at o
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752948#comment-15752948 ] William Shen commented on SPARK-5632: - Thanks [~marmbrus]. I see that the backtick works in 1.5.0 as well (with the limitation on distinct, which is fixed in SPARK-15230). Hopefully this will get sorted out together with SPARK-18084. Thanks again for your help! > not able to resolve dot('.') in field name > -- > > Key: SPARK-5632 > URL: https://issues.apache.org/jira/browse/SPARK-5632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 > Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 > Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 >Reporter: Lishu Liu >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.4.0 > > > My cassandra table task_trace has a field sm.result which contains dot in the > name. So SQL tried to look up sm instead of full name 'sm.result'. > Here is my code: > {code} > scala> import org.apache.spark.sql.cassandra.CassandraSQLContext > scala> val cc = new CassandraSQLContext(sc) > scala> val task_trace = cc.jsonFile("/task_trace.json") > scala> task_trace.registerTempTable("task_trace") > scala> cc.setKeyspace("cerberus_data_v4") > scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, > task_body.sm.result FROM task_trace WHERE task_id = > 'fff7304e-9984-4b45-b10c-0423a96745ce'") > res: org.apache.spark.sql.SchemaRDD = > SchemaRDD[57] at RDD at SchemaRDD.scala:108 > == Query Plan == > == Physical Plan == > java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, > cerberus_id, couponId, coupon_code, created, description, domain, expires, > message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, > sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, > validity > {code} > The full schema look like this: > {code} > scala> task_trace.printSchema() > root > \|-- received_datetime: long (nullable = true) > \|-- task_body: struct (nullable = true) > \|\|-- cerberus_batch_id: string (nullable = true) > \|\|-- cerberus_id: string (nullable = true) > \|\|-- couponId: integer (nullable = true) > \|\|-- coupon_code: string (nullable = true) > \|\|-- created: string (nullable = true) > \|\|-- description: string (nullable = true) > \|\|-- domain: string (nullable = true) > \|\|-- expires: string (nullable = true) > \|\|-- message_id: string (nullable = true) > \|\|-- neverShowAfter: string (nullable = true) > \|\|-- neverShowBefore: string (nullable = true) > \|\|-- offerTitle: string (nullable = true) > \|\|-- screenshots: array (nullable = true) > \|\|\|-- element: string (containsNull = false) > \|\|-- sm.result: struct (nullable = true) > \|\|\|-- cerberus_batch_id: string (nullable = true) > \|\|\|-- cerberus_id: string (nullable = true) > \|\|\|-- code: string (nullable = true) > \|\|\|-- couponId: integer (nullable = true) > \|\|\|-- created: string (nullable = true) > \|\|\|-- description: string (nullable = true) > \|\|\|-- domain: string (nullable = true) > \|\|\|-- expires: string (nullable = true) > \|\|\|-- message_id: string (nullable = true) > \|\|\|-- neverShowAfter: string (nullable = true) > \|\|\|-- neverShowBefore: string (nullable = true) > \|\|\|-- offerTitle: string (nullable = true) > \|\|\|-- result: struct (nullable = true) > \|\|\|\|-- post: struct (nullable = true) > \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) > \|\|\|\|\|\|-- ci: double (nullable = true) > \|\|\|\|\|\|-- value: boolean (nullable = true) > \|\|\|\|\|-- meta: struct (nullable = true) > \|\|\|\|\|\|-- None_tx_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- exceptions: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- no_input_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_mapped: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_transformed: array (nullable = true) > \|\|\|\|\|\|\|-- element: array (containsNull = > false) > \|\|\|\|\|
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752896#comment-15752896 ] William Shen commented on SPARK-5632: - Thank you [~marmbrus] for the speedy response! However I ran into the following issue in 1.5.0, which seems to be the same issue with resolving dot in field name. {noformat} scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> val data = Seq((1,2)).toDF("column_1", "column.with.dot") data: org.apache.spark.sql.DataFrame = [column_1: int, column.with.dot: int] scala> data.select("column.with.dot").collect org.apache.spark.sql.AnalysisException: cannot resolve 'column.with.dot' given input columns column_1, column.with.dot; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$cla
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752747#comment-15752747 ] William Shen commented on SPARK-5632: - Is this still targeted for 1.4.0 as indicated in JIRA (or was it released with 1.4.0)? The git commit is tagged with v2.1.0-rc3, can someone confirm if it has been moved to 2.1.0? Thanks! > not able to resolve dot('.') in field name > -- > > Key: SPARK-5632 > URL: https://issues.apache.org/jira/browse/SPARK-5632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 > Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 > Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 >Reporter: Lishu Liu >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.4.0 > > > My cassandra table task_trace has a field sm.result which contains dot in the > name. So SQL tried to look up sm instead of full name 'sm.result'. > Here is my code: > {code} > scala> import org.apache.spark.sql.cassandra.CassandraSQLContext > scala> val cc = new CassandraSQLContext(sc) > scala> val task_trace = cc.jsonFile("/task_trace.json") > scala> task_trace.registerTempTable("task_trace") > scala> cc.setKeyspace("cerberus_data_v4") > scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, > task_body.sm.result FROM task_trace WHERE task_id = > 'fff7304e-9984-4b45-b10c-0423a96745ce'") > res: org.apache.spark.sql.SchemaRDD = > SchemaRDD[57] at RDD at SchemaRDD.scala:108 > == Query Plan == > == Physical Plan == > java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, > cerberus_id, couponId, coupon_code, created, description, domain, expires, > message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, > sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, > validity > {code} > The full schema look like this: > {code} > scala> task_trace.printSchema() > root > \|-- received_datetime: long (nullable = true) > \|-- task_body: struct (nullable = true) > \|\|-- cerberus_batch_id: string (nullable = true) > \|\|-- cerberus_id: string (nullable = true) > \|\|-- couponId: integer (nullable = true) > \|\|-- coupon_code: string (nullable = true) > \|\|-- created: string (nullable = true) > \|\|-- description: string (nullable = true) > \|\|-- domain: string (nullable = true) > \|\|-- expires: string (nullable = true) > \|\|-- message_id: string (nullable = true) > \|\|-- neverShowAfter: string (nullable = true) > \|\|-- neverShowBefore: string (nullable = true) > \|\|-- offerTitle: string (nullable = true) > \|\|-- screenshots: array (nullable = true) > \|\|\|-- element: string (containsNull = false) > \|\|-- sm.result: struct (nullable = true) > \|\|\|-- cerberus_batch_id: string (nullable = true) > \|\|\|-- cerberus_id: string (nullable = true) > \|\|\|-- code: string (nullable = true) > \|\|\|-- couponId: integer (nullable = true) > \|\|\|-- created: string (nullable = true) > \|\|\|-- description: string (nullable = true) > \|\|\|-- domain: string (nullable = true) > \|\|\|-- expires: string (nullable = true) > \|\|\|-- message_id: string (nullable = true) > \|\|\|-- neverShowAfter: string (nullable = true) > \|\|\|-- neverShowBefore: string (nullable = true) > \|\|\|-- offerTitle: string (nullable = true) > \|\|\|-- result: struct (nullable = true) > \|\|\|\|-- post: struct (nullable = true) > \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) > \|\|\|\|\|\|-- ci: double (nullable = true) > \|\|\|\|\|\|-- value: boolean (nullable = true) > \|\|\|\|\|-- meta: struct (nullable = true) > \|\|\|\|\|\|-- None_tx_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- exceptions: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- no_input_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_mapped: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_transformed: array (nullable = true) > \|\|\|\|\|\|\|-- element: array (containsNull = > false) > \|\|\|\|\|\|\|\|-- element: string