[jira] [Commented] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2022-03-03 Thread William Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501026#comment-17501026
 ] 

William Shen commented on SPARK-36163:
--

Any interest in backporting this to 3.1.x and 3.2.x?
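
For anyone evaluating a backport, here is a minimal, hypothetical sketch of how 
the option added by this ticket is used from the DataFrameReader (the URL, 
table, and provider name below are illustrative assumptions, not taken from the 
ticket):
{code}
// Assumed example only: pick an explicit JdbcConnectionProvider by name when
// more than one provider can handle the same JDBC url.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")   // assumed connection URL
  .option("dbtable", "public.orders")                      // assumed table
  .option("connectionProvider", "basic")                   // provider name; "basic" assumed here
  .load()
{code}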

> Propagate correct JDBC properties in JDBC connector provider and add 
> "connectionProvider" option
> 
>
> Key: SPARK-36163
> URL: https://issues.apache.org/jira/browse/SPARK-36163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Ivan Sadikov
>Assignee: Ivan
>Priority: Major
> Fix For: 3.3.0
>
>
> There are a couple of issues with JDBC connection providers. The first is a 
> bug introduced by 
> [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653],
>  where we pass all properties, including JDBC data source keys, to the 
> JDBC driver, which results in errors like {{java.sql.SQLException: 
> Unrecognized connection property 'url'}}.
> Connection properties are supposed to include only vendor properties; the url 
> config is a JDBC option and should be excluded.
> The fix is to replace {{jdbcOptions.asProperties.asScala.foreach}} with 
> {{jdbcOptions.asConnectionProperties.asScala.foreach}}, which is 
> java.sql.Driver friendly.
>  
> I also investigated the problem with multiple providers, and I think the 
> {{ConnectionProvider}} implementation has a couple of oversights. It is 
> missing two things:
>  * Any {{JdbcConnectionProvider}} should take precedence over 
> {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should be selected 
> only when no other provider is found that can handle the JDBC url.
>  * There is currently no way to select a specific provider, similar to how 
> you can select a JDBC driver. The use case is, for example, having connection 
> providers for two databases that handle the same URL but have slightly 
> different semantics, where you want to select one provider in some cases and 
> the other in others.
>  ** The first point could be dropped once the second one is addressed.
> You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
> exclude the providers you don't want, but I am not quite sure why it was done 
> that way; it is much simpler to let users enforce the provider they want.
> This ticket fixes it by adding a {{connectionProvider}} option to the JDBC 
> data source that allows users to select a particular provider when the 
> ambiguity arises.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6238) Support shuffle where individual blocks might be > 2G

2018-05-17 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479645#comment-16479645
 ] 

William Shen commented on SPARK-6238:
-

[~irashid], you marked this as a duplicate. Can you also indicate which ticket 
it duplicates? I see that it is marked as duplicated by SPARK-5928. Did you 
mean that it duplicates SPARK-5928, or does it duplicate SPARK-19659, per your 
comment? Thanks in advance for the clarification!

> Support shuffle where individual blocks might be > 2G
> -
>
> Key: SPARK-6238
> URL: https://issues.apache.org/jira/browse/SPARK-6238
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: jin xing
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2018-05-04 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464553#comment-16464553
 ] 

William Shen commented on SPARK-5928:
-

[~UZiVcbfPXaNrMtT], if you increase parallelism (use more partitions) over your 
large data set, would that reduce the size of each partition enough to work 
around this issue?
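
For illustration only, a rough sketch of that idea (names and sizes below are 
made up, not from this ticket): key the data by many distinct values and use 
more reduce partitions so each shuffle block stays well under 2 GB.
{code}
// Hypothetical spark-shell sketch, assuming `sc` is available as in the
// reproduction below: spread the same volume of data over many partitions.
val rdd = sc.parallelize(1 to 1e6.toInt, 200).map { i =>
  val arr = new Array[Byte](3e3.toInt)
  scala.util.Random.nextBytes(arr)
  (i % 2000, arr)              // many distinct keys instead of a single key
}
rdd.groupByKey(2000).count()   // 2000 reduce partitions => smaller shuffle blocks
{code}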

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Priority: Major
> Attachments: image-2018-03-29-11-52-32-075.png
>
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedK

[jira] [Commented] (SPARK-17888) Memory leak in streaming driver when use SparkSQL in Streaming

2018-01-11 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323260#comment-16323260
 ] 

William Shen commented on SPARK-17888:
--

[~FireThief], did you ever find out what was wrong? We are seeing driver heap 
growth in 1.6 streaming, with a very similar heap histogram showing lots of 
Long and LongSQLMetricValue instances:

{code}
 num     #instances          #bytes  class name
------------------------------------------------------------------------------
   1:      91155998      4375487904  scala.collection.immutable.HashMap$HashMap1
   2:      64295181      4028440384  [C
   3:      96126890      2307045360  java.lang.Long
   4:      89875107      2157002568  org.apache.spark.sql.execution.metric.LongSQLMetricValue
   5:      63941246      2046119872  java.lang.String
   6:      30106557      1685967192  org.apache.spark.scheduler.AccumulableInfo
   7:      49762250      1592392000  scala.collection.immutable.$colon$colon
   8:      25223481      1529732184  [Lscala.collection.immutable.HashMap;
   9:       2219699      1016786672  [B
  10:      25222833       807130656  scala.collection.immutable.HashMap$HashTrieMap
  11:       5624636       794226288  [Ljava.lang.Object;
  12:      31443622       754646928  scala.Some
  13:      10954338       525808224  java.util.Hashtable$Entry
  14:      10915201       523929648  java.util.concurrent.ConcurrentHashMap$Node
  15:        292754       461068288  [I
  16:         26986       440039088  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  17:         35656       174934384  [Ljava.util.Hashtable$Entry;
  18:       3227318       103274176  org.apache.spark.sql.catalyst.trees.Origin
  19:       2669050        85409600  scala.collection.mutable.ArrayBuffer
{code}

> Memory leak in streaming driver when use SparkSQL in Streaming
> --
>
> Key: SPARK-17888
> URL: https://issues.apache.org/jira/browse/SPARK-17888
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.2
> Environment: scala 2.10.4
> java 1.7.0_71
>Reporter: weilin.chen
>  Labels: leak, memory
>
> Hi,
> I have a small Spark 1.5 program that receives data from a publisher via 
> Spark Streaming and processes the received data with Spark SQL. Over time I 
> noticed a memory leak in the driver, so I upgraded to Spark 1.6.2, but the 
> situation did not change.
> Here is the code:
> {code}
> val lines = ssc.receiverStream(new RReceiver("10.0.200.15", 6380, "subresult"))
> val jsonf = lines.map(JSON.parseFull(_))
>   .map(_.get.asInstanceOf[scala.collection.immutable.Map[String, Any]])
> val logs = jsonf.map(data => LogStashV1(data("message").toString, 
>   data("path").toString, data("host").toString, 
>   data("lineno").toString.toDouble, data("timestamp").toString))
> logs.foreachRDD( rdd => {
>   import sqc.implicits._
>   rdd.toDF.registerTempTable("logstash")
>   val sqlreport0 = sqc.sql("SELECT message, COUNT(message) AS host_c, SUM(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY message ORDER BY host_c DESC LIMIT 100")
>   sqlreport0.map(t => AlertMsg(t(0).toString, t(1).toString.toInt, 
>     t(2).toString.toDouble)).collect().foreach(println)
> })
> {code}
> jmap information:
>  {quote}
>  num #instances #bytes  class name
> --
>1: 34819   72711952  [B
>2:   2297557   66010656  [C
>3:   2296294   55111056  java.lang.String
>4:   1063491   42539640  org.apache.spark.scheduler.AccumulableInfo
>5:   1251001   40032032  
> scala.collection.immutable.HashMap$HashMap1
>6:   1394364   33464736  java.lang.Long
>7:   1102516   26460384  scala.collection.immutable.$colon$colon
>8:   1058202   25396848  
> org.apache.spark.sql.execution.metric.LongSQLMetricValue
>9:   1266499   20263984  scala.Some
>   10:124052   15889104  
>   11:124052   15269568  
>   12: 11350   12082432  
>   13: 11350   11692880  
>   14: 96682   10828384  org.apache.spark.executor.TaskMetrics
>   15:2334819505896  [Lscala.collection.immutable.HashMap;
>   16: 966826961104  org.apache.spark.scheduler.TaskInfo
>   17:  95896433312  
>   18:2330005592000  
> scala.collection.immutable.HashMap$HashTrieMap
>   19: 962005387200  
> org.apache.spark.executor.ShuffleReadMetrics
>   20:1133813628192  scala.collection.mutable.ListBuffer
>   21:  72522891792  
>   22:1

[jira] [Commented] (SPARK-22517) NullPointerException in ShuffleExternalSorter.spill()

2018-01-09 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319496#comment-16319496
 ] 

William Shen commented on SPARK-22517:
--

I came across a similar stack trace when debugging our Spark Streaming 
application in 1.6.
Like you, I do not have an easily reproducible snippet, and the error occurs 
only after the jobs have run fine for some time. [~asmaier], did you have any 
luck solving this issue?

{code}
18/01/03 10:55:04 ERROR Executor: Exception in task 588.0 in stage 12.0 (TID 
41821)
java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:351)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:192)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
at 
org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:76)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:380)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}

> NullPointerException in ShuffleExternalSorter.spill()
> -
>
> Key: SPARK-22517
> URL: https://issues.apache.org/jira/browse/SPARK-22517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>
> I see a NullPointerException during sorting with the following stacktrace:
> {code}
> 17/11/13 15:02:56 ERROR Executor: Exception in task 138.0 in stage 9.0 (TID 
> 13497)
> java.lang.NullPointerException
> at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:193)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254)
> at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
> at 
> org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:256)
> at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary(ShuffleExternalSorter.java:328)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:379)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)

[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2017-10-30 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225335#comment-16225335
 ] 

William Shen commented on SPARK-18886:
--

[~imranr],

I came across this issue because it is marked as duplicated by SPARK-11460. 
SPARK-11460 has affected versions 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 
1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, and 1.5.1, while this issue 
has an affected version of 2.1.0. Do we know whether this is also an issue for 
1.6.0? I am observing similar behavior for our application in 1.6.0, and we are 
able to achieve better executor utilization and performance by setting the 
locality wait to 0.
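
For reference, a minimal, hypothetical sketch of the kind of configuration 
change described in the workaround below (values are illustrative; per 
SPARK-18967, noted in the ticket, prefer "1ms" over "0" for the global 
spark.locality.wait):
{code}
import org.apache.spark.SparkConf

// Assumed example: shorten the per-level locality waits as described in the
// workaround below, keeping a small process-local wait and disabling the rest.
val conf = new SparkConf()
  .set("spark.locality.wait.process", "1s")  // illustrative; tune to typical task duration
  .set("spark.locality.wait.node", "0")
  .set("spark.locality.wait.rack", "0")
{code}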

Thank you

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".  *NOTE*: due to SPARK-18967, avoid 
> setting the {{spark.locality.wait=0}} -- instead, use 
> {{spark.locality.wait=1ms}}.
> Note that this workaround isn't perfect --with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-20 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765558#comment-15765558
 ] 

William Shen commented on SPARK-5632:
-

[~marmbrus],
Found another oddity with a dot in a field name that might be of interest too.
In 1.5.0,
{code}
import org.apache.spark.sql.functions._
val data = Seq(("test1", "test2", "test3")).toDF("col1", "col with . in it", "col3")
data.withColumn("col1", trim(data("col1")))
{code}
fails with
{code}
org.apache.spark.sql.AnalysisException: cannot resolve 'col with . in it' given 
input columns col1, col with . in it, col3;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:126)
at 
o

[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752948#comment-15752948
 ] 

William Shen commented on SPARK-5632:
-

Thanks [~marmbrus]. 
I see that backtick escaping works in 1.5.0 as well (with the limitation on 
distinct, which is fixed in SPARK-15230). Hopefully this will get sorted out 
together with SPARK-18084. Thanks again for your help!
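
For anyone else hitting this, a minimal sketch of the backtick escaping 
discussed above, reusing the toy DataFrame from elsewhere in this thread:
{code}
// Assumed spark-shell example (implicits pre-imported as in the shell):
// wrap a dotted column name in backticks so the analyzer treats it as a
// single column rather than a struct field access.
val data = Seq((1, 2)).toDF("column_1", "column.with.dot")
data.select("`column.with.dot`").collect()
{code}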

> not able to resolve dot('.') in field name
> --
>
> Key: SPARK-5632
> URL: https://issues.apache.org/jira/browse/SPARK-5632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
> Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
> Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
>Reporter: Lishu Liu
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.4.0
>
>
> My Cassandra table task_trace has a field sm.result, which contains a dot in 
> the name, so SQL tried to look up sm instead of the full name 'sm.result'. 
> Here is my code: 
> {code}
> scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
> scala> val cc = new CassandraSQLContext(sc)
> scala> val task_trace = cc.jsonFile("/task_trace.json")
> scala> task_trace.registerTempTable("task_trace")
> scala> cc.setKeyspace("cerberus_data_v4")
> scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
> task_body.sm.result FROM task_trace WHERE task_id = 
> 'fff7304e-9984-4b45-b10c-0423a96745ce'")
> res: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[57] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> == Physical Plan ==
> java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
> cerberus_id, couponId, coupon_code, created, description, domain, expires, 
> message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
> sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
> validity
> {code}
> The full schema looks like this:
> {code}
> scala> task_trace.printSchema()
> root
>  \|-- received_datetime: long (nullable = true)
>  \|-- task_body: struct (nullable = true)
>  \|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|-- cerberus_id: string (nullable = true)
>  \|\|-- couponId: integer (nullable = true)
>  \|\|-- coupon_code: string (nullable = true)
>  \|\|-- created: string (nullable = true)
>  \|\|-- description: string (nullable = true)
>  \|\|-- domain: string (nullable = true)
>  \|\|-- expires: string (nullable = true)
>  \|\|-- message_id: string (nullable = true)
>  \|\|-- neverShowAfter: string (nullable = true)
>  \|\|-- neverShowBefore: string (nullable = true)
>  \|\|-- offerTitle: string (nullable = true)
>  \|\|-- screenshots: array (nullable = true)
>  \|\|\|-- element: string (containsNull = false)
>  \|\|-- sm.result: struct (nullable = true)
>  \|\|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|\|-- cerberus_id: string (nullable = true)
>  \|\|\|-- code: string (nullable = true)
>  \|\|\|-- couponId: integer (nullable = true)
>  \|\|\|-- created: string (nullable = true)
>  \|\|\|-- description: string (nullable = true)
>  \|\|\|-- domain: string (nullable = true)
>  \|\|\|-- expires: string (nullable = true)
>  \|\|\|-- message_id: string (nullable = true)
>  \|\|\|-- neverShowAfter: string (nullable = true)
>  \|\|\|-- neverShowBefore: string (nullable = true)
>  \|\|\|-- offerTitle: string (nullable = true)
>  \|\|\|-- result: struct (nullable = true)
>  \|\|\|\|-- post: struct (nullable = true)
>  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
>  \|\|\|\|\|\|-- ci: double (nullable = true)
>  \|\|\|\|\|\|-- value: boolean (nullable = true)
>  \|\|\|\|\|-- meta: struct (nullable = true)
>  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- exceptions: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_transformed: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: array (containsNull = 
> false)
>  \|\|\|\|\|   

[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752896#comment-15752896
 ] 

William Shen commented on SPARK-5632:
-

Thank you [~marmbrus] for the speedy response!
However, I ran into the following issue in 1.5.0, which seems to be the same 
problem with resolving a dot in a field name.
{noformat}
scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val data = Seq((1,2)).toDF("column_1", "column.with.dot")
data: org.apache.spark.sql.DataFrame = [column_1: int, column.with.dot: int]

scala> data.select("column.with.dot").collect
org.apache.spark.sql.AnalysisException: cannot resolve 'column.with.dot' given 
input columns column_1, column.with.dot;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$cla

[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752747#comment-15752747
 ] 

William Shen commented on SPARK-5632:
-

Is this still targeted for 1.4.0 as indicated in JIRA (or was it released with 
1.4.0)? 
The git commit is tagged with v2.1.0-rc3; can someone confirm whether it has 
been moved to 2.1.0? 
Thanks!

> not able to resolve dot('.') in field name
> --
>
> Key: SPARK-5632
> URL: https://issues.apache.org/jira/browse/SPARK-5632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
> Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
> Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
>Reporter: Lishu Liu
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.4.0
>
>
> My Cassandra table task_trace has a field sm.result, which contains a dot in 
> the name, so SQL tried to look up sm instead of the full name 'sm.result'. 
> Here is my code: 
> {code}
> scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
> scala> val cc = new CassandraSQLContext(sc)
> scala> val task_trace = cc.jsonFile("/task_trace.json")
> scala> task_trace.registerTempTable("task_trace")
> scala> cc.setKeyspace("cerberus_data_v4")
> scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
> task_body.sm.result FROM task_trace WHERE task_id = 
> 'fff7304e-9984-4b45-b10c-0423a96745ce'")
> res: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[57] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> == Physical Plan ==
> java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
> cerberus_id, couponId, coupon_code, created, description, domain, expires, 
> message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
> sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
> validity
> {code}
> The full schema looks like this:
> {code}
> scala> task_trace.printSchema()
> root
>  \|-- received_datetime: long (nullable = true)
>  \|-- task_body: struct (nullable = true)
>  \|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|-- cerberus_id: string (nullable = true)
>  \|\|-- couponId: integer (nullable = true)
>  \|\|-- coupon_code: string (nullable = true)
>  \|\|-- created: string (nullable = true)
>  \|\|-- description: string (nullable = true)
>  \|\|-- domain: string (nullable = true)
>  \|\|-- expires: string (nullable = true)
>  \|\|-- message_id: string (nullable = true)
>  \|\|-- neverShowAfter: string (nullable = true)
>  \|\|-- neverShowBefore: string (nullable = true)
>  \|\|-- offerTitle: string (nullable = true)
>  \|\|-- screenshots: array (nullable = true)
>  \|\|\|-- element: string (containsNull = false)
>  \|\|-- sm.result: struct (nullable = true)
>  \|\|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|\|-- cerberus_id: string (nullable = true)
>  \|\|\|-- code: string (nullable = true)
>  \|\|\|-- couponId: integer (nullable = true)
>  \|\|\|-- created: string (nullable = true)
>  \|\|\|-- description: string (nullable = true)
>  \|\|\|-- domain: string (nullable = true)
>  \|\|\|-- expires: string (nullable = true)
>  \|\|\|-- message_id: string (nullable = true)
>  \|\|\|-- neverShowAfter: string (nullable = true)
>  \|\|\|-- neverShowBefore: string (nullable = true)
>  \|\|\|-- offerTitle: string (nullable = true)
>  \|\|\|-- result: struct (nullable = true)
>  \|\|\|\|-- post: struct (nullable = true)
>  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
>  \|\|\|\|\|\|-- ci: double (nullable = true)
>  \|\|\|\|\|\|-- value: boolean (nullable = true)
>  \|\|\|\|\|-- meta: struct (nullable = true)
>  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- exceptions: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_transformed: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: array (containsNull = 
> false)
>  \|\|\|\|\|\|\|\|-- element: string