is there a way to create new column with timeuuid using raw spark sql ?
Hi All, Is there any way to create a new timeuuid column on an existing dataframe using raw SQL? You can assume that there is a timeuuid UDF function if that helps. Thanks!
Re: is there a way to create new column with timeuuid using raw spark sql ?
Sure, use withColumn()...

jg

> On Feb 1, 2018, at 05:50, kant kodali wrote:
>
> Hi All,
>
> Is there any way to create a new timeuuid column on an existing dataframe using raw SQL? You can assume that there is a timeuuid UDF function if that helps.
>
> Thanks!
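For example, a minimal sketch of the withColumn() route, assuming the timeuuid UDF from the original question exists (it is stubbed here with a random UUID purely as a placeholder; the sample data is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("withColumnSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder generator; swap in a real time-based UUID implementation.
val timeuuid = udf(() => java.util.UUID.randomUUID().toString)

// Stand-in for the existing dataframe from the question.
val dataframe = Seq(("a", 1), ("b", 2)).toDF("id", "value")

// Add the new column through the DataFrame API.
val withTimeuuid = dataframe.withColumn("timeuuid", timeuuid())
withTimeuuid.show(false)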
Re: [Structured Streaming] Reuse computation result
You can use the persist() or cache() operation on the DataFrame.

On Tue, Dec 26, 2017 at 4:02 PM Shu Li Zheng wrote:
> Hi all,
>
> I have a scenario like this:
>
> val df = dataframe.map().filter()
> // agg 1
> val query1 = df.sum.writeStream.start
> // agg 2
> val query2 = df.count.writeStream.start
>
> With Spark Streaming, we can apply persist() on an RDD to reuse the computation result: when we call persist() after filter(), the map().filter() operators only run once.
> With Structured Streaming, we can't apply persist() directly on the dataframe, so query1 and query2 will not reuse the result after filter(); map/filter run twice. Is there a way to solve this?
>
> Regards,
>
> Shu Li Zheng
Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame
Hi TD: Here is the updated code with explain and full stack trace. Please let me know what could be the issue and what to look for in the explain output.

Updated code:

import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession
      .builder
      .config("spark.sql.streaming.checkpointLocation", "./checkpointes")
      .appName("StreamingTest")
      .master("local[4]")
    val spark = sparkBuilder.getOrCreate()

    val schema = StructType(Array(
      StructField("id", StringType, false),
      StructField("visit", StringType, false)
    ))

    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select("*")
    dataframe2 = dataframe2.withColumn("cts", current_timestamp().cast("long"))
    dataframe2.explain(true)

    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}

Explain output:

== Parsed Logical Plan ==
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- AnalysisBarrier
      +- Project [id#0, visit#1]
         +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Analyzed Logical Plan ==
id: string, visit: string, cts: bigint
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- Project [id#0, visit#1]
   +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Optimized Logical Plan ==
Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Physical Plan ==
*(1) Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation FileSource[./data/], [id#0, visit#1]

Here is the exception:

18/02/01 07:39:52 ERROR MicroBatchExecution: Query [id = a0e573f0-e93b-48d9-989c-1aaa73539b58, runId = b5c618cb-30c7-4eff-8f09-ea1d064878ae] terminated with error
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:448)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:134)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
  at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
  at org.apache.spark.sql.execution.strea
Re: is there a way to create new column with timeuuid using raw spark sql ?
Hi,

Are you talking about df.withColumn()? If so, that's not what I meant. I meant creating a new column using raw SQL. In other words, say I don't have a dataframe; I only have the view name from df.createOrReplaceTempView("table"), so I can do things like "select * from table". In a similar fashion I want to see how I can create a new column using raw SQL. I am looking at this reference https://docs.databricks.com/spark/latest/spark-sql/index.html and I am not seeing a way.

Thanks!

On Thu, Feb 1, 2018 at 4:01 AM, Jean Georges Perrin wrote:
> Sure, use withColumn()...
>
> jg
>
> > On Feb 1, 2018, at 05:50, kant kodali wrote:
> >
> > Hi All,
> >
> > Is there any way to create a new timeuuid column on an existing dataframe using raw SQL? You can assume that there is a timeuuid UDF function if that helps.
> >
> > Thanks!
spark 2.2.1
I am setting up a Spark 2.2.1 cluster; however, when I bring up the master and workers (both on Spark 2.2.1) I get this error. I tried Spark 2.2.0 and get the same error. It works fine on Spark 2.0.2. Have you seen this before? Any idea what's wrong? I found this, but it's in a different situation: https://github.com/apache/spark/pull/19802

18/02/01 05:07:22 ERROR Utils: Exception encountered
java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local class incompatible: stream classdesc serialVersionUID = -1223633663228316618, local class serialVersionUID = 1835832137613908542
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:687)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
  at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:563)
  at org.apache.spark.deploy.master.WorkerInfo$$anonfun$readObject$1.apply$mcV$sp(WorkerInfo.scala:52)
  at org.apache.spark.deploy.master.WorkerInfo$$anonfun$readObject$1.apply(WorkerInfo.scala:51)
  at org.apache.spark.deploy.master.WorkerInfo$$anonfun$readObject$1.apply(WorkerInfo.scala:51)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
  at org.apache.spark.deploy.master.WorkerInfo.readObject(WorkerInfo.scala:51)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1158)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:433)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
  at org.apache.spark.deploy.master.FileSystemPersistenceEngine.org$apache$spark$deploy$master$FileSystemPersistenceEngine$$deserializeFromFile(FileSystemPersistenceEngine.scala:80)
  at org.apache.spark.deploy.master.FileSystemPersistenceEngine$$anonfun$read$1.apply(FileSystemPersistenceEngine.scala:56)
  at org.apache.spark.deploy.master.FileSystemPersistenceEngine$$anonfun$read$1.apply(FileSystemPersistenceEngine.scala:56)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.apache.spark.deploy.master.FileSystemPersistenceEngine.read(FileSystemPersistenceEngine.scala:56)
  at org.apache.spark.deploy.master.PersistenceEngine$$anonfun$readPersistedData$1.apply(PersistenceEngine.scala:87)
  at org.apache.spark.deploy.master.PersistenceEngine$$anonfun$readPersistedData$1.apply(PersistenceEngine.scala:86)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:316)

Regards,

Mihai Iacob
DSX Local - Security, IBM Analytics
Re: is there a way to create new column with timeuuid using raw spark sql ?
If you have the temp view name (table, for example), couldn't you do something like this?

val dfWithColumn = spark.sql("select *, timeuuid() as new_column from table")

Thanks,
Subhash

On Thu, Feb 1, 2018 at 11:18 AM, kant kodali wrote:
> Hi,
>
> Are you talking about df.withColumn()? If so, that's not what I meant. I meant creating a new column using raw SQL. In other words, say I don't have a dataframe; I only have the view name from df.createOrReplaceTempView("table"), so I can do things like "select * from table". In a similar fashion I want to see how I can create a new column using raw SQL. I am looking at this reference https://docs.databricks.com/spark/latest/spark-sql/index.html and I am not seeing a way.
>
> Thanks!
>
> On Thu, Feb 1, 2018 at 4:01 AM, Jean Georges Perrin wrote:
>
>> Sure, use withColumn()...
>>
>> jg
>>
>> > On Feb 1, 2018, at 05:50, kant kodali wrote:
>> >
>> > Hi All,
>> >
>> > Is there any way to create a new timeuuid column on an existing dataframe using raw SQL? You can assume that there is a timeuuid UDF function if that helps.
>> >
>> > Thanks!
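Putting the pieces together, a minimal end-to-end sketch of the raw-SQL route, assuming only that a timeuuid UDF can be registered (again stubbed with a random UUID as a placeholder); the view name and sample data are made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rawSqlSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Register the generator under a SQL-callable name (placeholder body again).
spark.udf.register("timeuuid", () => java.util.UUID.randomUUID().toString)

// Expose the existing dataframe as a temp view, then add the column in raw SQL.
Seq(("a", 1), ("b", 2)).toDF("id", "value").createOrReplaceTempView("my_table")
val dfWithColumn = spark.sql("select *, timeuuid() as new_column from my_table")
dfWithColumn.show(false)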
Spark JDBC bulk insert
Hey everyone, I have a use case where I will be processing data in Spark and then writing it back to MS SQL Server. Is it possible to use bulk insert functionality and/or batch the writes back to SQL? I am using the DataFrame API to write the rows: dataFrame.write.jdbc(...)

Thanks in advance for any ideas or assistance!

Subhash
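For what it's worth, a minimal sketch of batched JDBC writes; the connection URL, table name, and credentials are placeholders, and note that the "batchsize" option only batches ordinary INSERTs through the JDBC driver rather than using SQL Server's native bulk copy:

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: write a DataFrame to SQL Server in JDBC batches. Connection details
// are placeholders; "batchsize" sets how many rows go into each JDBC batch.
def writeToSqlServer(df: DataFrame): Unit = {
  val props = new Properties()
  props.setProperty("user", "spark_user")   // placeholder
  props.setProperty("password", "secret")   // placeholder
  props.setProperty("batchsize", "10000")   // rows per executeBatch() call

  df.write
    .mode(SaveMode.Append)
    .jdbc("jdbc:sqlserver://dbhost:1433;databaseName=mydb", "dbo.my_table", props)
}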
does Kinesis Connector for structured streaming auto-scales receivers if a cluster is using dynamic allocation and auto-scaling?
Does the Kinesis connector for Structured Streaming auto-scale receivers if a cluster is using dynamic allocation and auto-scaling?
Structured Streaming to Hive with Kafka 0.9
Hi, Could anyone please share an example of how to use Spark Structured Streaming with Kafka and write the data into Hive? The versions I have are Spark 2.1 on CDH 5.10 and Kafka 0.9.

Thanks,
Asmath

Sent from my iPhone
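For reference, a minimal sketch of the pattern, with two caveats stated as assumptions: the built-in Structured Streaming Kafka source (spark-sql-kafka-0-10) targets Kafka 0.10+ brokers, and Spark 2.1 has no Hive streaming sink, so the sketch writes parquet files under a path that an external Hive table can be defined over. Broker addresses, topic names, and paths are placeholders.

import org.apache.spark.sql.SparkSession

// Sketch only: assumes the spark-sql-kafka-0-10 connector is on the classpath
// and the brokers speak the Kafka 0.10+ protocol.
object KafkaToHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaToHiveSketch").enableHiveSupport().getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    // Kafka delivers key/value as binary; cast to string before further parsing.
    val events = raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Write parquet files under a warehouse path; an external Hive table
    // created over this path (with a matching schema) makes the data
    // queryable from Hive.
    val query = events.writeStream
      .format("parquet")
      .option("path", "/user/hive/warehouse/events_stream")
      .option("checkpointLocation", "/tmp/checkpoints/events_stream")
      .start()

    query.awaitTermination()
  }
}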