Re: Create static Map Type column

2017-07-26 Thread ayan guha
never mind, found the solution. Spark 2.0+:

val df1 = df.withColumn("newcol", map(lit("field1"), lit("fieldName1")))

scala> df1.show()
+---+--------------------+
|  a|              newcol|
+---+--------------------+
|  1|Map(field1 -> fie...|
|  2|Map(field1 -> fie...|
|  3|Map(field1 -> fie...|
+---+--------------------+
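A self-contained sketch of the solution above, runnable as a Scala script (the DataFrame contents are illustrative; typedLit is an alternative that requires Spark 2.2+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, map, typedLit}

val spark = SparkSession.builder.master("local[*]").appName("static-map-col").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("a")

// Build the constant MapType column from literal key/value pairs:
// map() takes alternating key and value columns.
val df1 = df.withColumn("newcol", map(lit("field1"), lit("fieldName1")))
df1.show()

// Alternative (Spark 2.2+): typedLit accepts a Scala Map directly.
val df2 = df.withColumn("newcol", typedLit(Map("field1" -> "fieldName1")))
```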

running spark application compiled with 1.6 on spark 2.1 cluster

2017-07-26 Thread satishl
My Spark application is compiled with 1.6 spark-core and dependencies. When I try to run this app on a Spark 2.1 cluster, I run into ERROR ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/apache/spark/Logging. I was hoping that 2.x Spark is backward
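The org/apache/spark/Logging error usually means the jar was compiled against Spark 1.x: org.apache.spark.Logging was made private in Spark 2.0, so 1.6-built applications are not binary compatible with a 2.1 cluster. A minimal build.sbt sketch for rebuilding against the cluster's Spark version (project name and versions are illustrative):

```scala
// build.sbt -- rebuild the application against the Spark version on the cluster.
// org.apache.spark.Logging became private in Spark 2.0, so jars built against
// spark-core 1.6 throw NoClassDefFoundError on a 2.x cluster.
name := "my-spark-app"          // hypothetical project name

scalaVersion := "2.11.8"        // Spark 2.1 is built against Scala 2.11

libraryDependencies ++= Seq(
  // "provided": the cluster supplies these jars at runtime
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)
```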

Please unsubscribe

2017-07-26 Thread sowmya ramesh

Create static Map Type column

2017-07-26 Thread ayan guha
Hi, I want to add a static MapType column to a dataframe. How I am doing it now: val fieldList = spark.sparkContext.parallelize(Array(Row(Map("field1" -> "someField" val fieldListSchemaBase = new StructType() val f =

Can i move TFS and TSFT out of spark package

2017-07-26 Thread Jone Zhang
I have built the spark-assembly-1.6.0-hadoop2.5.1.jar. cat spark-assembly-1.6.0-hadoop2.5.1.jar/META-INF/services/org.apache.hadoop.fs.FileSystem ... org.apache.hadoop.hdfs.DistributedFileSystem org.apache.hadoop.hdfs.web.HftpFileSystem org.apache.hadoop.hdfs.web.HsftpFileSystem

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread ayan guha
Hi TD, I thought structured streaming provides a similar concept to dataframes, where it does not matter which language I use to invoke the APIs, with the exception of UDFs. So, when I think of supporting the foreach sink in Python, I think of it as just a wrapper API, and the data should remain in the JVM only.

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread Tathagata Das
We see that all the time. For example, in SQL, people can write their user-defined functions in Scala/Java and use them from SQL/Python/anywhere. That is the recommended way to get the best combination of performance and ease of use from non-JVM languages. On Wed, Jul 26, 2017 at 11:49 AM, Priyank

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread Priyank Shrivastava
Thanks TD. I am going to try the Python-Scala hybrid approach, using Scala only for the custom Redis sink and Python for the rest of the app. I understand it might not be as efficient as writing the app purely in Scala, but unfortunately I am constrained on Scala resources. Have you come across

DStream Spark 2.1.1 Streaming on EMR at scale - long running job fails after two hours

2017-07-26 Thread Mikhailau, Alex
Guys, I am trying hard to make a DStream API Spark streaming job work on EMR. I’ve succeeded to the point of running it for a few hours, with eventual failure, which is when I start seeing an out-of-memory exception via the “yarn logs” aggregate. I am doing a JSON map and extraction of some

[Spark streaming-Mesos-cluster mode] java.lang.RuntimeException: Stream jar not found

2017-07-26 Thread RCinna
Hello, I have a Spark streaming job using HDFS and checkpointing components, running well on a standalone Spark cluster with multiple nodes, in both client and cluster deploy mode. I would like to switch to the Mesos cluster manager and submit the job in cluster deploy mode. The first launch of the app is

Re: some Ideas on expressing Spark SQL using JSON

2017-07-26 Thread Sathish Kumaran Vairavelu
Agreed. For the same reason, DataFrames / Datasets are another DSL used in Spark. On Wed, Jul 26, 2017 at 1:00 AM Georg Heiler wrote: > Because Spark's DSL partially supports compile-time type safety. E.g. the > compiler will notify you that a SQL function was

[Spark SQL] [pyspark.sql]: Potential bug in toDF using nested structures

2017-07-26 Thread msachdev
Hi, I am trying to create a DF from a Python dictionary and encountered an issue where some of the nested fields are being returned as None (on collect). I have created a sample here with the output: https://gist.github.com/sachdevm/04c27ec91adbe2fdbe5969f4af723642. The sample contains two

Re: Need some help around a Spark Error

2017-07-26 Thread Alonso Isidoro Roman
I hope that helps: https://stackoverflow.com/questions/40623957/slave-lost-and-very-slow-join-in-spark Alonso Isidoro Roman about.me/alonso.isidoro.roman 2017-07-26

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread Tathagata Das
Hello Priyank, Writing something purely in Scala/Java would be the most efficient. Even if we expose Python APIs that allow writing custom sinks in pure Python, it won't be as efficient as the Scala/Java foreach, as the data would have to cross the JVM/PVM boundary, which has significant overheads. So
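The JVM-side sink discussed in this thread is implemented by extending ForeachWriter. A minimal skeleton for a custom Redis sink, with the actual Redis client calls left as placeholder comments (any JVM Redis client would slot in where indicated):

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// Skeleton of a custom sink: Structured Streaming calls open/process/close
// once per partition per epoch, entirely on the JVM side.
class RedisForeachWriter(host: String, port: Int) extends ForeachWriter[Row] {

  override def open(partitionId: Long, version: Long): Boolean = {
    // Open a connection to Redis here (client library call omitted).
    // Returning false tells Spark to skip this partition for this epoch.
    true
  }

  override def process(record: Row): Unit = {
    // Write one record to Redis here, e.g. client.set(key, value).
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Release the connection; errorOrNull is non-null if processing failed.
  }
}
```

Usage would look like df.writeStream.foreach(new RedisForeachWriter("localhost", 6379)).start(), with the Python side of the app producing the streaming DataFrame.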

Re: some Ideas on expressing Spark SQL using JSON

2017-07-26 Thread Georg Heiler
Because Spark's DSL partially supports compile-time type safety. E.g. the compiler will notify you that a SQL function was misspelled when using the DSL, as opposed to a plain SQL string, which is only parsed at runtime. Sathish Kumaran Vairavelu wrote on Tue, 25 Jul
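A small sketch of the difference described above: with the DSL, a misspelled function name is an unknown symbol the compiler rejects, while the same typo inside a SQL string compiles fine and only fails at runtime during parsing/analysis (the sample data is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

val spark = SparkSession.builder.master("local[*]").appName("dsl-vs-sql").getOrCreate()
import spark.implicits._

val df = Seq("a", "b").toDF("letter")

// DSL: writing `uppr($"letter")` instead of `upper($"letter")` would not
// compile -- the compiler flags the unknown symbol before the job ever runs.
val viaDsl = df.select(upper($"letter"))

// SQL string: the same typo ("SELECT uppr(letter) ...") compiles fine and
// only fails at runtime, when Spark parses and analyzes the query.
df.createOrReplaceTempView("letters")
val viaSql = spark.sql("SELECT upper(letter) AS u FROM letters")
```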