Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-07 Thread Aseem Bansal
I am using the following to broadcast and it explicitly requires a ClassTag: sparkSession.sparkContext().broadcast On Mon, Aug 8, 2016 at 12:09 PM, Holden Karau wrote: > ClassTag is a Scala concept (see http://docs.scala-lang.org/overviews/reflection/typetags-manifests.html) - although this shoul

Re: hdfs persist rollbacks when spark job is killed

2016-08-07 Thread Chanh Le
Thank you Gourav, > Moving files from _temp folders to main folders is an additional overhead > when you are working on S3 as there is no move operation. Good catch. Is it the same for GCS? > I generally have a set of Data Quality checks after each job to ascertain > whether everything went fine

Re: hdfs persist rollbacks when spark job is killed

2016-08-07 Thread Gourav Sengupta
But you have to be careful: that is the default setting. There is a way you can override it so that the write to the _temp folder does not take place and you write directly to the main folder. Moving files from _temp folders to main folders is an additional overhead when you are working on S3 as th
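
A minimal sketch of the kind of setting being discussed (an assumption to verify against your Spark/Hadoop versions, not necessarily the exact knob Gourav means): FileOutputCommitter algorithm version 2 commits task output directly into the destination directory, skipping the job-level move out of _temporary that is expensive on S3:

    import org.apache.spark.sql.SparkSession

    // Algorithm version 2 of the default FileOutputCommitter moves task output
    // into the final directory at task commit time, avoiding the job-level
    // rename from _temporary that is costly on object stores.
    val spark = SparkSession.builder()
      .appName("direct-commit-sketch")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()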

Re: hdfs persist rollbacks when spark job is killed

2016-08-07 Thread Chanh Le
It’s out of the box in Spark. When you write data into HDFS or any storage, it only creates the new parquet folder properly if your Spark job was successful; otherwise there is only a _temp folder inside to mark that it is still not successful (Spark was killed), or nothing inside (the Spark job failed). > On Aug 8, 2016,
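
A minimal Scala sketch of how a downstream job can rely on this behaviour (the output path is hypothetical): only a fully committed write leaves a _SUCCESS marker in the output folder:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("success-marker-check").getOrCreate()
    val outputDir = "hdfs:///data/output/parquet"   // hypothetical path
    // _SUCCESS is written only when the whole job committed; a killed job
    // leaves _temporary (or nothing) behind instead.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val committed = fs.exists(new Path(outputDir, "_SUCCESS"))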

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-07 Thread Holden Karau
ClassTag is a Scala concept (see http://docs.scala-lang.org/overviews/reflection/typetags-manifests.html) - although this should not be explicitly required - looking at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext we can see that in Scala the ClassTag is
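
A minimal Scala sketch of what Holden describes (the lookup map is a made-up example): the compiler supplies the ClassTag implicitly, so only the value is passed; from Java, a tag can be built with scala.reflect.ClassTag$.MODULE$.apply(SomeClass.class), or the JavaSparkContext wrapper can be used to avoid the issue entirely.

    import org.apache.spark.sql.SparkSession

    // In Scala the ClassTag evidence is resolved implicitly by the compiler.
    val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
    val lookup = Map("a" -> 1, "b" -> 2)            // hypothetical broadcast value
    val bc = spark.sparkContext.broadcast(lookup)   // no explicit ClassTag needed
    println(bc.value("a"))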

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Aseem Bansal
Hi Ewan The .as function takes a single encoder, or a single string, or a single Symbol. I have more than 10 columns, so I cannot use the tuple functions. Passing them using brackets does not work. On Mon, Aug 8, 2016 at 11:26 AM, Ewan Leith wrote: > Looking at the encoders api documentation at > >
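
A minimal Scala sketch of one usual workaround for many columns (class, column, and DataFrame names are hypothetical): hand .as a single bean encoder, which mirrors what the Java API would do with Encoders.bean:

    import org.apache.spark.sql.Encoders
    import scala.beans.BeanProperty

    // A bean with one property per selected column; Encoders.bean builds a
    // single encoder for the whole row, so no tuple encoder is needed.
    class ProductRecord {
      @BeanProperty var name: String = _
      @BeanProperty var category: String = _
      // ... one @BeanProperty var per remaining selected column
    }
    // df is an existing DataFrame (assumption); column names must match the bean.
    val typed = df.select("name", "category").as(Encoders.bean(classOf[ProductRecord]))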

hdfs persist rollbacks when spark job is killed

2016-08-07 Thread Sumit Khanna
Hello, the use case is as follows: say I am inserting 200K rows as dataframe.write.format("parquet") etc etc (like a basic write-to-HDFS command), but say for some rhyme or reason my job got killed mid-run, meaning let's say I was only able to insert 100K rows when

Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-07 Thread Aseem Bansal
Earlier for broadcasting we just needed to use sparkcontext.broadcast(objectToBroadcast). But now it is sparkcontext.broadcast(objectToBroadcast, classTag). What is classTag here?

Is Spark right for my use case?

2016-08-07 Thread danellis
Spark n00b here. Working with online retailers, I start with a list of their products in Cassandra (with prices, stock levels, descriptions, etc) and then receive an HTTP request every time one of them changes. For each change, I update the product in Cassandra and store the change with the old an

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Aneela Saleem
Thanks Saisai and Ted, I have already configured HBase security and it's working fine. I have also done kinit before submitting the job. Following is the code I'm trying to use: System.setProperty("java.security.krb5.conf", "/etc/krb5.conf"); System.setProperty("java.security.auth.login.config", "/
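
A minimal sketch of an explicit keytab login (principal and keytab path taken from the spark-submit command later in this thread; whether it resolves this particular HBase error is an assumption):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    // Log in from a keytab instead of relying on the kinit ticket cache.
    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab(
      "spark/hadoop-master@platalyticsrealm",
      "/etc/hadoop/conf/spark.keytab")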

map vs mapPartitions

2016-08-07 Thread rtijoriwala
Hi All, I am a newbie to Spark and want to know if there is any performance difference between map and mapPartitions if I am doing strictly a per-item transformation? For example: reversedWords = words.map(w => w.reverse()); vs. reversedWords = words.mapPartitions(pwordsIterator => { List pWordLi
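
A minimal Scala sketch of the two variants (assuming an RDD[String] named words): for a pure per-element transformation they produce the same result, and mapPartitions mainly pays off when there is per-partition setup cost such as opening a connection:

    // Plain per-element map.
    val reversed = words.map(_.reverse)

    // mapPartitions: the function runs once per partition and returns an
    // iterator lazily, so nothing extra is materialised for pure transformations.
    val reversedViaPartitions = words.mapPartitions(iter => iter.map(_.reverse))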

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Ewan Leith
Looking at the Encoders API documentation at http://spark.apache.org/docs/latest/api/java/ == Java == Encoders are specified by calling static methods on Encoders. List<String> data = Arrays.asList("abc", "abc", "xyz"); Da

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Aseem Bansal
Hi All Has anyone done this with the Java API? On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal wrote: > I need to use a few columns out of a CSV. But as there is no option to read > only a few columns out of a CSV, > 1. I am reading the whole CSV using SparkSession.csv() > 2. selecting a few of the columns us
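
A minimal Scala sketch of the approach described (file path, column and class names are hypothetical): read the whole CSV, select the needed columns, then attach a typed view with .as on a matching case class:

    import org.apache.spark.sql.SparkSession

    case class Selected(name: String, age: String)   // CSV columns load as strings unless the schema is inferred

    val spark = SparkSession.builder().appName("csv-columns-sketch").getOrCreate()
    import spark.implicits._

    val ds = spark.read
      .option("header", "true")
      .csv("/path/to/input.csv")
      .select("name", "age")
      .as[Selected]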

Re: Any exceptions during an action doesn't fail the Spark streaming batch in yarn-client mode

2016-08-07 Thread ayan guha
Is it a python app? On Mon, Aug 8, 2016 at 2:44 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hello, > > I am seeing multiple exceptions shown in logs during an action, but none > of them fails the Spark streaming batch in yarn-client mode, whereas the > same exception is thrown i

Any exceptions during an action doesn't fail the Spark streaming batch in yarn-client mode

2016-08-07 Thread Hemalatha A
Hello, I am seeing multiple exceptions shown in logs during an action, but none of them fails the Spark streaming batch in yarn-client mode, whereas the same exception is thrown in yarn-cluster mode and the application ends. I am trying to save a DataFrame to Cassandra, which results in error du

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Ted Yu
The link in Jerry's response was quite old. Please see: http://hbase.apache.org/book.html#security Thanks On Sun, Aug 7, 2016 at 6:55 PM, Saisai Shao wrote: > 1. Standalone mode doesn't support accessing kerberized Hadoop, simply > because it lacks the mechanism to distribute delegation tokens

Re: silence the spark debug logs

2016-08-07 Thread Sachin Janani
Hi, You can switch off the logs by setting the log level to OFF as follows: import org.apache.log4j.Logger; import org.apache.log4j.Level; Logger.getLogger("org").setLevel(Level.OFF); Logger.getLogger("akka").setLevel(Level.OFF) Regards, Sachin J On Mon, Aug 8, 2016 at 9:39 AM, Sumit Khanna wrote: >
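
As a small aside (a sketch, assuming an existing SparkContext named sc), the log level can also be changed at runtime without touching log4j directly:

    // Valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF.
    sc.setLogLevel("WARN")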

silence the spark debug logs

2016-08-07 Thread Sumit Khanna
Hello, I don't want to print all the Spark logs, but only a few, e.g. just the execution plans etc. How do I silence the Spark debug output? Thanks, Sumit

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
I tried with condition expression also but it didn't work :( On Aug 8, 2016 11:13 AM, "Chanh Le" wrote: > You should use *df.where(conditionExpr)* which is more convenient to > express some simple term in SQL. > > > /** > * Filters rows using the given SQL expression. > * {{{ > * peopleDf.

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Chanh Le
You should use df.where(conditionExpr) which is more convenient to express some simple term in SQL. /** * Filters rows using the given SQL expression. * {{{ * peopleDf.where("age > 15") * }}} * @group dfops * @since 1.5.0 */ def where(conditionExpr: String): DataFrame = { filter(Colu
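
A minimal usage sketch of that SQL-expression form (the DataFrame df and the column names are hypothetical):

    // or/and can be written directly inside the SQL expression string.
    val result = df.where("transactiontype = 'DEB' or amount > 100")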

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-07 Thread Muthu Jayakumar
Hello Hao Ren, Doesn't the code... val add = udf { (a: Int) => a + notSer.value } mean a UDF function that is Int => Int? Thanks, Muthu On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren wrote: > I am playing with spark 2.0 > What I tried to test is: > > Create a UDF in which there is a non serial

Random forest binary classification H2O difference Spark

2016-08-07 Thread Javier Rey
Hi everybody. I have executed RF on H2O and I didn't have troubles with null values, but in contrast, in Spark, using DataFrames and the ML library, I obtain this error. I know my DataFrame contains nulls, but I understood that Random Forest supports null values: "Values to assemble cannot be null" Any advice,
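
A minimal Scala sketch of the usual workaround (column and DataFrame names are hypothetical): the quoted error comes from VectorAssembler, which rejects nulls, so nulls are typically dropped or imputed before assembling the feature vector:

    import org.apache.spark.ml.feature.VectorAssembler

    val featureCols = Seq("col1", "col2")                 // hypothetical feature columns
    val cleaned = df.na.fill(0.0, featureCols)            // or df.na.drop(featureCols)

    val assembler = new VectorAssembler()
      .setInputCols(featureCols.toArray)
      .setOutputCol("features")
    val assembled = assembler.transform(cleaned)          // safe to feed to the Random Forest estimator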

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Saisai Shao
1. Standalone mode doesn't support accessing kerberized Hadoop, simply because it lacks the mechanism to distribute delegation tokens via the cluster manager. 2. For the HBase token fetching failure, I think you have to do kinit to generate a tgt before starting the Spark application ( http://hbase.apache.org/0

Using Kryo for DataFrames (Dataset)?

2016-08-07 Thread Jestin Ma
When using DataFrames (Dataset), there's no option for an Encoder. Does that mean DataFrames (since they build on top of RDDs) use Java serialization? Does using Kryo make sense as an optimization here? If not, what's the difference between Java/Kryo serialization, Tungsten, and Encoders? Thank
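
A minimal Scala sketch for context (the Event class is hypothetical): untyped DataFrame rows go through Spark's built-in Catalyst/Tungsten row encoders rather than plain Java serialization, and an explicit Kryo encoder is mainly useful for arbitrary classes in a typed Dataset:

    import org.apache.spark.sql.{Encoder, Encoders}

    class Event(val id: Long, val payload: String)       // neither a case class nor a bean
    // Serialises each Event as a single Kryo-encoded binary column.
    implicit val eventEncoder: Encoder[Event] = Encoders.kryo[Event]
    // spark.createDataset(Seq(new Event(1L, "a")))      // would pick up the implicit Kryo encoder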

[SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-07 Thread Hao Ren
I am playing with Spark 2.0. What I tried to test is: Create a UDF in which there is a non-serializable object. What I expected is that when this UDF is called while materializing the DataFrame where the UDF is used in "select", a Task not serializable exception should be thrown. It depends also on which
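
A minimal Scala sketch of the scenario being tested (class and DataFrame names are hypothetical; notSer mirrors the variable quoted in the replies): a UDF closing over an instance of a class that is not Serializable, where the exception only surfaces if the closure actually gets serialised for the plan that uses the UDF:

    import org.apache.spark.sql.functions.{col, udf}

    class NotSerializableOffset(val value: Int)           // deliberately not Serializable
    val notSer = new NotSerializableOffset(10)

    val add = udf { (a: Int) => a + notSer.value }        // the closure captures notSer
    // df.select(add(col("id"))).show()                   // may fail with "Task not serializable"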

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Aneela Saleem
Thanks Wojciech and Jacek! I tried Spark on YARN with a kerberized cluster and it works fine now. But now when I try to access HBase through Spark I get the following error: 2016-08-07 20:43:57,617 WARN [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl: Exception encountered w

Accessing HBase through Spark with Security enabled

2016-08-07 Thread Aneela Saleem
Hi all, I'm trying to run a Spark job that accesses HBase with security enabled. When I run the following command: */usr/local/spark-2/bin/spark-submit --keytab /etc/hadoop/conf/spark.keytab --principal spark/hadoop-master@platalyticsrealm --class com.platalytics.example.spark.App --master yarn

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Mich Talebzadeh
although the logic should be col1 <> a && col1 <> b to exclude both, like df.filter('transactiontype > " ").filter(not('transactiontype ==="DEB") && not('transactiontype ==="BGC")).select('transactiontype).distinct.collect.foreach(println) HTH Dr Mich Talebzadeh LinkedIn * https://www.lin
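
A minimal Scala sketch of both combinations (the "amount" column and the DataFrame df are hypothetical): && of two notEqual conditions excludes both values, while || keeps a row when either condition alone holds:

    import org.apache.spark.sql.functions.col

    // Exclude rows whose transactiontype is either "DEB" or "BGC".
    val excluded = df.filter(
      col("transactiontype").notEqual("DEB") && col("transactiontype").notEqual("BGC"))

    // || keeps rows matching either condition; parenthesising each comparison
    // avoids any surprises from operator precedence.
    val either = df.filter((col("transactiontype") === "DEB") || (col("amount") > 100))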

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Mich Talebzadeh
try something similar to this: df.filter(not('transactiontype ==="DEB") || not('transactiontype ==="CRE")) HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread janardhan shetty
Can you try 'or' keyword instead? On Aug 7, 2016 7:43 AM, "Divya Gehlot" wrote: > Hi, > I have use case where I need to use or[||] operator in filter condition. > It seems its not working its taking the condition before the operator and > ignoring the other filter condition after or operator. > A

Sorting a DStream and taking topN

2016-08-07 Thread Ahmed El-Gamal
I have some DStream in Spark Scala and I want to sort it then take the top N. The problem is that whenever I try to run it I get NotSerializableException and the exception message says: This is because the DStream object is being referred to from within the closure. The problem is that I don't kn
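
A minimal Scala sketch that sidesteps the serialization error (assuming a DStream[(String, Int)] named counts, ranked by the Int value): work on the underlying RDDs via foreachRDD or transform so the DStream itself is never captured in the closure:

    val topN = 10

    // Top-N per batch, computed on the RDD and collected to the driver.
    counts.foreachRDD { rdd =>
      val top = rdd.top(topN)(Ordering.by { case (_, c) => c })
      top.foreach(println)
    }

    // Or keep a DStream whose RDDs are sorted by the count, descending.
    val sorted = counts.transform(_.sortBy({ case (_, c) => c }, ascending = false))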

[Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
Hi, I have a use case where I need to use the or [||] operator in a filter condition. It seems it's not working: it's taking the condition before the operator and ignoring the other filter condition after the or operator. Has anybody faced a similar issue? Pseudo code: df.filter(col("colName").notEqual("no_value"

Re: Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-07 Thread Luciano Resende
Simple, just help us test the available extensions using Spark 2.0.0... preferably in real workloads that you might be using in your day-to-day usage of Spark. I wrote a quick getting-started guide for using the new MQTT Structured Streaming on my blog, which can serve as an example: http://lresende.bl

Re: Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-07 Thread Sivakumaran S
Hi, How can I help? regards, Sivakumaran S > On 06-Aug-2016, at 6:18 PM, Luciano Resende wrote: > > Apache Bahir is voting it's 2.0.0 release based on Apache Spark 2.0.0. > > https://www.mail-archive.com/dev@bahir.apache.org/msg00312.html >