Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Robin East
Another approach is to use L1 regularisation eg http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression. This adds a penalty term to the regression equation to reduce model complexity. When you use L1 (as opposed to say L2) this tends to
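As a concrete illustration of that suggestion (a minimal sketch, not code from the thread), in the Spark 2.0 spark.ml API an L1-regularised regression is obtained by setting elasticNetParam to 1.0, assuming a DataFrame `training` with assembled "features" and "label" columns:

    import org.apache.spark.ml.regression.LinearRegression

    // elasticNetParam = 1.0 selects a pure L1 (lasso) penalty; regParam controls its strength
    val lasso = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setRegParam(0.1)
      .setElasticNetParam(1.0)

    val model = lasso.fit(training)
    // coefficients shrunk to exactly 0.0 mark features the penalty has effectively dropped
    println(model.coefficients)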

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Rohit Chaddha
@Peyman - do any of the clustering algorithms have "feature Importance" or "feature selection" ability? I can't seem to pinpoint On Tue, Aug 9, 2016 at 8:49 AM, Peyman Mohajerian wrote: > You can try 'feature Importances' or 'feature selection' depending on what > else

SparkR error when repartition is called

2016-08-08 Thread Shane Lee
Hi All, I am trying out SparkR 2.0 and have run into an issue with repartition.  Here is the R code (essentially a port of the pi-calculating scala example in the spark package) that can reproduce the behavior: schema <- structType(structField("input", "integer"), structField("output",

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Peyman Mohajerian
You can try 'feature Importances' or 'feature selection' depending on what else you want to do with the remaining features; that's a possibility. Let's say you are trying to do classification: then some of the Spark libraries have a model parameter called 'featureImportances' that tells you which
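For example (a sketch under the assumption that the problem can be framed as supervised classification, with a "label" column and an assembled "features" vector), the tree-based spark.ml models expose this directly; ChiSqSelector is the closest standalone 'feature selection' transformer in spark.ml:

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(50)

    val model = rf.fit(df)
    // a vector with one importance score per feature, summing to 1.0
    println(model.featureImportances)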

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Rohit Chaddha
I would rather have fewer features to make better inferences on the data based on the smaller number of factors. Any suggestions, Sean? On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen wrote: > Yes, that's exactly what PCA is for as Sivakumaran noted. Do you > really want to select

Re: Re: how to generate a column using mapPartition and then add it back to the df?

2016-08-08 Thread 莫涛
Hi guha, Thanks a lot! This is perfectly what I want and I'll try to implement it. MoTao From: ayan guha Sent: 8 August 2016 18:05:37 To: 莫涛 Cc: ndj...@gmail.com; user@spark.apache.org Subject: Re: Re: how to generate a column using mapPartition and

Re: Cumulative Sum function using Dataset API

2016-08-08 Thread Jon Barksdale
I don't think that would work properly, and would probably just give me the sum for each partition. I'll give it a try when I get home just to be certain. To maybe explain the intent better, if I have a column (pre sorted) of (1,2,3,4), then the cumulative sum would return (1,3,6,10). Does that

Re: Cumulative Sum function using Dataset API

2016-08-08 Thread ayan guha
You mean you are not able to use sum(col) over (partition by key order by some_col) ? On Tue, Aug 9, 2016 at 9:53 AM, jon wrote: > Hi all, > > I'm trying to write a function that calculates a cumulative sum as a column > using the Dataset API, and I'm a little stuck on
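For reference, a minimal sketch of that suggestion with the DataFrame window API, assuming the spark-shell's SparkSession `spark` and using the values themselves as the ordering column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    val df = Seq(1, 2, 3, 4).toDF("value")

    // with an orderBy and no explicit frame, the window runs from the start
    // of the partition up to the current row, i.e. a running total: 1, 3, 6, 10
    val w = Window.orderBy("value")
    df.withColumn("cum_sum", sum("value").over(w)).show()

Without a partitionBy clause every row goes through a single partition, which is fine for small data but worth adding a partition key for anything large.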

Cumulative Sum function using Dataset API

2016-08-08 Thread jon
Hi all, I'm trying to write a function that calculates a cumulative sum as a column using the Dataset API, and I'm a little stuck on the implementation. From what I can tell, UserDefinedAggregateFunctions don't seem to support windowing clauses, which I think I need for this use case. If I

Logistic regression formula string

2016-08-08 Thread Cesar
I have a data frame with four columns: label, feature_1, feature_2, feature_3. Is there a simple way in the ML library to give me the weights together with the feature names? I can only get the weights, which makes this simple task complicated when one of my features is categorical. I am looking for
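There is no single call for this, but a common workaround (a sketch, assuming the features were assembled with a VectorAssembler and that no categorical expansion such as one-hot encoding changed the index-to-name mapping) is to zip the assembler's input columns with the fitted coefficients:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.classification.LogisticRegression

    val featureCols = Array("feature_1", "feature_2", "feature_3")
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
    val model = lr.fit(assembler.transform(df))

    // coefficient i corresponds to the i-th assembler input column
    featureCols.zip(model.coefficients.toArray).foreach { case (name, w) =>
      println(s"$name -> $w")
    }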

Issue with temporary table in Spark 2

2016-08-08 Thread Mich Talebzadeh
Hi, This used to work in Spark 1.6.1. I am trying in Spark 2 scala> val a = df.filter(col("Transaction Date") > "").map(p => Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString,p(4).toString,p(5).toString,p(6).toString,p(7).toString.toDouble)) a:
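For context, one common cause when porting this pattern (hedged, since the error above is truncated): in Spark 2, DataFrame.map returns a Dataset and needs an implicit Encoder for the result type, which spark.implicits._ provides for case classes. A sketch with hypothetical field names:

    // hypothetical field names; the real Accounts case class will differ
    case class Accounts(acc1: String, acc2: String, acc3: String, acc4: String,
                        acc5: String, acc6: String, acc7: String, amount: Double)

    import org.apache.spark.sql.functions.col
    import spark.implicits._   // supplies the Encoder[Accounts]

    val a = df.filter(col("Transaction Date") > "")
      .map(p => Accounts(p(0).toString, p(1).toString, p(2).toString, p(3).toString,
                         p(4).toString, p(5).toString, p(6).toString, p(7).toString.toDouble))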

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
The show thing was the result of an optimization that short-circuited any real Spark computation when the input is a local collection, and the result was simply the first few rows. That's why it completed without serializing anything. It is somewhat inconsistent. One way to eliminate the

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Davies Liu
On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor wrote: > Hi all, > > I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 > using pyspark. > > There is a big table (5.6 Billion rows, 450Gb in memory) loaded into 300 > executors's memory in SparkSQL,

java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Zoltan Fedor
Hi all, I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 using pyspark. There is a big table (5.6 billion rows, 450Gb in memory) loaded into 300 executors' memory in SparkSQL, on which we would do some calculation using UDFs in pyspark. If I run my SQL on only a

Re: zip for pyspark

2016-08-08 Thread Ewan Leith
If you build a normal python egg file with the dependencies, you can execute that like you are executing a .py file with --py-files Thanks, Ewan On 8 Aug 2016 3:44 p.m., pseudo oduesp wrote: hi, how i can export all project on pyspark like zip from local session to

Re: Symbol HasInputCol is inaccesible from this place

2016-08-08 Thread janardhan shetty
Can some experts shed light on this one? Still facing issues with extends HasInputCol and DefaultParamsWritable On Mon, Aug 8, 2016 at 9:56 AM, janardhan shetty wrote: > you mean is it deprecated ? > > On Mon, Aug 8, 2016 at 5:02 AM, Strange, Nick

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Reynold Xin
That is unfortunately the way the Scala compiler captures (and defines) closures. Nothing is really final in the JVM. You can always use reflection or unsafe to modify the value of fields. On Mon, Aug 8, 2016 at 8:16 PM, Simon Scott wrote: > But does the “notSer”

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-08 Thread Ted Yu
Mind showing the complete stack trace ? Thanks On Mon, Aug 8, 2016 at 12:30 PM, max square wrote: > Thanks Ted for the prompt reply. > > There are three or four DFs that are coming from various sources and I'm > doing a unionAll on them. > > val placesProcessed =

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Mich Talebzadeh
Unfortunately it is the case for now. spark-sql> show create table payees; Error in query: Failed to execute SHOW CREATE TABLE against table `accounts`.`payees`, which is created by Hive and uses the following unsupported feature(s) - bucketing; HTH Dr Mich Talebzadeh LinkedIn

Re: Spark join and large temp files

2016-08-08 Thread Yong Zhang
Join requires shuffling. The problem is that you have to shuffle 1.5TB of data, which caused the problem with your disk usage. Another way is to broadcast the 1.5GB small dataset, so there is no shuffle requirement for the 1.5TB dataset. But you need to make sure you have enough memory. Can you try to
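A sketch of that suggestion with the DataFrame API (the broadcast function is only a hint, and whether an outer join can use a broadcast hash join also depends on which side is preserved, so treat this as a starting point):

    import org.apache.spark.sql.functions.broadcast

    // ship the ~1.5GB side to every executor instead of shuffling the 1.5TB side;
    // executors need enough memory headroom to hold the broadcast copy
    val joined = a.join(broadcast(b), Seq("id"))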

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread manish jaiswal
Correct, it's creating delta files in HDFS. But after compaction it merges all the data and creates an extra directory where all the bucketed data is present. (I am able to read the data from Hive but not from Spark SQL.)

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-08 Thread Ted Yu
Can you show the code snippet for unionAll operation ? Which Spark release do you use ? BTW please use user@spark.apache.org in the future. On Mon, Aug 8, 2016 at 11:47 AM, max square wrote: > Hey guys, > > I'm trying to save Dataframe in CSV format after performing

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread manish jaiswal
I am using Spark 1.6.0 and Hive 1.2.1. Is reading from a Hive transactional table not supported yet by Spark SQL? On Tue, Aug 9, 2016 at 12:18 AM, manish jaiswal wrote: > Hi, > > I am not able to read data from hive transactional table using sparksql. > (i don't want read

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Mich Talebzadeh
I suspect this is happening because the underlying table has got delta files in it due to updates etc. and Spark cannot read it; it requires compaction. Can you do hdfs dfs -ls? Also, can you query a normal table in Hive (meaning non-transactional)? HTH Dr Mich Talebzadeh LinkedIn

RE: Spark join and large temp files

2016-08-08 Thread Ashic Mahtab
Hi Deepak, Thanks for the response. Registering the temp tables didn't help. Here's what I have: val a = sqlContext.read.parquet(...).select("eid.id", "name").withColumnRenamed("eid.id", "id"); val b = sqlContext.read.parquet(...).select("id", "number")

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Deepak Sharma
Can you please post the code snippet and the error you are getting ? -Deepak On 9 Aug 2016 12:18 am, "manish jaiswal" wrote: > Hi, > > I am not able to read data from hive transactional table using sparksql. > (i don't want read via hive jdbc) > > > > Please help. >

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Mich Talebzadeh
Which version of Spark and Hive are you using? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

SPARK SQL READING FROM HIVE

2016-08-08 Thread manish jaiswal
Hi, I am not able to read data from a Hive transactional table using Spark SQL. (I don't want to read via Hive JDBC.) Please help.

Getting a TreeNode Exception while saving into Hadoop

2016-08-08 Thread max square
Hey guys, I'm trying to save Dataframe in CSV format after performing unionAll operations on it. But I get this exception - Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: TungstenExchange hashpartitioning(mId#430,200) I'm saving it by

Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Sivakumaran S
Does it have anything to do with the fact that the mail address is displayed as user @spark.apache.org ? There is a space before ‘@‘. This is as received in my mail client. Sivakumaran > On 08-Aug-2016, at 7:42 PM, Chris Mattmann wrote: > >

Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Chris Mattmann
Weird! On 8/8/16, 11:10 AM, "Sean Owen" wrote: >I also don't know what's going on with the "This post has NOT been >accepted by the mailing list yet" message, because actually the >messages always do post. In fact this has been sent to the list 4 >times: >

Re: Spark join and large temp files

2016-08-08 Thread Deepak Sharma
Register your dataframes as temp tables and then try the join on the temp tables. This should resolve your issue. Thanks Deepak On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab wrote: > Hello, > We have two parquet inputs of the following form: > > a: id:String, Name:String (1.5TB)

Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Ovidiu-Cristian MARCU
Probably the yellow warning message can be confusing even more than not receiving an answer/opinion on his post. Best, Ovidiu > On 08 Aug 2016, at 20:10, Sean Owen wrote: > > I also don't know what's going on with the "This post has NOT been > accepted by the mailing list

Spark join and large temp files

2016-08-08 Thread Ashic Mahtab
Hello, We have two parquet inputs of the following form: a: id:String, Name:String (1.5TB); b: id:String, Number:Int (1.3GB). We need to join these two to get (id, Number, Name). We've tried two approaches: a.join(b, Seq("id"), "right_outer") where a and b are dataframes. We also tried taking the

Re: FW: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Sean Owen
I also don't know what's going on with the "This post has NOT been accepted by the mailing list yet" message, because actually the messages always do post. In fact this has been sent to the list 4 times: https://www.mail-archive.com/search?l=user%40spark.apache.org=dueckm=0=0 On Mon, Aug 8, 2016

using matrix as column datatype in SparkSQL Dataframe

2016-08-08 Thread Vadla, Karthik
Hello all, I'm trying to load a set of medical images (DICOM) into a Spark SQL dataframe. Here each image is loaded into a matrix column of the dataframe. I see Spark recently added MatrixUDT to support this kind of case, but I don't find a sample for using a matrix as a column in a dataframe.

Re: Source format for Apache Spark logo

2016-08-08 Thread Sean Owen
In case the attachments don't come through, BTW those are indeed downloadable from the directory http://spark.apache.org/images/ On Mon, Aug 8, 2016 at 6:09 PM, Sivakumaran S wrote: > Found these from the spark.apache.org website. > > HTH, > > Sivakumaran S > > > > > > On

Re: Symbol HasInputCol is inaccesible from this place

2016-08-08 Thread janardhan shetty
you mean is it deprecated ? On Mon, Aug 8, 2016 at 5:02 AM, Strange, Nick wrote: > What possible reason do they have to think its fragmentation? > > > > *From:* janardhan shetty [mailto:janardhan...@gmail.com] > *Sent:* Saturday, August 06, 2016 2:01 PM > *To:* Ted Yu >

Source format for Apache Spark logo

2016-08-08 Thread Michael.Arndt
Hi, for a presentation I'd appreciate a vector version of the Apache Spark logo; unfortunately I cannot find it. Is the logo available in a vector format somewhere?

Re: Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
Thanks Vaquar. My intention is to find something which can help stress test the code in spark , measure the performance and suggest some improvements. Is there any such framework or tool I can use here ? Thanks Deepak On 8 Aug 2016 9:14 pm, "vaquar khan" wrote: > I found

Re: Best practises around spark-scala

2016-08-08 Thread vaquar khan
I found following links are good as I am using same. http://spark.apache.org/docs/latest/tuning.html https://spark-summit.org/2014/testing-spark-best-practices/ Regards, Vaquar khan On 8 Aug 2016 10:11, "Deepak Sharma" wrote: > Hi All, > Can anyone please give any

Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
Hi All, Can anyone please give any documents that may be there around spark-scala best practises? -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
There must be an algorithmic way to figure out which of these factors contribute the least and remove them from the analysis. I am hoping someone can throw some insight on this. On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S wrote: > Not an expert here, but the first step

Unsubscribe

2016-08-08 Thread bijuna
Sent from my iPhone - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

zip for pyspark

2016-08-08 Thread pseudo oduesp
hi, how can I export a whole project in PySpark as a zip from a local session to the cluster and deploy it with spark-submit? I mean I have a large project with all its dependencies and I want to create a zip containing all of the dependencies and deploy it on the cluster

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Sivakumaran S
Not an expert here, but the first step would be to devote some time to identifying which of these 112 factors are actually causative. Some domain knowledge of the data may be required. Then you can start off with PCA. HTH, Regards, Sivakumaran S > On 08-Aug-2016, at 3:01 PM, Tony Lane
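A minimal PCA sketch in spark.ml, assuming the 112 factors are already assembled into a "features" vector column of df:

    import org.apache.spark.ml.feature.PCA

    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(20)

    val model = pca.fit(df)
    val reduced = model.transform(df)

    // model.pc holds the loadings: how strongly each original factor
    // contributes to each of the 20 new components
    println(model.pc)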

FW: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Chris Mattmann
On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de" wrote: >Hello, > >I write to you because I am not really sure whether I did everything right >when registering and subscribing to the spark user list. > >I posted the appended question to Spark User list

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
Great question Rohit. I am in my early days of ML as well and it would be great if we get some ideas on this from other experts in this group. I know we can reduce dimensions by using PCA, but I think that does not allow us to understand which of the original factors we end up using. -

Spark 2 and existing code with sqlContext

2016-08-08 Thread Mich Talebzadeh
Hi, In Spark 1.6.1 this worked: scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println) [08/08/2016 14:07:22.22] Spark 2 should give the same due to backward compatibility? But I get scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(),

RE: Symbol HasInputCol is inaccesible from this place

2016-08-08 Thread Strange, Nick
What possible reason do they have to think its fragmentation? From: janardhan shetty [mailto:janardhan...@gmail.com] Sent: Saturday, August 06, 2016 2:01 PM To: Ted Yu Cc: user Subject: Re: Symbol HasInputCol is inaccesible from this place Yes seems like, wondering if this can be made public in

Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Rohit Chaddha
I have a data-set where each data-point has 112 factors. I want to remove the factors which are not relevant, and say reduce to 20 factors out of these 112 and then do clustering of data-points using these 20 factors. How do I do these and how do I figure out which of the 20 factors are useful

Re: Multiple Sources Found for Parquet

2016-08-08 Thread Aseem Bansal
Seems that this is a common issue with Spark 2.0.0 I faced similar with CSV. Saw someone facing this with JSON. https://issues.apache.org/jira/browse/SPARK-16893 On Mon, Aug 8, 2016 at 4:08 PM, Ted Yu wrote: > Can you examine classpath to see where *DefaultSource comes

Re: Multiple Sources Found for Parquet

2016-08-08 Thread Ted Yu
Can you examine the classpath to see where DefaultSource comes from? Thanks On Mon, Aug 8, 2016 at 2:34 AM, 金国栋 wrote: > I'm using Spark2.0.0 to do sql analysis over parquet files, when using > `read().parquet("path")`, or `write().parquet("path")` in Java(I followed > the

Re: Re: how to generate a column using mapPartition and then add it back to the df?

2016-08-08 Thread ayan guha
Hi, I think you should modify the initModel() function to getOrCreateModel() and create the model as a singleton object. You may want to refer to this link. On Mon, Aug 8, 2016 at 7:44 PM, 莫涛
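A minimal sketch of that singleton idea, where LocalModel stands in for the real JNI-backed classifier and df is assumed to have a double column at position 0:

    // stand-in for the real, expensive-to-load, non-serializable model
    class LocalModel(path: String) { def predict(x: Double): Double = x * 2 }

    object ModelHolder {
      // a lazy val in an object is initialised at most once per executor JVM,
      // so every partition processed on that executor reuses the same instance
      lazy val model = new LocalModel("/local/path/model.bin")
    }

    val scored = df.rdd.mapPartitions { rows =>
      val m = ModelHolder.model
      rows.map(r => m.predict(r.getDouble(0)))
    }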

Re: how to generate a column using mapPartition and then add it back to the df?

2016-08-08 Thread 莫涛
Hi Ndjido, Thanks for your reply. Yes, it is a good idea if the model can be broadcast. I'm working with a compiled library (on Linux, say classifier.so and classifier.h) and it requires that the model file be in the local file system. As I don't have access to the library code, I write JNI to wrap the

Multiple Sources Found for Parquet

2016-08-08 Thread 金国栋
I'm using Spark 2.0.0 to do sql analysis over parquet files. When using `read().parquet("path")` or `write().parquet("path")` in Java (I followed the example java file in the source code exactly), I always encountered Exception in thread "main" java.lang.RuntimeException: Multiple sources found for

Re: why spark 2 shell console still sending warnings despite setting log4j.rootCategory=ERROR, console

2016-08-08 Thread Mich Talebzadeh
Sorted. The new section in log4j.properties has to be modified # Set the default spark-shell log level to WARN. When running the spark-shell, the # log level for this class is used to overwrite the root logger's log level, so that # the user can have different defaults for the shell and regular

Re: how to generate a column using mapParition and then add it back to the df?

2016-08-08 Thread ndjido
Hi MoTao, What about broadcasting the model? Cheers, Ndjido. > On 08 Aug 2016, at 11:00, MoTao wrote: > > Hi all, > > I'm trying to append a column to a df. > I understand that the new column must be created by > 1) using literals, > 2) transforming an existing column in

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Gourav Sengupta
There is a mv command in GCS but I am not quite sure (because of the limited data I work on and my lack of budget) whether the mv command actually copies and deletes or just re-points the files to a new directory by changing their metadata. Yes, the Data Quality checks are done after the

how to generate a column using mapPartition and then add it back to the df?

2016-08-08 Thread MoTao
Hi all, I'm trying to append a column to a df. I understand that the new column must be created by 1) using literals, 2) transforming an existing column in df, or 3) generated from udf over this df In my case, the column to be appended is created by processing each row, like val df =

why spark 2 shell console still sending warnings despite setting log4j.rootCategory=ERROR, console

2016-08-08 Thread Mich Talebzadeh
Hi, Just doing some tests on Spark 2 version spark-shell using spark-shell, in my $SPARK_HOME/conf/log4j.properties, I have: log4j.rootCategory=ERROR, console I get spark-shell Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). I don't have this issue

Spark driver memory keeps growing

2016-08-08 Thread Pierre Villard
Hi, I'm running a job on Spark 1.5.2 and I get OutOfMemoryError on broadcast variables access. The thing is I am not sure to understand why the broadcast keeps growing and why it does at this place of code. Basically, I have a large input file, each line having a key. I group by key my lines to

Re: mapWithState handle timeout

2016-08-08 Thread jackerli
I got the answer from Stack Overflow: http://stackoverflow.com/questions/38397688/spark-mapwithstate-api-explanation When a key times out, the new value is None; in my case, I set the value when the timeout occurs.
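For reference, a sketch of how the timeout shows up in the mapping function, where keyedStream is an assumed DStream[(String, Int)]:

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // value is None when the call is a timeout tick for an idle key
    def track(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      if (state.isTimingOut()) {
        (key, state.get())              // on timeout the state can only be read, not updated
      } else {
        val total = state.getOption().getOrElse(0) + value.getOrElse(0)
        state.update(total)
        (key, total)
      }
    }

    val spec = StateSpec.function(track _).timeout(Minutes(10))
    val mapped = keyedStream.mapWithState(spec)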

Re: What are the configurations needs to connect spark and ms-sql server?

2016-08-08 Thread Deepak Sharma
Hi Devi, Please make sure the JDBC jar is in the Spark classpath. With spark-submit, you can use the --jars option to specify the SQL Server JDBC jar. Thanks Deepak On Mon, Aug 8, 2016 at 1:14 PM, Devi P.V wrote: > Hi all, > > I am trying to write a spark dataframe into MS-Sql
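A minimal write sketch along those lines (server, database, table and credentials below are placeholders, and the Microsoft JDBC driver jar is assumed to have been passed with --jars):

    val props = new java.util.Properties()
    props.setProperty("user", "uname")
    props.setProperty("password", "pwd")
    props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

    df.write
      .mode("append")
      .jdbc("jdbc:sqlserver://sqlhost:1433;databaseName=mydb", "dbo.my_table", props)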

Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-08 Thread Hao Ren
Yes, it is. You can define a udf like that. Basically, it's a udf Int => Int which is a closure that contains a non-serializable object. The latter should cause a Task not serializable exception. Hao On Mon, Aug 8, 2016 at 5:08 AM, Muthu Jayakumar wrote: > Hello Hao Ren, > >
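For readers hitting the exception in real (non-local-collection) jobs, one common workaround, sketched here rather than taken from the thread, is to keep the non-serializable object out of the captured closure and build it lazily on the executors:

    import org.apache.spark.sql.functions.udf

    // stand-in for the real non-serializable dependency
    class NotSerializableResource { def lookup(i: Int): Int = i + 1 }

    object ResourceHolder {
      lazy val res = new NotSerializableResource   // created once per executor JVM
    }

    // the closure references the singleton holder, so the resource is built
    // on each executor rather than serialized from the driver
    val myUdf = udf((i: Int) => ResourceHolder.res.lookup(i))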

What are the configurations needs to connect spark and ms-sql server?

2016-08-08 Thread Devi P.V
Hi all, I am trying to write a spark dataframe into MS-Sql Server.I have tried using the following code, val sqlprop = new java.util.Properties sqlprop.setProperty("user","uname") sqlprop.setProperty("password","pwd")

Re: Is Spark right for my use case?

2016-08-08 Thread Deepak Sharma
Hi Danellis, For point 1, Spark Streaming is something to look at. For point 2, you can create a DAO from Cassandra on each stream processing step. This may be a costly operation, but to do real-time processing of data you have to live with it. Point 3 is covered in point 2 above. Since you are

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-08 Thread Mich Talebzadeh
I am afraid the logic is incorrect. That is the reason why it is not working. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Holden Karau
If you're using this from Java you might find it easier to construct a JavaSparkContext; the broadcast function will automatically use a fake class tag. On Sun, Aug 7, 2016 at 11:57 PM, Aseem Bansal wrote: > I am using the following to broadcast and it explicitly requires

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Aseem Bansal
I am using the following to broadcast and it explicitly requires classtag sparkSession.sparkContext().broadcast On Mon, Aug 8, 2016 at 12:09 PM, Holden Karau wrote: > Classtag is Scala concept (see http://docs.scala-lang. >

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
Thank you Gourav, > Moving files from _temp folders to main folders is an additional overhead > when you are working on S3 as there is no move operation. Good catch. Is it the same with GCS? > I generally have a set of Data Quality checks after each job to ascertain > whether everything went

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Gourav Sengupta
But you have to be careful, that is the default setting. There is a way you can overwrite it so that the writing to _temp folder does not take place and you write directly to the main folder. Moving files from _temp folders to main folders is an additional overhead when you are working on S3 as

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
It’s out of the box in Spark. When you write data into HDFS or any storage, it only creates the new parquet folder properly if your Spark job succeeded; otherwise there is only a _temp folder inside to mark that it did not succeed (Spark was killed), or nothing inside (the Spark job failed). > On Aug 8, 2016,

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Holden Karau
ClassTag is a Scala concept (see http://docs.scala-lang.org/overviews/reflection/typetags-manifests.html) - although this should not be explicitly required - looking at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext we can see that in Scala the class tag is
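In Scala the tag is supplied implicitly, which is why it is rarely visible; a small sketch (sc is an assumed SparkContext):

    import scala.reflect.ClassTag

    val lookup = Map("a" -> 1, "b" -> 2)

    // the implicit ClassTag is filled in by the compiler...
    val bc = sc.broadcast(lookup)

    // ...which is shorthand for passing it explicitly
    val bcExplicit = sc.broadcast(lookup)(implicitly[ClassTag[Map[String, Int]]])

From Java, wrapping the context in a JavaSparkContext (as suggested above) avoids having to construct a tag by hand.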

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-08 Thread Aseem Bansal
Hi Ewan, The .as function takes a single encoder, a single string, or a single Symbol. I have more than 10 columns so I cannot use the tuple functions. Passing them using brackets does not work. On Mon, Aug 8, 2016 at 11:26 AM, Ewan Leith wrote: > Looking at the
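One way around the tuple limit (a sketch, assuming the selected column names match the case-class fields) is to let a case class supply the encoder:

    // a case class encoder is not limited to small tuple arities
    case class Record(id: Long, name: String, price: Double /* , ...more fields */)

    import spark.implicits._
    val ds = df.select("id", "name", "price").as[Record]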

hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Sumit Khanna
Hello, the use case is as follows: say I am inserting 200K rows with dataframe.write.format("parquet") etc. (like a basic write-to-HDFS command), but say for some reason my job got killed when the run was in the middle of it, meaning let's say I was only able to insert 100K rows

Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Aseem Bansal
Earlier for broadcasting we just needed to use sparkcontext.broadcast(objectToBroadcast) But now it is sparkcontext.broadcast(objectToBroadcast, classTag) What is classTag here?

Is Spark right for my use case?

2016-08-08 Thread danellis
Spark n00b here. Working with online retailers, I start with a list of their products in Cassandra (with prices, stock levels, descriptions, etc) and then receive an HTTP request every time one of them changes. For each change, I update the product in Cassandra and store the change with the old

Re: submitting spark job with kerberized Hadoop issue

2016-08-08 Thread Aneela Saleem
Thanks Saisai and Ted, I have already configured HBase security and it's working fine. I have also done kinit before submitting job. Following is the code i'm trying to use System.setProperty("java.security.krb5.conf", "/etc/krb5.conf"); System.setProperty("java.security.auth.login.config",

map vs mapPartitions

2016-08-08 Thread rtijoriwala
Hi All, I am a newbie to spark and want to know if there is any performance difference between map vs mapPartitions if I am doing strictly a per item transformation? For e.g. reversedWords = words.map(w => w.reverse()); vs. reversedWords = words.mapPartitions(pwordsIterator => { List
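For a strictly per-item transformation the two are equivalent in result; a small sketch of where mapPartitions earns its keep (words is an assumed RDD[String]):

    // per-record: simple, and usually just as fast for cheap functions
    val reversed = words.map(_.reverse)

    // per-partition: worthwhile when there is setup to amortise,
    // e.g. opening a connection or building a parser once per partition
    val reversedToo = words.mapPartitions { iter =>
      // (expensive one-time setup would go here)
      iter.map(_.reverse)
    }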