Re: spark persistence doubt

2016-09-29 Thread Bedrytski Aliaksandr
you may lose the optimisations given by lining up the 3 steps in one operation). If a second action is executed on any of the transformations, persisting the farthest common transformation would be a good idea. Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Thu, Sep 29, 2016, at 07:09, Shus
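A minimal sketch of that idea, assuming a hypothetical three-step RDD pipeline where two actions share a common ancestor; the file name and transformations are illustrative:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical pipeline: parse, then filter.
    val parsed = sc.textFile("data.txt").map(_.split(","))
    val common = parsed.filter(_.length > 1).persist(StorageLevel.MEMORY_AND_DISK)

    // Both actions reuse `common`; without persist, the lineage above
    // would be recomputed for each of them.
    val total  = common.count()
    val sample = common.take(10)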

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Bedrytski Aliaksandr
lines") spark.sql("SELECT cast(value as FLOAT) from lines").show() +-+ |value| +-+ | null| | 1. | | null| | 8.6 | +-+ After it you may filter the DataFrame for values containing null. Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Wed, Sep 28, 2016, at 10

Re: how to find NaN values of each row of spark dataframe to decide whether the rows is dropeed or not

2016-09-26 Thread Bedrytski Aliaksandr
Hi Muhammet, Python also supports SQL queries: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Mon, Sep 26, 2016, at 10:01, muhammet pakyürek wrote: > > > > but my requs

Re: how to find NaN values of each row of spark dataframe to decide whether the rows is dropeed or not

2016-09-26 Thread Bedrytski Aliaksandr
uot;") This query filters rows containing Nan for a table with 3 columns. Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Mon, Sep 26, 2016, at 09:30, muhammet pakyürek wrote: > > is there any way to do this directly. if its not, is there any todo > this indirectly using another datastrcutures of spark >

Re: udf forces usage of Row for complex types?

2016-09-25 Thread Bedrytski Aliaksandr
how to read it as a table (by transforming it to a DataFrame) Regards -- Bedrytski Aliaksandr sp...@bedryt.ski On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote: > after having gotten used to having case classes represent complex > structures in Datasets, I am surprised to find out tha
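A sketch of that DataFrame route, with the case classes and names invented for illustration; instead of writing a UDF against Row, the nested structure is registered as a table and queried with SQL:

    import spark.implicits._

    // Hypothetical nested structure represented by case classes.
    case class Address(city: String, zip: String)
    case class Person(name: String, address: Address)

    val ds = Seq(Person("Ann", Address("Paris", "75001"))).toDS()
    ds.createOrReplaceTempView("people")

    // Nested fields are reachable with dot notation, no Row handling needed.
    spark.sql("SELECT name, address.city FROM people").show()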

Re: Spark Application Log

2016-09-22 Thread Bedrytski Aliaksandr
in one output. Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Thu, Sep 22, 2016, at 06:06, Divya Gehlot wrote: > Hi, > I have initialised the logging in my spark App > /* Initialize Logging */ val log = Logger.getLogger(getClass.getName) > > Logger.getLogger(
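A sketch of that initialization under the usual log4j setup; the @transient lazy val pattern is an assumption added here to keep the logger out of serialized closures:

    import org.apache.log4j.Logger

    object MyApp {
      // lazy + @transient: the logger is created where it is used and
      // never travels inside a serialized closure.
      @transient lazy val log: Logger = Logger.getLogger(getClass.getName)

      def main(args: Array[String]): Unit = {
        log.info("Application started")
      }
    }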

Re: Get profile from sbt

2016-09-21 Thread Bedrytski Aliaksandr
Hi Saurabh, you may use the BuildInfo[1] sbt plugin to access values defined in build.sbt Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Mon, Sep 19, 2016, at 18:28, Saurabh Malviya (samalviy) wrote: > Hi, > > Is there anything equivalent to profiles in Maven in sbt? I want spar
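A sketch of the sbt-buildinfo wiring following its documented setup; the plugin version and package name are illustrative:

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-buildinfo" % "0.6.1")

    // build.sbt
    lazy val root = (project in file("."))
      .enablePlugins(BuildInfoPlugin)
      .settings(
        buildInfoKeys := Seq[BuildInfoKey](name, version, scalaVersion, sbtVersion),
        buildInfoPackage := "com.example"
      )

    // The generated object is then available in code, e.g.:
    //   com.example.BuildInfo.version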

Re: SparkR error: reference is ambiguous.

2016-09-09 Thread Bedrytski Aliaksandr
s ambiguity problems. Regards -- Bedrytski Aliaksandr sp...@bedryt.ski On Fri, Sep 9, 2016, at 19:33, xingye wrote: > Not sure whether this is the right distribution list for asking questions. > If not, can someone point me to a list where I can find help? > > I
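The thread concerns SparkR, but the aliasing idea translates directly; a sketch with hypothetical tables where the join key exists on both sides:

    // Qualifying columns through table aliases removes the ambiguity
    // that arises when both tables carry an `id` column.
    val joined = spark.sql("""
      SELECT a.id, a.name, b.score
      FROM users a
      JOIN results b ON a.id = b.id
    """)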

Re: Why does spark take so much time for simple task without calculation?

2016-09-09 Thread Bedrytski Aliaksandr
Hi xiefeng, Even if your RDDs are tiny and reduced to one partition, there is always orchestration overhead (sending tasks to executors, reducing results, etc.); these things are not free. If you need fast, [near] real-time processing, look towards spark-streaming. Regards, -- Bedrytski
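A minimal streaming sketch, assuming a socket source; the batch interval, host, and port are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // One-second micro-batches: the orchestration cost is paid continuously,
    // so per-batch latency stays low compared to launching discrete jobs.
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.map(_.length).print()

    ssc.start()
    ssc.awaitTermination()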

Re: Why does spark take so much time for simple task without calculation?

2016-08-31 Thread Bedrytski Aliaksandr
don't really matter. Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Wed, Aug 31, 2016, at 11:45, xiefeng wrote: > I installed Spark standalone and run the cluster (one master and one > worker) on a Windows 2008 server with 16 cores and 24GB memory. > > I have done a

Re: How to acess the WrappedArray

2016-08-29 Thread Bedrytski Aliaksandr
(if the file is expected to be larger than bash tools can handle) you could iterate over the resulting WrappedArray and create a case class for each line. PS: I wonder where the *meta* object from the JSON goes. -- Bedrytski Aliaksandr sp...@bedryt.ski On Mon, Aug 29, 2016, at 11:27, Sre
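A sketch of that iteration, with the DataFrame, column name, and case class invented for illustration:

    import scala.collection.mutable.WrappedArray

    // Hypothetical: each row of `df` holds an array column named "lines".
    case class Line(text: String)

    val parsed = df.select("lines").collect().flatMap { row =>
      row.getAs[WrappedArray[String]]("lines").map(Line(_))
    }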

Re: Best way to calculate intermediate column statistics

2016-08-26 Thread Bedrytski Aliaksandr
Hi Mich, I was wondering: what are the advantages of using helper methods instead of one multiline SQL string? (I rarely, if ever, use helper methods, but maybe I'm missing something.) Regards -- Bedrytski Aliaksandr sp...@bedryt.ski On Thu, Aug 25, 2016, at 11:39, Mich Talebzadeh wrote
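For illustration, a single multiline query computing the kind of intermediate column statistics this thread discusses; the table and column names are assumed:

    val stats = sqlContext.sql("""
      SELECT
        count(*)                                      AS total_rows,
        sum(CASE WHEN col1 IS NULL THEN 1 ELSE 0 END) AS col1_empty,
        avg(col2)                                     AS col2_avg
      FROM source_table
    """)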

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Bedrytski Aliaksandr
dataframe. This way it won't hit performance too much. Regards -- Bedrytski Aliaksandr sp...@bedryt.ski On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote: > Hi, > > what is the best way to calculate intermediate column statistics like > the number of empty values an

Re: DataFrame Data Manipulation - Based on a timestamp column Not Working

2016-08-24 Thread Bedrytski Aliaksandr
E, 'yyyy-MM-dd') >= > unix_timestamp(demand_timefence_end_date, 'yyyy-MM-dd') > """) This is if demand_timefence_end_date has the 'yyyy-MM-dd' date format. Regards, -- Bedrytski Aliaksandr sp...@bedryt.ski On Wed, Aug 24, 2016, at 00:46, Subhajit Purka
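A sketch of the full query under assumed table and column names, using the unix_timestamp(string, format) built-in:

    val inWindow = sqlContext.sql("""
      SELECT *
      FROM demand
      WHERE unix_timestamp(order_date, 'yyyy-MM-dd')
            >= unix_timestamp(demand_timefence_end_date, 'yyyy-MM-dd')
    """)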

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-22 Thread Bedrytski Aliaksandr
wrong), if you already have >1 spec per test, the CPU will already be saturated, so fully parallel execution of tests will not give additional gains. Regards -- Bedrytski Aliaksandr sp...@bedryt.ski On Sun, Aug 21, 2016, at 18:30, Everett Anderson wrote: > > > On Sun, Aug 21, 201
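For reference, the sbt switch governing this is shown below; a general sbt setting rather than something from the thread itself:

    // build.sbt: run test suites sequentially, e.g. when each spec
    // already saturates the CPU through Spark's own parallelism.
    parallelExecution in Test := false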

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Bedrytski Aliaksandr
a temporary table, we add a unique, incremented, thread-safe id (AtomicInteger) to its name so that only specific, non-shared temporary tables are used for each test. -- Bedrytski Aliaksandr sp...@bedryt.ski > On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote: > Hi! > > Just
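A sketch of that pattern with invented names:

    import java.util.concurrent.atomic.AtomicInteger

    object TestTables {
      private val counter = new AtomicInteger(0)

      // Each call yields a fresh, never-reused table name.
      def unique(prefix: String): String = s"${prefix}_${counter.incrementAndGet()}"
    }

    // In a test, given some fixture DataFrame `df`:
    // register it under a name no other test shares.
    df.registerTempTable(TestTables.unique("fixture"))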

Re: Losing executors due to memory problems

2016-08-12 Thread Bedrytski Aliaksandr
of 6 nodes, 16 cores/node, 64GB RAM/node => gives: 17 executors, > 19GB/exec, 5 cores/exec > No more than 5 cores per exec > Leave some cores/RAM for the driver. More on the matter here: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
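Expressed as Spark configuration, the quoted sizing might look like this sketch; the keys are standard, the numbers come from the thread's 6-node example:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "17") // ~ (6 nodes * 16 cores - overhead) / 5 cores
      .set("spark.executor.cores", "5")      // no more than 5 cores per executor
      .set("spark.executor.memory", "19g")   // leave headroom for OS and driver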

Re: Random forest binary classification H20 difference Spark

2016-08-11 Thread Bedrytski Aliaksandr
Hi Samir, either use the *dataframe.na.fill()* method or the *nvl()* UDF when selecting features: val train = sqlContext.sql("SELECT ... nvl(Field, 1.0) AS Field ... FROM test") -- Bedrytski Aliaksandr sp...@bedryt.ski On Wed, Aug 10, 2016, at 11:19, Yanbo Liang wrote: > Hi S
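The DataFrame variant of the same fix, a sketch assuming a numeric column named Field on some DataFrame df:

    // Replace nulls in the "Field" column with 1.0 before training.
    val train = df.na.fill(1.0, Seq("Field"))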