Re: how to use DataFrame.na.fill based on condition

2015-11-23 Thread Davies Liu
DataFrame.replace(to_replace, value, subset=None) http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace On Mon, Nov 23, 2015 at 11:05 AM, Vishnu Viswanath wrote: > Hi > > Can someone tell me if there is a way I can use the

how to use DataFrame.na.fill based on condition

2015-11-23 Thread Vishnu Viswanath
Hi, Can someone tell me if there is a way I can use the fill method in DataFrameNaFunctions based on some condition? e.g., df.na.fill("value1","column1","condition1") df.na.fill("value2","column1","condition2") I want to fill nulls in column1 with one of two values - either value1 or value2,

Re: Please add us to the Powered by Spark page

2015-11-23 Thread Sujit Pal
Sorry to be a nag, I realize folks with edit rights on the Powered by Spark page are very busy people, but it's been 10 days since my original request, and I was wondering if maybe it just fell through the cracks. If I should submit via some other channel that will make sure it is looked at (or better

Re: how to use DataFrame.na.fill based on condition

2015-11-23 Thread Vishnu Viswanath
Thanks for the reply, Davies. I think replace replaces a value with another value, but what I want to do is fill in the null values of a column (I don't have a to_replace here). Regards, Vishnu On Mon, Nov 23, 2015 at 1:37 PM, Davies Liu wrote: >

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
Hi Sabarish, I am but a simple padawan :-) I do not understand your answer. Why would Spark be creating so many empty partitions? My real problem is that my application is very slow. I happened to notice thousands of empty files being created, and I thought this might be a hint as to why my app is slow. My program

spark-csv on Amazon EMR

2015-11-23 Thread Daniel Lopes
Hi, Does anyone know how to use spark-csv in the create-cluster statement of the Amazon EMR CLI? Best, -- *Daniel Lopes, B.Eng* Data Scientist - BankFacil CREA/SP 5069410560 Mob +55 (18) 99764-2733 Ph +55

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Xiao Li
In your case, maybe you can try to call the function coalesce? Good luck, Xiao Li 2015-11-23 12:15 GMT-08:00 Andy Davidson : > Hi Sabarish > > I am but a simple padawan :-) I do not understand your answer. Why would > Spark be creating so many empty partitions?
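
A minimal PySpark sketch of the coalesce suggestion, assuming an existing SparkContext sc and an RDD with many small or empty partitions; the paths and the target partition count below are placeholders to tune against your data size:

    # Merge the existing partitions into fewer, larger ones without a full
    # shuffle, so the subsequent save produces fewer output files.
    rdd = sc.textFile("hdfs:///tmp/input")       # hypothetical input path
    merged = rdd.coalesce(64)                    # placeholder partition count
    merged.saveAsTextFile("hdfs:///tmp/output")  # hypothetical output path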

Re: SparkR DataFrame , Out of memory exception for very small file.

2015-11-23 Thread Vipul Rai
Hi Jeff, This is only part of the actual code. My questions are mentioned in comments near the code. SALES<- SparkR::sql(hiveContext, "select * from sales") PRICING<- SparkR::sql(hiveContext, "select * from pricing") ## renaming of columns ## #sales file# # Is this right ??? Do we have to

Re: how to use sc.hadoopConfiguration from pyspark

2015-11-23 Thread Eike von Seggern
2015-11-23 10:26 GMT+01:00 Tamas Szuromi : > Hello Eike, > > Thanks! Yes I'm using it with Hadoop 2.6 so I'll give a try to the 2.4 > build. > Have you tried it with 1.6 Snapshot or do you know JIRA tickets for this > missing libraries issues? I've not tried 1.6.

Re: SparkR DataFrame , Out of memory exception for very small file.

2015-11-23 Thread Jeff Zhang
>>> Do I need to create a new DataFrame for every update to the DataFrame like addition of new column or need to update the original sales DataFrame. Yes, DataFrame is immutable, and every mutation of DataFrame will produce a new DataFrame. On Mon, Nov 23, 2015 at 4:44 PM, Vipul Rai
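
A small illustration of that point in the Python API (the SparkR withColumn behaves the same way); the column names are hypothetical:

    from pyspark.sql.functions import col

    sales = sqlContext.sql("select * from sales")  # assumes an existing HiveContext/SQLContext
    # withColumn does not modify `sales`; it returns a brand-new DataFrame.
    sales_with_margin = sales.withColumn("margin", col("price") - col("cost"))
    # Reassign if you only want to keep the updated DataFrame around.
    sales = sales_with_margin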

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread Saisai Shao
Hi Tingwen, Would you mind sharing your changes in ExecutorAllocationManager#addExecutors()? From my understanding and testing, dynamic allocation works when you set the min and max number of executors to the same number. Please check your Spark and YARN logs to make sure the executors
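
For reference, a hedged sketch of the configuration being described (dynamic allocation with min == max executors); the executor count is a placeholder, and the external shuffle service must also be enabled on the YARN NodeManagers:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("dynamic-allocation-test")
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.shuffle.service.enabled", "true")
            .set("spark.dynamicAllocation.minExecutors", "50")
            .set("spark.dynamicAllocation.maxExecutors", "50"))
    sc = SparkContext(conf=conf)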

Re: Re: RE: Error not found value sqlContext

2015-11-23 Thread prosp4300
So it is actually a compile-time error in Eclipse. Instead of generating the jar from Eclipse, you can try to use sbt to assemble your jar; it looks like your Eclipse does not recognize the Scala syntax properly. At 2015-11-20 21:36:55, "satish chandra j" wrote: HI All,

Re: SparkR DataFrame , Out of memory exception for very small file.

2015-11-23 Thread Vipul Rai
Hello Rui, Sorry, what I meant was that adding a new column to the original DataFrame gives a new DataFrame as the result. Please check this for more: https://spark.apache.org/docs/1.5.1/api/R/index.html (see withColumn). Thanks, Vipul On 23 November 2015 at 12:42, Sun, Rui

Re: how to use sc.hadoopConfiguration from pyspark

2015-11-23 Thread Tamas Szuromi
Hello Eike, Thanks! Yes I'm using it with Hadoop 2.6 so I'll give a try to the 2.4 build. Have you tried it with 1.6 Snapshot or do you know JIRA tickets for this missing libraries issues? Tamas On 23 November 2015 at 10:21, Eike von Seggern wrote: > Hello

Re: SparkR DataFrame , Out of memory exception for very small file.

2015-11-23 Thread Vipul Rai
Hi Jeff, Thanks for the reply, but could you tell me why it is taking so much time? What could be wrong? Also, when I remove the DataFrame from memory using rm(), it does not clear the memory, though the object is deleted. Also, what about the R functions which are not supported in SparkR, like

Re: how to use sc.hadoopConfiguration from pyspark

2015-11-23 Thread Eike von Seggern
Hello Tamas, 2015-11-20 17:23 GMT+01:00 Tamas Szuromi : > > Hello, > > I've just wanted to use sc._jsc.hadoopConfiguration().set('key','value') in > pyspark 1.5.2 but I got set method not exists error. For me it's working with Spark 1.5.2 binary distribution built
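
A sketch of the pattern under discussion, assuming a running pyspark SparkContext sc; the S3 keys are just example Hadoop configuration entries:

    # Reach the underlying JVM Hadoop Configuration through the Java SparkContext
    # and set keys before reading any data that depends on them.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")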

Re: SparkR DataFrame , Out of memory exception for very small file.

2015-11-23 Thread Jeff Zhang
If possible, could you share your code? What kind of operation are you doing on the dataframe? On Mon, Nov 23, 2015 at 5:10 PM, Vipul Rai wrote: > Hi Jeff, > > Thanks for the reply, but could you tell me why is it taking so much time. > What could be wrong , also when

DateTime Support - Hive Parquet

2015-11-23 Thread Bryan Jeffrey
All, I am attempting to write objects that include DateTime properties to a persistent table using Spark 1.5.2 / HiveContext. In 1.4.1 I was forced to convert the DateTime properties to Timestamp properties. I was under the impression that this issue was fixed in the default Hive supported

Re: Re: RE: Error not found value sqlContext

2015-11-23 Thread satish chandra j
Thanks for all the support. It was a code issue which I overlooked. Regards, Satish Chandra On Mon, Nov 23, 2015 at 3:49 PM, satish chandra j wrote: > Sorry, just to understand my issue. If Eclipse could not understand > Scala syntax properly then it should error

Re: Data in one partition after reduceByKey

2015-11-23 Thread Patrick McGloin
I will answer my own question, since I figured it out. Here is my answer in case anyone else has the same issue. My DateTimes were all without seconds and milliseconds since I wanted to group data belonging to the same minute. The hashCode() for Joda DateTimes which are one minute apart is a
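
The original code was Scala with Joda keys, but the idea generalizes; a hedged PySpark sketch that keys by a plain integer (epoch minutes) whose hash spreads records across partitions, with the partition count passed explicitly (the events RDD and the count of 48 are hypothetical):

    # events: RDD of (timestamp_in_seconds, value) pairs
    minute_counts = (events
                     .map(lambda tv: (int(tv[0] // 60), tv[1]))   # truncate key to the minute
                     .reduceByKey(lambda a, b: a + b, numPartitions=48))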

Add Data Science Serbia meetup

2015-11-23 Thread Darko Marjanovic
Please add the Data Science Serbia meetup group to the list on the web site. http://www.meetup.com/Data-Science-Serbia/ Thank you. Best, Darko Marjanovic CEO & Co-Founder Things Solver *M: +381637052054* *E: da...@thingsolver.com *

Any workaround for Kafka couldn't find leaders for set?

2015-11-23 Thread Hudong Wang
Hi folks, We have a 10-node cluster and have several topics. Each topic has about 256 partitions with a replication factor of 3. Now we have run into an issue where, in some topics, a few partitions' (< 10) leader is -1 and all of them have only one synced replica. Exception in thread "main"

Re: how to use DataFrame.na.fill based on condition

2015-11-23 Thread Davies Liu
You could create a new column based on the expression: IF (condition1, value1, old_column_value) On Mon, Nov 23, 2015 at 11:57 AM, Vishnu Viswanath wrote: > Thanks for the reply Davies > > I think replace, replaces a value with another value. But what I want to do >
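
A sketch of that IF(condition, value, old_value) idea in the DataFrame API; the "type" column and the two conditions are hypothetical stand-ins for condition1/condition2:

    from pyspark.sql.functions import when, col

    filled = df.withColumn(
        "column1",
        when(col("column1").isNull() & (col("type") == "A"), "value1")
        .when(col("column1").isNull() & (col("type") == "B"), "value2")
        .otherwise(col("column1")))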

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Andy Davidson
Hi Xiao and Sabarish, Using the Stage tab on the UI, it turns out you can see how many partitions there are. If I did nothing I would have 228155 partitions (this confirms what Sabarish said). I tried coalesce(3), but RDD.count() fails. I thought, given I have 3 workers, that 1/3 of the data would easily

Re: Need Help Diagnosing/operating/tuning

2015-11-23 Thread Igor Berman
You should check why the executor is killed. As soon as it's killed you can get all kinds of strange exceptions... Either give your executors more memory (4G is rather small for Spark) or try to decrease your input, or maybe split it into more partitions in the input format. 23G in lzo might expand to x?

Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Michael
Hi all, I'm working on a project with Spark Streaming; the goal is to process log files from S3 and save them on Hadoop to later analyze them with Spark SQL. Everything works well except when I kill the Spark application and restart it: it picks up from the latest processed batch and reprocesses it

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread Saisai Shao
I don't think it is a bug; maybe something is wrong with your Spark / YARN configurations. On Tue, Nov 24, 2015 at 12:13 PM, 谢廷稳 wrote: > OK, the YARN cluster was used by myself; it has 6 nodes which can run over > 100 executors, and the YARN RM logs showed that the Spark

Re: load multiple directory using dataframe load

2015-11-23 Thread Fengdong Yu
hiveContext.read.format("orc").load("bypath/*") > On Nov 24, 2015, at 1:07 PM, Renu Yadav wrote: > > Hi , > > I am using dataframe and want to load orc file using multiple directory > like this: > hiveContext.read.format.load("mypath/3660,myPath/3661") > > but it is not
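
Two hedged variants of that suggestion in the Python API, assuming a HiveContext named hiveContext and ORC data under two sibling directories:

    # Option 1: let a glob pattern cover both directories.
    df = hiveContext.read.format("orc").load("mypath/366[01]")

    # Option 2: read each directory separately and union the results.
    df = (hiveContext.read.format("orc").load("mypath/3660")
          .unionAll(hiveContext.read.format("orc").load("mypath/3661")))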

Re: Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Burak Yavuz
Not sure if it would be the most efficient, but maybe you can think of the filesystem as a key value store, and write each batch to a sub-directory, where the directory name is the batch time. If the directory already exists, then you shouldn't write it. Then you may have a following batch job
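
A rough PySpark sketch of that idea, assuming a DStream named lines, an HDFS output root, and that the two-argument foreachRDD callback receives the batch time as a Python datetime; the existence check goes through the JVM FileSystem API:

    def save_batch(time, rdd):
        path = "hdfs:///logs/processed/%s" % time.strftime("%Y%m%d-%H%M%S")
        jvm = rdd.context._jvm
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(rdd.context._jsc.hadoopConfiguration())
        # Skip the write if this batch directory already exists, i.e. it was
        # already processed before a restart.
        if not fs.exists(jvm.org.apache.hadoop.fs.Path(path)):
            rdd.saveAsTextFile(path)

    lines.foreachRDD(save_batch)

Note the check-then-write is not atomic, so a batch that crashed mid-write would still need the follow-up cleanup job mentioned above.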

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Mark Hamstra
> > In the near future, I guess GUI interfaces of Spark will be available > soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all. > They can analyze their data by clicking a few buttons, instead of writing > the programs. : ) That's not in the future. :) On Mon, Nov 23,

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
OK. I see the following to query the offsets. In our Kafka stream, the offsets are stored in ZooKeeper and I am not updating offsets in ZooKeeper. How does Kafka Direct know which offsets to query? Does it calculate automatically which offsets to query? I have "auto.offset.reset" ->

load multiple directory using dataframe load

2015-11-23 Thread Renu Yadav
Hi, I am using DataFrames and want to load ORC files from multiple directories, like this: hiveContext.read.format.load("mypath/3660,myPath/3661") but it is not working. Please suggest how to achieve this. Thanks & Regards, Renu Yadav

Re: Spark Kafka Direct Error

2015-11-23 Thread swetha kasireddy
Does Kafka direct query the offsets from ZooKeeper directly? From where does it get the offsets? There is data at those offsets, but somehow Kafka Direct does not seem to pick it up. Other consumers that use the ZooKeeper quorum of that stream seem to be fine. Only Kafka Direct seems to have

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread 谢廷稳
Hi Saisai, Would you mind giving me some tips about this problem? After checking the YARN RM logs, I think the Spark application didn't request resources from it, so I guess this problem is none of YARN's business. The Spark conf of my cluster is listed in the following:

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread cherrywayb...@gmail.com
Can you show your parameter values in your env? yarn.nodemanager.resource.cpu-vcores yarn.nodemanager.resource.memory-mb cherrywayb...@gmail.com From: 谢廷稳 Date: 2015-11-24 12:13 To: Saisai Shao CC: spark users Subject: Re: A Problem About Running Spark 1.5 on YARN with Dynamic

Re: DateTime Support - Hive Parquet

2015-11-23 Thread Cheng Lian
Hey Bryan, What do you mean by "DateTime properties"? Hive and Spark SQL both support DATE and TIMESTAMP types, but there's no DATETIME type. So I assume you are referring to Java class DateTime (possibly the one in joda)? Could you please provide a sample snippet that illustrates your

Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Hi everyone, I'm doing some reading-up on all the newer features of Spark such as DataFrames, Datasets and Project Tungsten. This got me a bit confused about the relation between all these concepts. When starting to learn Spark, I read a book and the original paper on RDDs; this led me to

spark-ec2 script to launch cluster running Spark 1.5.2 built with HIVE?

2015-11-23 Thread Jeff Schecter
Hi all, As far as I can tell, the bundled spark-ec2 script provides no way to launch a cluster running Spark 1.5.2 pre-built with HIVE. That is to say, all of the pre-built versions of Spark 1.5.2 in the S3 bucket spark-related-packages are missing HIVE. aws s3 ls s3://spark-related-packages/ |

How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha
Hi, We see a bunch of issues like the following in our Spark Kafka Direct. Any idea as to how to make Kafka Direct consumers show up in Kafka consumer reporting to debug this issue? Job aborted due to stage failure: Task 47 in stage 336.0 failed 4 times, most recent failure: Lost task 47.3 in

Re: Dataframe constructor

2015-11-23 Thread Fengdong Yu
Just as simple as: val df = sqlContext.sql("select * from table") or val df = sqlContext.read.json("hdfs_path") > On Nov 24, 2015, at 3:09 AM, spark_user_2015 wrote: > > Dear all, > > is the following usage of the Dataframe constructor correct or does it > trigger any side

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread Cody Koeninger
What exactly do you mean by kafka consumer reporting? I'd log the offsets in your spark job and try running kafka-simple-consumer-shell.sh --partition $yourbadpartition --print-offsets at the same time your spark job is running On Mon, Nov 23, 2015 at 7:37 PM, swetha
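
For the "log the offsets" part, a sketch along the lines of the Kafka direct-stream documentation, assuming directKafkaStream was created with KafkaUtils.createDirectStream:

    offset_ranges = []

    def store_offset_ranges(rdd):
        # Capture the offset ranges on the driver before any other transformation.
        global offset_ranges
        offset_ranges = rdd.offsetRanges()
        return rdd

    def print_offset_ranges(rdd):
        for o in offset_ranges:
            print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

    directKafkaStream.transform(store_offset_ranges).foreachRDD(print_offset_ranges)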

Port Control for YARN-Aware Spark

2015-11-23 Thread gpriestley
Hello Community, I have what I hope to be a couple of quick questions regarding port control on Spark which is Yarn-aware (cluster & client modes). I'm aware that I can control port configurations by setting driver.port, executor.port, etc to use specified ports, but I'm not sure how/if that

Apache Cassandra Docker Images?

2015-11-23 Thread Renato Perini
Hello, any planned support for official Docker images? It would be great to have some images using the cluster manager of choice (Standalone, YARN, Mesos) with the latest Apache Spark distribution (ideally, using CentOS 7.x) for clusterizable containers. Regards, Renato Perini.

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread 谢廷稳
Hi Saisai, I'm sorry I did not describe it clearly. The YARN debug log said I have 50 executors, but the ResourceManager showed that I only have 1 container, for the AppMaster. I have checked the YARN RM logs; after the AppMaster changed state from ACCEPTED to RUNNING, it did not have any more logs about this job

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Michael Armbrust
Here is how I view the relationship between the various components of Spark: - RDDs - a low-level API for expressing DAGs that will be executed in parallel by Spark workers - Catalyst - an internal library for expressing trees that we use to build relational algebra and expression

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
I mean to show the Spark Kafka Direct consumers in the Kafka stream UI. Usually we create a consumer and the consumer gets shown in the Kafka stream UI. How do I log the offsets in the Spark job? On Mon, Nov 23, 2015 at 6:11 PM, Cody Koeninger wrote: > What exactly do you mean

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
Also, does Kafka direct query the offsets from ZooKeeper directly? From where does it get the offsets? There is data at those offsets, but somehow Kafka Direct does not seem to pick it up. On Mon, Nov 23, 2015 at 6:18 PM, swetha kasireddy wrote: > I mean to show

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread 谢廷稳
Hi Saisai, I have changed "if (numExecutorsTarget >= maxNumExecutors)" to "if (numExecutorsTarget > maxNumExecutors)" in the first line of ExecutorAllocationManager#addExecutors() and it ran well. In my opinion, when I set minExecutors equal to maxExecutors, the first time it tries to add

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Xiao Li
Let me share my understanding. If we view Spark as an analytics OS, the RDD APIs are like OS system calls. These low-level system calls can be called in programming languages like C. The DataFrame and Dataset APIs are like higher-level programming languages. They hide the low-level complexity and the

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Thanks Michael, that helped me a lot! On 23 November 2015 at 17:47, Michael Armbrust wrote: > Here is how I view the relationship between the various components of > Spark: > > - *RDDs - *a low level API for expressing DAGs that will be executed in > parallel by Spark

Re: Any workaround for Kafka couldn't find leaders for set?

2015-11-23 Thread Cody Koeninger
If you really want to just not process the bad topicpartitions, you can use the version of createDirectStream that takes fromOffsets: Map[TopicAndPartition, Long] and exclude the broken topicpartitions from the map. On Mon, Nov 23, 2015 at 4:54 PM, Hudong Wang wrote: >
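
A hedged sketch of that approach in the Python API; the topic name, broker list, and starting offsets are placeholders, and the broken partition is simply left out of the map:

    from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

    from_offsets = {
        TopicAndPartition("mytopic", 0): 12345,
        TopicAndPartition("mytopic", 1): 67890,
        # the broken partition (e.g. 2) is intentionally omitted
    }
    stream = KafkaUtils.createDirectStream(
        ssc, ["mytopic"], {"metadata.broker.list": "broker1:9092"},
        fromOffsets=from_offsets)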

Re: spark-ec2 script to launch cluster running Spark 1.5.2 built with HIVE?

2015-11-23 Thread Nicholas Chammas
Don't the Hadoop builds include Hive already? Like spark-1.5.2-bin-hadoop2.6.tgz? On Mon, Nov 23, 2015 at 7:49 PM Jeff Schecter wrote: > Hi all, > > As far as I can tell, the bundled spark-ec2 script provides no way to > launch a cluster running Spark 1.5.2 pre-built with

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread Saisai Shao
I think this behavior is expected: since you already have 50 executors launched, there is no need to acquire additional executors. Your change is not solid; it is just hiding the log. Again, I think you should check the logs of YARN and Spark to see if the executors are started correctly. Why resource is

Re: Port Control for YARN-Aware Spark

2015-11-23 Thread Marcelo Vanzin
On Mon, Nov 23, 2015 at 6:24 PM, gpriestley wrote: > Questions I have are: > 1) How does the spark.yarn.am.port relate to defined ports within Spark > (driver, executor, block manager, etc.)? > 2) Does the spark.yarn.am.port parameter only relate to the spark >

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-23 Thread Don Drake
I'm seeing similar slowness in saveAsTextFile(), but only in Python. I'm sorting data in a DataFrame, then transforming it to get an RDD, and then calling coalesce(1).saveAsTextFile(). I converted the Python to Scala and the run times were similar, except for the saveAsTextFile() stage. The Scala version

Re: RDD partition after calling mapToPair

2015-11-23 Thread Thúy Hằng Lê
Thanks Cody. I still have concerns about this. What do you mean by saying the Spark direct stream doesn't have a default partitioner? Could you please explain this a bit more? When I assign 20 cores to 20 Kafka partitions, I am expecting each core to work on a partition. Is that correct? I'm

A question about sql clustering

2015-11-23 Thread Cesar Flores
Let's assume that I have code like the following: val sqlQuery = "select * from whse.table_a cluster by user_id" val df = hc.sql(sqlQuery) My understanding is that CLUSTER BY will partition the data frame by user_id and also sort inside each partition (something very useful for
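
One way to check what CLUSTER BY actually produced, sketched in the Python API under the assumption that hc is a HiveContext and user_id is the clustering column:

    df = hc.sql("select * from whse.table_a cluster by user_id")
    print(df.rdd.getNumPartitions())   # how many partitions the shuffle created

    def partition_is_sorted(rows):
        ids = [r.user_id for r in rows]
        yield ids == sorted(ids)

    # Should print a list of True values if each partition is sorted by user_id.
    print(df.rdd.mapPartitions(partition_is_sorted).collect())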

Spark Kafka Direct Error

2015-11-23 Thread swetha
Hi, I see the following error in my Spark Kafka Direct. Would this mean that Kafka Direct is not able to catch up with the messages and is failing? Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times, most recent failure: Lost task 20.3 in stage 117.0 (TID 2114,

Re: Spark Kafka Direct Error

2015-11-23 Thread Cody Koeninger
No, that means that at the time the batch was scheduled, the kafka leader reported the ending offset was 221572238, but during processing, kafka stopped returning messages before reaching that ending offset. That probably means something got screwed up with Kafka - e.g. you lost a leader and lost

Dataframe constructor

2015-11-23 Thread spark_user_2015
Dear all, is the following usage of the Dataframe constructor correct or does it trigger any side effects that I should be aware of? My goal is to keep track of my dataframe's state and allow custom transformations accordingly. val df: Dataframe = ...some dataframe... val newDf = new

Getting different DESCRIBE results between SparkSQL and Hive

2015-11-23 Thread YaoPau
Example below. The partition columns show up as regular columns. I'll note that SHOW PARTITIONS works correctly in Spark SQL, so it's aware of the partitions but it does not show them in DESCRIBE. In Hive: "DESCRIBE pub.inventory_daily"

Re: RDD partition after calling mapToPair

2015-11-23 Thread Cody Koeninger
Partitioner is an optional field when defining an RDD. KafkaRDD doesn't define one, so you can't really assume anything about the way it's partitioned, because Spark doesn't know anything about the way it's partitioned. If you want to rely on some property of how things were partitioned as they
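
If a specific partitioning is needed downstream, one hedged way to make it explicit rather than assumed, sketched in the Python API (kafka_rdd, extract_key, and the partition count are hypothetical):

    # Key the records and force a hash-partitioned shuffle by that key,
    # instead of relying on how the Kafka RDD happened to arrive.
    pairs = kafka_rdd.map(lambda msg: (extract_key(msg), msg))
    partitioned = pairs.partitionBy(20)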

spark 1.4.1 to oracle 11g write to an existing table

2015-11-23 Thread Siva Gudavalli
Hi, I am trying to write a DataFrame from Spark 1.4.1 to Oracle 11g. I am using dataframe.write.mode(SaveMode.Append).jdbc(url, tablename, properties) but this is always trying to create a table. I would like to insert records into an existing table instead of creating a new one every single time.
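
For comparison, a hedged sketch of the same append in the Python API; the connection URL, table name, and credentials are placeholders, and whether append works against an existing Oracle table can depend on the Spark version and JDBC driver:

    props = {"user": "scott", "password": "tiger",
             "driver": "oracle.jdbc.OracleDriver"}
    df.write.jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL",
                  "MY_SCHEMA.MY_TABLE", mode="append", properties=props)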