RE: DataFrame select non-existing column

2016-11-18 Thread Mendelson, Assaf
In pyspark, for example, you would do something like: df.withColumn("newColName", pyspark.sql.functions.lit(None)). Assaf.
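
A minimal Scala sketch of the same idea (hypothetical names; lit(null) usually needs an explicit cast so the old and new schemas line up):

import org.apache.spark.sql.functions.lit

// Hypothetical: oldDf is missing the column "newColName" that newer data has.
// Adding it as a typed null lets selects (and unions with newer data) succeed.
val patchedDf = oldDf.withColumn("newColName", lit(null).cast("string"))
patchedDf.select("newColName").show()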

Re: DataFrame select non-existing column

2016-11-18 Thread Muthu Jayakumar
Depending on your use case, 'df.withColumn("my_existing_or_new_col", lit(0l))' could work?
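
Note that withColumn also replaces a column that already exists, so if the intent is "add only when missing", a guard along these lines may be safer (a sketch with hypothetical names):

import org.apache.spark.sql.functions.lit

// Add the default only when the column is absent, so existing values
// are not overwritten by withColumn.
val withDefault =
  if (df.columns.contains("my_existing_or_new_col")) df
  else df.withColumn("my_existing_or_new_col", lit(0L))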

Reading LZO files with Spark

2016-11-18 Thread learning_spark
Hi Users, I am not sure about the latest status of this issue: https://issues.apache.org/jira/browse/SPARK-2394 However, I have seen the following link: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/reading-lzo-files.md My experience is limited, but I had had partial
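
For reference, the pattern in that example amounts to reading the files through the hadoop-lzo input format; a rough Scala sketch (assuming the hadoop-lzo jar and native codec are on the classpath, and with a made-up path) could look like:

import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat  // from the hadoop-lzo project

// Read LZO-compressed text files; each record is (byte offset, line of text).
val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
    "hdfs:///path/to/logs/*.lzo")               // hypothetical path
  .map { case (_, text) => text.toString }
lines.take(5).foreach(println)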

Spark driver not reusing HConnection

2016-11-18 Thread Mukesh Jha
Hi, I'm accessing multiple regions (~5k) of an HBase table using Spark's newAPIHadoopRDD, but the driver is trying to calculate the region size of all the regions. It is not even reusing the HConnection and is creating a new connection for every request (see below), which is taking a lot of time. Is
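
For context, the access pattern described is roughly the standard newAPIHadoopRDD scan (a sketch with a hypothetical table name, not the poster's actual code):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// One scan over an HBase table; with ~5k regions this produces ~5k input splits,
// and the driver computes the split/region information up front.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table name

val hbaseRdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
println(hbaseRdd.count())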

Run spark with hadoop snapshot

2016-11-18 Thread lminer
I'm trying to figure out how to run Spark with a snapshot of Hadoop 2.8 that I built myself. I'm unclear on the configuration needed to get Spark to work with the snapshot. I'm running Spark on Mesos. Per the Spark documentation, I run spark-submit as follows using the

How to expose Spark-Shell in the production?

2016-11-18 Thread kant kodali
How to expose spark-shell in production? 1) Should we expose it on master nodes or executor nodes? 2) Should we simply give access to those machines and the spark-shell binary? What is the recommended way? Thanks!

Re: DataFrame select non-existing column

2016-11-18 Thread Kristoffer Sjögren
Thanks for your answer. I have been searching the API for a way to do that but could not find how. Could you give me a code snippet?

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Rabin Banerjee
Hi Yong, but every time val tabdf = sqlContext.table(tablename) is called, tabdf.rdd has a new id, which can be checked by calling tabdf.rdd.id. And, https://github.com/apache/spark/blob/b6de0c98c70960a97b07615b0b08fbd8f900fbe7/core/src/main/scala/org/apache/spark/SparkContext.scala#L268
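
One way to see what is actually shared (a sketch): caching by table name goes through the catalog, so repeated sqlContext.table(...) calls can resolve to the same cached data even though each call returns a new DataFrame wrapping a new RDD object with a new id:

// Cache by table name once; later reads of the same table reuse the cached data.
sqlContext.cacheTable("tablename")

val df1 = sqlContext.table("tablename")
val df2 = sqlContext.table("tablename")
println(df1.rdd.id == df2.rdd.id)          // false: a new RDD object each call
println(sqlContext.isCached("tablename"))  // true: the table data is cached once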

RE: DataFrame select non-existing column

2016-11-18 Thread Mendelson, Assaf
You can always add the columns to old dataframes, giving them null (or some literal) as a preprocessing step.

Successful streaming with ibm/ mq to flume then to kafka and finally spark streaming

2016-11-18 Thread Mich Talebzadeh
Hi, can someone share their experience of feeding data from IBM MQ messages into Flume, then from Flume to Kafka, and using Spark Streaming on it? Any issues and things to be aware of? Thanks, Dr Mich Talebzadeh

Re: Issue in application deployment on spark cluster

2016-11-18 Thread Asmeet
Hello Anjali, the documentation at http://spark.apache.org/docs/latest/submitting-applications.html says "Currently, standalone mode does not support cluster mode for Python applications." Does this relate to your problem/query? Regards, Asmeet

Re: How to load only the data of the last partition

2016-11-18 Thread Rabin Banerjee
Hi, in order to do that you can write code to list an HDFS directory first, then list its sub-directories. In this way, using custom logic, first identify the latest year/month/version, then read the Avro in that directory into a DataFrame, then add year/month/version to that DataFrame using withColumn.
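
A rough sketch of that approach (assuming a year/month/version directory layout and the spark-avro package; the paths and the helper are made up for illustration):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.lit

// Hypothetical helper: list a directory's sub-directories and keep the
// lexicographically largest name, e.g. the latest year, month or version.
def latestChild(fs: FileSystem, dir: String): String =
  fs.listStatus(new Path(dir)).filter(_.isDirectory).map(_.getPath.getName).max

val fs      = FileSystem.get(sc.hadoopConfiguration)
val base    = "/data/events"                    // hypothetical base directory
val year    = latestChild(fs, base)
val month   = latestChild(fs, s"$base/$year")
val version = latestChild(fs, s"$base/$year/$month")

val df = sqlContext.read
  .format("com.databricks.spark.avro")          // spark-avro package
  .load(s"$base/$year/$month/$version")
  .withColumn("year", lit(year))
  .withColumn("month", lit(month))
  .withColumn("version", lit(version))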

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Yong Zhang
That's correct, as long as you don't change the StorageLevel. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166 Yong

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
Thanks for the input. I had read somewhere that s3:// was the way to go due to some recent changes, but apparently that was outdated. I’m working on creating some dummy data and a script to process it right now. I’ll post here with code and logs when I can successfully reproduce the issue on

Re: sort descending with multiple columns

2016-11-18 Thread Rabin Banerjee
++Stuart val colList = df.columns can be used

Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Rabin Banerjee
Hi All, I am working on a project where the code is divided into multiple reusable modules. I am not able to understand Spark persist/cache in that context. My question is: will Spark cache a table once even if I call read/cache on the same table multiple times? Sample Code :: TableReader::

Re: Long-running job OOMs driver process

2016-11-18 Thread Yong Zhang
Just wondering, is it possible the memory usage keeps going up due to the web UI content? Yong
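
If the UI does turn out to be the culprit, the amount of history the driver retains for the web UI is configurable (a sketch; the numbers are only illustrative):

import org.apache.spark.SparkConf

// Bound how many finished jobs/stages the driver keeps for the web UI,
// which limits the memory the UI state can consume in a long-running driver.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")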

Re: Long-running job OOMs driver process

2016-11-18 Thread Alexis Seigneurin
+1 for using S3A. It would also depend on what format you're using. I agree with Steve that Parquet, for instance, is a good option. If you're using plain text files, some people use GZ files but they cannot be partitioned, thus putting a lot of pressure on the driver. It doesn't look like this

Issue in application deployment on spark cluster

2016-11-18 Thread Anjali Gautam
Hello everybody, I am new to Apache Spark. I have created an application in Python which works well on Spark locally but is not working properly when deployed on a standalone Spark cluster. Can anybody comment on this behaviour of the application? Also, if Spark Python code requires making some

Re: Long-running job OOMs driver process

2016-11-18 Thread Nathan Lande
+1 to not threading. What does your load look like? If you are loading many files and caching them in N RDDs rather than 1 RDD, this could be an issue. If the above two things don't fix your OOM issue, without knowing anything else about your job, I would focus on your caching strategy as a
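
To illustrate the "1 RDD rather than N" point, a sketch with hypothetical paths: loading many files into separately cached datasets multiplies the cached partitions and bookkeeping, whereas a single union can be cached once:

// One cached union instead of N separately cached DataFrames (one per file).
val paths = Seq("/data/f1.parquet", "/data/f2.parquet", "/data/f3.parquet")  // hypothetical

// Instead of: paths.map(p => spark.read.parquet(p).cache())
val all = paths.map(p => spark.read.parquet(p)).reduce(_ union _).cache()
println(all.count())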

Re: Long-running job OOMs driver process

2016-11-18 Thread Steve Loughran
On 18 Nov 2016, at 14:31, Keith Bourgoin wrote: "We thread the file processing to amortize the cost of things like getting files from S3." Define cost here: actual $ amount, or merely time to read the data? If it's read times, you should really be

Re: Any with S3 experience with Spark? Having ListBucket issues

2016-11-18 Thread Steve Loughran
On 16 Nov 2016, at 22:34, Edden Burrow wrote: "Anyone dealing with a lot of files with spark? We're trying s3a with 2.0.1 because we're seeing intermittent errors in S3 where jobs fail and saveAsTextFile fails. Using pyspark." How many

Re: sort descending with multiple columns

2016-11-18 Thread Stuart White
Is this what you're looking for?

val df = Seq(
  (1, "A"),
  (1, "B"),
  (1, "C"),
  (2, "D"),
  (3, "E")
).toDF("foo", "bar")

val colList = Seq("foo", "bar")
df.sort(colList.map(col(_).desc): _*).show

+---+---+
|foo|bar|
+---+---+
|  3|  E|
|  2|  D|
|  1|  C|
|  1|  B|
|  1|  A|
+---+---+

DataFrame select non-existing column

2016-11-18 Thread Kristoffer Sjögren
Hi, we have evolved a DataFrame by adding a few columns, but we cannot write select statements on these columns for older data that doesn't have them, since they fail with an AnalysisException with the message "No such struct field". We also tried dropping columns, but this doesn't work for nested columns.

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
Hi Alexis, Thanks for the response. I've been working with Irina on trying to sort this issue out. We thread the file processing to amortize the cost of things like getting files from S3. It's a pattern we've seen recommended in many places, but I don't have any of those links handy. The

Sporadic ClassNotFoundException with Kryo

2016-11-18 Thread chrism
Regardless of the different ways we have tried deploying a jar together with Spark, when running a Spark Streaming job with Kryo as the serializer on top of Mesos, we sporadically get the following error (I have truncated it a bit): 16/11/18 08:39:10 ERROR OneForOneBlockFetcher: Failed while starting

RE: CSV to parquet preserving partitioning

2016-11-18 Thread benoitdr
This is more or less how I'm doing it now. The problem is that it creates shuffling in the cluster because the input data are not collocated according to the partition scheme. If I reload the output parquet files as a new dataframe, then everything is fine, but I'd like to avoid shuffling also during
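
One common way to collocate the data with the output layout before writing (a sketch; whether the extra repartition shuffle is acceptable depends on the job) is to repartition by the same columns used in partitionBy:

import org.apache.spark.sql.functions.col

// Shuffle once so each task holds a single (year, month) combination, then
// write with a matching directory layout; reloading the parquet output later
// preserves that partitioning scheme.
df.repartition(col("year"), col("month"))  // hypothetical partition columns
  .write
  .partitionBy("year", "month")
  .parquet("/out/events")                  // hypothetical output path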

Kafka direct approach,App UI shows wrong input rate

2016-11-18 Thread Julian Keppel
Hello, I use Spark 2.0.2 with Kafka integration 0-8. The Kafka version is 0.10.0.1 (Scala 2.11). I read data from Kafka with the direct approach. The complete infrastructure runs on Google Container Engine. I wonder why the corresponding application UI says the input rate is zero records per

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-18 Thread kant kodali
This seems to work:

import org.apache.spark.sql._
val rdd = df2.rdd.map { case Row(j: String) => j }
spark.read.json(rdd).show()

However, I wonder if there is any inefficiency here, since I have to apply this function to a billion rows.
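
If only a few fields are needed, an alternative worth considering is get_json_object, which pulls individual fields out of the JSON string column without a separate spark.read.json pass (a sketch; the column and field names are hypothetical):

import org.apache.spark.sql.functions.{col, get_json_object}

// Extract specific fields straight from the JSON string column.
val flattened = df2.select(
  get_json_object(col("json"), "$.id").alias("id"),
  get_json_object(col("json"), "$.user.name").alias("user_name"))
flattened.show()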

Re: How to load only the data of the last partition

2016-11-18 Thread Samy Dindane
Thank you Daniel. Unfortunately, we don't use Hive but bare (Avro) files. On 11/17/2016 08:47 PM, Daniel Haviv wrote: Hi Samy, if you're working with Hive you could create a partitioned table and update its partitions' locations to the last version, so when you query it using Spark,

Re: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

2016-11-18 Thread Phillip Henry
Looks like a standard "not enough memory" issue. I can only recommend the usual advice of increasing the number of partitions to give you a quick win. Also, your JVMs have an enormous amount of memory, which may cause long GC pause times. You might like to try reducing the memory to about 20 GB and
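
The "more partitions" advice usually translates into something like this (a sketch; the numbers are only illustrative and the right values depend on the data size):

// Raise the parallelism of shuffle stages (joins, aggregations) for SQL/DataFrames...
spark.conf.set("spark.sql.shuffle.partitions", "800")

// ...or explicitly split a DataFrame into more, smaller partitions so that
// each task fits comfortably in executor memory.
val repartitioned = df.repartition(800)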