In pyspark for example you would do something like:
df.withColumn("newColName",pyspark.sql.functions.lit(None))
Assaf.
-Original Message-
From: Kristoffer Sjögren [mailto:sto...@gmail.com]
Sent: Friday, November 18, 2016 9:19 PM
To: Mendelson, Assaf
Cc: user
Subject: Re: DataFrame select
Depending on your use case, 'df.withColumn("my_existing_or_new_col",
lit(0l))' could work?
On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren
wrote:
> Thanks for your answer. I have been searching the API for a way to do that,
> but I could not find how to do it.
>
> Could you give me a code snippet?
Hi Users,
I am not sure about the latest status of this issue:
https://issues.apache.org/jira/browse/SPARK-2394
However, I have seen the following link:
https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/reading-lzo-files.md
My experience is limited, but I had had partial succe
Hi
I'm accessing multiple regions (~5k) of an HBase table using spark's
newAPIHadoopRDD. But the driver is trying to calculate the region size of
all the regions.
It is not even reusing the HConnection and is creating a new connection for
every request (see below), which takes a lot of time.
Is the
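For reference, a minimal sketch of this kind of read, assuming spark-shell (sc in scope), the HBase client jars on the classpath, and a placeholder table name:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Hypothetical table name; the real table here has ~5k regions.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// TableInputFormat produces roughly one split per region, which is why the
// driver touches every region while planning the job.
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hbaseRDD.count())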
I'm trying to figure out how to run spark with a snapshot of Hadoop 2.8 that
I built myself. I'm unclear on the configuration needed to get spark to work
with the snapshot.
I'm running spark on mesos. Per the spark documentation, I run spark-submit
as follows using the `spark-2.0.2-bin-without-had
How should we expose Spark-Shell in production?
1) Should we expose it on master nodes or executor nodes?
2) Should we simply give access to those machines and the Spark-Shell binary?
What is the recommended way?
Thanks!
Thanks for your answer. I have been searching the API for a way to do that,
but I could not find how to do it.
Could you give me a code snippet?
On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf
wrote:
> You can always add the columns to old dataframes giving them null (or some
> literal) as a preproc
Hi Yong,
But every time val tabdf = sqlContext.table(tablename) is called, tabdf.rdd
has a new id, which can be checked by calling tabdf.rdd.id.
And,
https://github.com/apache/spark/blob/b6de0c98c70960a97b07615b0b08fbd8f900fbe7/core/src/main/scala/org/apache/spark/SparkContext.scala#L268
You can always add the columns to the old dataframes, giving them null (or some
literal), as a preprocessing step.
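For example, a short Scala sketch (oldDf, the column name, and the type are placeholders) that does this with an explicit cast so the null column gets a concrete type:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Add the new column to the old DataFrame, filled with nulls,
// so old and new data end up with the same schema.
val patchedDf = oldDf.withColumn("newColName", lit(null).cast(StringType))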
-Original Message-
From: Kristoffer Sjögren [mailto:sto...@gmail.com]
Sent: Friday, November 18, 2016 4:32 PM
To: user
Subject: DataFrame select non-existing column
Hi
We have evolve
Hi,
Can someone share their experience of feeding data from IBM MQ messages
into Flume, then from Flume to Kafka, and using Spark Streaming on it?
Any issues and things to be aware of?
thanks
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPC
Hello Anjali,
According to the documentation at the following URL
http://spark.apache.org/docs/latest/submitting-applications.html
it says "Currently, standalone mode does not support cluster mode for Python
applications."
Does this relate to your problem/query?
Regards,
Asmeet
> On 18-
Hi,
In order to do that, you can write code to read/list an HDFS directory first,
then list its sub-directories. In this way, using custom logic, first
identify the latest year/month/version, then read the Avro in that directory
into a DF, then add year/month/version to that DF using withColumn (see the
sketch below).
Regard
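A rough sketch of that approach, assuming spark-shell (spark in scope), the spark-avro package on the classpath, and a hypothetical base path laid out as <base>/<year>/<month>/<version>:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.lit

val basePath = "hdfs:///data/events"   // hypothetical layout: <base>/<year>/<month>/<version>
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// List the sub-directories of a directory and pick the "latest" one by name.
def latest(dir: String): String =
  fs.listStatus(new Path(dir)).filter(_.isDirectory).map(_.getPath.getName).max

val year    = latest(basePath)
val month   = latest(s"$basePath/$year")
val version = latest(s"$basePath/$year/$month")

// Read the Avro files in the chosen directory and tag the rows.
val df = spark.read.format("com.databricks.spark.avro")
  .load(s"$basePath/$year/$month/$version")
  .withColumn("year", lit(year))
  .withColumn("month", lit(month))
  .withColumn("version", lit(version))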
That's correct, as long as you don't change the StorageLevel.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166
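A tiny illustration of the RDD-level rule that the linked line enforces (the input path is a placeholder):

val lines = sc.textFile("hdfs:///some/path")   // hypothetical input

lines.cache()   // assigns StorageLevel.MEMORY_ONLY
lines.cache()   // calling it again with the same level is a no-op

// Asking for a *different* StorageLevel after one has been assigned is rejected:
// lines.persist(StorageLevel.MEMORY_AND_DISK)   // throws UnsupportedOperationException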
Yong
From: Rabin Banerjee
Sent: Friday, November 18, 2016 10:36 AM
To: user; Mich Talebzadeh; Tat
Thanks for the input. I had read somewhere that s3:// was the way to go due
to some recent changes, but apparently that was outdated. I’m working on
creating some dummy data and a script to process it right now. I’ll post
here with code and logs when I can successfully reproduce the issue on
non-pr
++Stuart
val colList = df.columns
can be used
On Fri, Nov 18, 2016 at 8:03 PM, Stuart White
wrote:
> Is this what you're looking for?
>
> val df = Seq(
> (1, "A"),
> (1, "B"),
> (1, "C"),
> (2, "D"),
> (3, "E")
> ).toDF("foo", "bar")
>
> val colList = Seq("foo", "bar")
> df.sort(colList.map(col(_).desc): _*).show
Hi All,
I am working on a project where the code is divided into multiple reusable
modules. I am not able to understand Spark persist/cache in that context.
My question is: will Spark cache a table once even if I call read/cache on the
same table multiple times?
Sample Code ::
TableReader::
Just wondering, is it possible the memory usage keeps going up due to the web
UI content?
Yong
From: Alexis Seigneurin
Sent: Friday, November 18, 2016 10:17 AM
To: Nathan Lande
Cc: Keith Bourgoin; Irina Truong; u...@spark.incubator.apache.org
Subject: Re: Lo
+1 for using S3A.
It would also depend on what format you're using. I agree with Steve that
Parquet, for instance, is a good option. If you're using plain text files,
some people use GZ files but they cannot be partitioned, thus putting a lot
of pressure on the driver. It doesn't look like this is
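A small sketch of output that avoids the unsplittable-gzip problem, with a hypothetical DataFrame df, a hypothetical s3a bucket, and a hypothetical "date" column to partition by:

// Parquet is splittable and column-pruned; partitioning by a column keeps
// individual objects at a manageable size.
df.write
  .mode("overwrite")
  .partitionBy("date")
  .parquet("s3a://my-bucket/events-parquet/")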
Hello Everybody,
I am new to Apache Spark. I have created an application in Python which works
well on Spark locally but is not working properly when deployed on a standalone
Spark cluster.
Can anybody comment on this behaviour of the application? Also, if Spark Python
code requires making some
+1 to not threading.
What does your load look like? If you are loading many files and caching
them in N RDDs rather than 1 RDD, this could be an issue.
If the above two things don't fix your OOM issue, without knowing anything
else about your job, I would focus on your caching strategy as a pote
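To illustrate the "N RDDs vs 1 RDD" point (the path is a placeholder), reading all the files through one glob and caching a single dataset is usually cheaper than caching one RDD per file:

// Instead of looping over files and caching each one separately:
//   val rdds = files.map(f => sc.textFile(f).cache())   // N cached RDDs
// read them as one dataset and cache it once:
val all = spark.read.text("s3a://my-bucket/input/*").cache()
println(all.count())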
On 18 Nov 2016, at 14:31, Keith Bourgoin <ke...@parsely.com> wrote:
We thread the file processing to amortize the cost of things like getting files
from S3.
Define cost here: actual $ amount, or merely time to read the data?
If it's read times, you should really be trying the new stuff
On 16 Nov 2016, at 22:34, Edden Burrow <eddenbur...@gmail.com> wrote:
Anyone dealing with a lot of files with Spark? We're trying s3a with 2.0.1
because we're seeing intermittent errors in S3 where jobs fail and
saveAsTextFile fails. Using pyspark.
How many files? Thousands? Millions
Is this what you're looking for?
val df = Seq(
(1, "A"),
(1, "B"),
(1, "C"),
(2, "D"),
(3, "E")
).toDF("foo", "bar")
val colList = Seq("foo", "bar")
df.sort(colList.map(col(_).desc): _*).show
+---+---+
|foo|bar|
+---+---+
| 3| E|
| 2| D|
| 1| C|
| 1| B|
| 1| A|
+---+---+
On
Hi
We have evolved a DataFrame by adding a few columns but cannot write
select statements on these columns for older data that doesn't have
them, since they fail with an AnalysisException with the message "No such
struct field".
We also tried dropping columns but this doesn't work for nested columns.
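One workaround, sketched here with placeholder names (oldDf, new_col), is to check the schema and add a missing top-level column as nulls before selecting; nested struct fields are harder, as noted:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// If the older data is missing the evolved column, add it as nulls so the
// same select works on both old and new DataFrames.
val fixed =
  if (oldDf.columns.contains("new_col")) oldDf
  else oldDf.withColumn("new_col", lit(null).cast(StringType))

fixed.select("new_col").show()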
Hi Alexis,
Thanks for the response. I've been working with Irina on trying to sort
this issue out.
We thread the file processing to amortize the cost of things like getting
files from S3. It's a pattern we've seen recommended in many places, but I
don't have any of those links handy. The problem
Regardless of the different ways we have tried deploying a jar together with
Spark, when running a Spark Streaming job with Kryo as the serializer on top of
Mesos, we sporadically get the following error (I have truncated it a bit):
16/11/18 08:39:10 ERROR OneForOneBlockFetcher: Failed while starting bl
This is more or less how I'm doing it now.
Problem is that it creates shuffling in the cluster because the input data
are not collocated according to the partition scheme.
If I reload the output parquet files as a new dataframe, then everything is
fine, but I'd like to avoid shuffling also during
Hello,
I use Spark 2.0.2 with Kafka integration 0-8. The Kafka version is 0.10.0.1
(Scala 2.11). I read data from Kafka with the direct approach. The complete
infrastructure runs on Google Container Engine.
I wonder why the corresponding application UI says the input rate is zero
records per seco
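For context, a minimal sketch of the 0-8 direct approach being described, with placeholder broker and topic names, sc from spark-shell, and the spark-streaming-kafka-0-8 artifact assumed on the classpath:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("my_topic")

// Direct stream: no receiver; each Kafka partition maps to one RDD partition.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).foreachRDD(rdd => println(rdd.count()))

ssc.start()
ssc.awaitTermination()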
This seems to work:
import org.apache.spark.sql._
val rdd = df2.rdd.map { case Row(j: String) => j }
spark.read.json(rdd).show()
However, I wonder if there is any inefficiency here, since I have to apply this
function to billions of rows.
Thank you Daniel. Unfortunately, we don't use Hive but bare (Avro) files.
On 11/17/2016 08:47 PM, Daniel Haviv wrote:
Hi Samy,
If you're working with Hive you could create a partitioned table and update
its partitions' locations to the last version, so when you query it using
Spark, you'll
Looks like a standard "not enough memory" issue. I can only recommend the
usual advice of increasing the number of partitions to give you a quick-win.
Also, your JVMs have an enormous amount of memory. This may cause long GC
pause times. You might like to try reducing the memory to about 20gb and
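For illustration, the two knobs mentioned, with hypothetical values and a hypothetical inputRdd:

// Executor memory is set before the job starts, e.g.:
//   spark-submit --executor-memory 20g ...

// Within the job, more partitions means smaller tasks, which is the usual
// quick win for memory pressure (400 is just a starting point to tune):
val repartitioned = inputRdd.repartition(400)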