Hi
I have an RDD that needs to be sorted lexicographically and
then processed by partition. The partitions should be split into
ranged blocks where sorted order is maintained, with each partition
containing sequential, non-overlapping keys.
Given keys (1,2,3,4,5,6)
1. Correct
- 2
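A minimal sketch of one way to get these properties, assuming the Java API
and an existing JavaSparkContext named ctx (names and sample data are
illustrative): sortByKey range-partitions the RDD, so each partition ends up
holding a contiguous, non-overlapping key range in sorted order, which can
then be processed per block.

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
import java.util.Arrays;

JavaPairRDD<String, String> pairs = ctx.parallelizePairs(Arrays.asList(
    new Tuple2<>("3", "c"), new Tuple2<>("1", "a"), new Tuple2<>("5", "e"),
    new Tuple2<>("2", "b"), new Tuple2<>("6", "f"), new Tuple2<>("4", "d")));

// sortByKey uses a RangePartitioner, so each partition covers a contiguous
// key range, e.g. ("1","2") | ("3","4") | ("5","6")
JavaPairRDD<String, String> sorted = pairs.sortByKey(true, 3);

sorted.foreachPartition(block -> {
    while (block.hasNext()) {
        Tuple2<String, String> entry = block.next();
        // process the ranged block here, in sorted order
    }
});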
ss structure and work on that (then you can do withColumn("mobile",...)
> instead of "pass.mobile") but this would change the schema.
>
>
> -Original Message-
> From: Kristoffer Sjögren [mailto:sto...@gmail.com]
> Sent: Saturday, November 19, 2016 4:57
for example you would do something like:
>
> df.withColumn("newColName",pyspark.sql.functions.lit(None))
>
> Assaf.
> -Original Message-
> From: Kristoffer Sjögren [mailto:sto...@gmail.com]
> Sent: Friday, November 18, 2016 9:19 PM
> To: Mendelson, Assaf
> Cc:
them null (or some
> literal) as a preprocessing step.
>
> -Original Message-
> From: Kristoffer Sjögren [mailto:sto...@gmail.com]
> Sent: Friday, November 18, 2016 4:32 PM
> To: user
> Subject: DataFrame select non-existing column
>
Hi
We have evolved a DataFrame by adding a few columns, but we cannot write
select statements on these columns for older data that doesn't have
them, since they fail with an AnalysisException with the message "No such
struct field".
We also tried dropping columns but this doesn't work for nested columns.
for the actual split to occur?
Any pointers?
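For reference, a minimal sketch of the null-literal preprocessing suggested
earlier in the thread, assuming Spark 2.x's Java API and the top-level
column name "mobile" from the discussion; a nested field like "pass.mobile"
would instead need the struct rebuilt, as noted above.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.lit;
import java.util.Arrays;

// add the column as a typed null literal when old data lacks it, so the
// same select works across schema versions
Dataset<Row> normalized = Arrays.asList(df.columns()).contains("mobile")
        ? df
        : df.withColumn("mobile", lit(null).cast(DataTypes.StringType));

normalized.select("mobile").show();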
On Tue, Jun 14, 2016 at 4:03 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> I'm pretty confident the lines are encoded correctly since I can read
> them both locally and on Spark (by ignoring the faulty line and
> proceed to ne
Spark, save it as a new text file, then try decoding again.
context.textFile("/orgfile").saveAsTextFile("/newfile");
Ok, not much left to do other than some remote debugging.
On Tue, Jun 14, 2016 at 3:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
Thanks for your help. Really appreciate it!
Give me some time, I'll come back after I've tried your suggestions.
On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
I cannot reproduce it by running the file through Spark in local mode
on my machine. So it does indeed seem to be something related to the
split across partitions.
On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> Can you do remote debugging in Spark? Di
ging, that's where I would start if you can. Breakpoints
> around how TextInputFormat is parsing lines. See if you can catch it
> when it returns a line that doesn't contain what you expect.
>
> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>
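One way to attach a debugger to executors, sketched with the standard
spark.executor.extraJavaOptions setting; the port and suspend flag are
placeholders, and each executor JVM needs its own reachable port.
Breakpoints around TextInputFormat's line reading can then be set from the
attached debugger.

import org.apache.spark.SparkConf;

// start every executor JVM with a JDWP agent so a remote debugger can attach
SparkConf conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005");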
That's funny. The line after it is the rest of the whole line that got
split in half. Every line after that is fine.
I managed to reproduce it without gzip as well, so maybe it's not gzip's
fault after all...
I'm clueless...
On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren <sto...@gmail.
Hi
We have log files that are written as base64 encoded text files
(gzipped) where each line is terminated with a newline character.
For some reason a particular line [1] is split by Spark [2], making it
unparsable by the base64 decoder. It does this consistently no matter
if I give it the
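For context, a minimal sketch of the read-and-decode pipeline described
above, assuming an existing JavaSparkContext named ctx and a hypothetical
path; TextInputFormat decompresses the gzip transparently and splits
records on the newline terminator.

import org.apache.spark.api.java.JavaRDD;
import java.util.Base64;

// a line split in half, as described above, will typically fail to decode
// here with an IllegalArgumentException
JavaRDD<byte[]> decoded = ctx.textFile("/logs/events.b64.gz")
    .map(line -> Base64.getDecoder().decode(line));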
Hi
We have an RDD that needs to be mapped with information from
HBase, where the exact key is the user id.
What are the different alternatives for doing this?
- Is it possible to do HBase.get() requests from a map function in Spark?
- Or should we join RDDs with a full HBase table scan?
I ask
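On the first alternative, the usual pattern is one HBase connection per
partition rather than per record. A sketch assuming a hypothetical
JavaRDD<String> of user ids named userIds, the HBase 1.x client API, and
made-up table/column names; on Spark 1.x the mapPartitions function returns
an Iterable rather than an Iterator.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;
import java.util.ArrayList;
import java.util.List;

JavaRDD<String> enriched = userIds.mapPartitions(ids -> {
    // one connection per partition, reused for every get in the block
    Configuration conf = HBaseConfiguration.create();
    List<String> out = new ArrayList<>();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
        while (ids.hasNext()) {
            String id = ids.next();
            Result row = table.get(new Get(Bytes.toBytes(id)));
            byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            out.add(id + "," + (name == null ? "" : Bytes.toString(name)));
        }
    }
    return out.iterator();
});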
/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseRDDFunctions.scala
>
> Cheers
>
Hi
I'm trying to understand how to look up certain id fields of RDDs in an
external mapping table. The table is accessed through a two-way binary
tcp client where an id is provided and an entry is returned. Entries
cannot be listed/scanned.
What's the simplest way of managing the tcp client and its
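Without knowing the client, one common shape is to scope it per partition
with mapPartitions. A sketch where TcpClient, its lookup method, the
endpoint, and the JavaRDD<String> named ids are all hypothetical stand-ins
for the real binary client.

import org.apache.spark.api.java.JavaRDD;
import java.util.ArrayList;
import java.util.List;

JavaRDD<String> resolved = ids.mapPartitions(it -> {
    // TcpClient is a hypothetical wrapper around the binary protocol;
    // open one per partition so tasks never serialize a live socket
    TcpClient client = new TcpClient("mapping-host", 9000); // placeholder endpoint
    List<String> out = new ArrayList<>();
    try {
        while (it.hasNext()) {
            out.add(client.lookup(it.next())); // one request/response per id
        }
    } finally {
        client.close();
    }
    return out.iterator();
});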
Hi
Is there a way to execute Spark jobs with Java 8 lambdas instead of
using anonymous inner classes as seen in the examples?
I think I remember seeing real lambdas in the examples before and in
articles [1]?
Cheers,
-Kristoffer
[1]
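For comparison, a small sketch assuming a JavaRDD<String> named lines; the
lambda form works because Spark's Java function interfaces each declare a
single abstract method.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// anonymous inner class style, as in the examples
JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
    @Override
    public Integer call(String s) {
        return s.length();
    }
});

// the same thing as a Java 8 lambda
JavaRDD<Integer> lengths8 = lines.map(s -> s.length());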
Hi
I have trouble executing a really simple Java job on Spark 1.0.0-cdh5.1.0
that runs inside a Docker container:
SparkConf sparkConf = new
SparkConf().setAppName("TestApplication").setMaster("spark://localhost:7077");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines =
under CDH5.
They could use Mesos or the standalone scheduler to run them
On Tue, May 6, 2014 at 6:16 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
Hi
I just read an article [1] about Spark, CDH5 and Java 8 but did not get
exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark
using a separate JVM that runs on data nodes or is it reusing the YARN JVM
runtime somehow, like Hadoop 1?
CDH5 only supports Java 7 [2] as far as I