Sorted partition ranges without overlap

2017-03-13 Thread Kristoffer Sjögren
Hi I have an RDD that needs to be sorted lexicographically and then processed by partition. The partitions should be split into ranged blocks where sorted order is maintained and each partition contains sequential, non-overlapping keys. Given keys (1,2,3,4,5,6) 1. Correct - 2
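One way to get this behavior (a minimal sketch, not taken from the thread; the key/value types and partition count are assumed): sortByKey() range-partitions the pair RDD, so each partition ends up holding a sorted, contiguous, non-overlapping key range that can then be processed per partition.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SortedRanges {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SortedRanges").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("4", 4), new Tuple2<>("1", 1), new Tuple2<>("6", 6),
        new Tuple2<>("3", 3), new Tuple2<>("2", 2), new Tuple2<>("5", 5)), 3);
    // sortByKey uses a RangePartitioner: keys are sorted lexicographically and
    // each of the 3 partitions covers a contiguous, non-overlapping key range.
    JavaPairRDD<String, Integer> sorted = pairs.sortByKey(true, 3);
    // Process each ranged block separately while the sorted order is preserved.
    sorted.foreachPartition(it -> it.forEachRemaining(System.out::println));
    sc.stop();
  }
}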

Re: DataFrame select non-existing column

2016-11-20 Thread Kristoffer Sjögren
ss structure and work on that (then you can do withColumn("mobile",...) > instead of "pass.mobile") but this would change the schema. > > > -Original Message- > From: Kristoffer Sjögren [mailto:sto...@gmail.com] > Sent: Saturday, November 19, 2016 4:57

Re: DataFrame select non-existing column

2016-11-19 Thread Kristoffer Sjögren
for example you would do something like: > > df.withColumn("newColName",pyspark.sql.functions.lit(None)) > > Assaf. > -Original Message- > From: Kristoffer Sjögren [mailto:sto...@gmail.com] > Sent: Friday, November 18, 2016 9:19 PM > To: Mendelson, Assaf > Cc:

Re: DataFrame select non-existing column

2016-11-18 Thread Kristoffer Sjögren
them null (or some > literal) as a preprocessing. > > -Original Message- > From: Kristoffer Sjögren [mailto:sto...@gmail.com] > Sent: Friday, November 18, 2016 4:32 PM > To: user > Subject: DataFrame select non-existing column > > Hi > > We have evolved a DataFram

DataFrame select non-existing column

2016-11-18 Thread Kristoffer Sjögren
Hi We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them, since they fail with an AnalysisException with the message "No such struct field". We also tried dropping columns but this doesn't work for nested columns.
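A minimal Java sketch of the workaround discussed in the replies (adding the missing column as a null literal via withColumn before selecting); the Spark 2.x API and a flat column named "mobile" are assumptions for illustration, and nested struct fields, which the thread also runs into, are not covered here.

import static org.apache.spark.sql.functions.lit;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AddMissingColumn {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("AddMissingColumn").master("local[2]").getOrCreate();
    // Old data written before the schema gained the "mobile" column.
    Dataset<Row> oldData = spark.range(3).toDF("id");
    // Selecting "mobile" directly would fail with AnalysisException, so add it
    // as a typed null literal whenever it is absent from the schema.
    Dataset<Row> patched = Arrays.asList(oldData.columns()).contains("mobile")
        ? oldData
        : oldData.withColumn("mobile", lit(null).cast("string"));
    patched.select("id", "mobile").show();
    spark.stop();
  }
}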

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
for the actual split could occur? Any pointers? On Tue, Jun 14, 2016 at 4:03 PM, Kristoffer Sjögren <sto...@gmail.com> wrote: > I'm pretty confident the lines are encoded correctly since I can read > them both locally and on Spark (by ignoring the faulty line and > proceed to ne

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Spark, save as new text file, then try decoding again. context.textFile("/orgfile").saveAsTextFile("/newfile"); Ok, not much left but to do some remote debugging. On Tue, Jun 14, 2016 at 3:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote: > Thanks for your help. R

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Thanks for your help. Really appreciate it! Give me some time, I'll come back after I've tried your suggestions. On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren <sto...@gmail.com> wrote: > I cannot reproduce it by running the file through Spark in local mode > on my machine. So it

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
I cannot reproduce it by running the file through Spark in local mode on my machine. So it does indeed seem to be something related to the split across partitions. On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren <sto...@gmail.com> wrote: > Can you do remote debugging in Spark? Di

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
ging, that's where I would start if you can. Breakpoints > around how TextInputFormat is parsing lines. See if you can catch it > when it returns a line that doesn't contain what you expect. > > On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote: >>

Re: Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
That's funny. The line after is the rest of the whole line that got split in half. Every following line after that is fine. I managed to reproduce without gzip also, so maybe it's not gzip's fault after all... I'm clueless... On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren <sto...@gmail.

Spark corrupts text lines

2016-06-14 Thread Kristoffer Sjögren
Hi We have log files that are written as base64-encoded text files (gzipped) where each line is ended with a newline character. For some reason a particular line [1] is split by Spark [2], making it unparsable by the base64 decoder. It does this consistently no matter if I give it the
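For context, a minimal sketch of the read/decode pattern being described; the input path and what is done with the decoded bytes are assumptions, not taken from the thread.

import java.util.Base64;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DecodeLogLines {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("DecodeLogLines").setMaster("local[2]"));
    // Each .gz file is read as a single split (gzip is not splittable), one
    // record per newline-terminated, base64-encoded line.
    JavaRDD<String> lines = sc.textFile("/logs/*.gz");   // assumed path
    JavaRDD<byte[]> payloads = lines.map(line -> Base64.getDecoder().decode(line.trim()));
    System.out.println("decoded records: " + payloads.count());
    sc.stop();
  }
}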

Spark and HBase RDD join/get

2016-01-14 Thread Kristoffer Sjögren
Hi We have an RDD that needs to be mapped with information from HBase, where the exact key is the user id. What are the alternatives for doing this? - Is it possible to do HBase.get() requests from a map function in Spark? - Or should we join RDDs with a full HBase table scan? I ask
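A rough sketch of the first alternative, issuing point gets from inside the tasks with one connection per partition; the table name, column family/qualifier, and the Spark 2.x Java API (mapPartitions returning an Iterator) are assumptions for illustration.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;

public class HBaseLookup {
  // Enrich an RDD of user ids with one HBase Get per id, opening a single
  // connection per partition (connections are not serializable, so they must
  // be created inside the task rather than on the driver).
  static JavaRDD<String> enrich(JavaRDD<String> userIds) {
    return userIds.mapPartitions(ids -> {
      Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
      Table table = conn.getTable(TableName.valueOf("users"));   // assumed table name
      List<String> out = new ArrayList<>();
      while (ids.hasNext()) {
        String id = ids.next();
        Result r = table.get(new Get(Bytes.toBytes(id)));
        byte[] name = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("name")); // assumed family/qualifier
        out.add(id + "\t" + (name == null ? "" : Bytes.toString(name)));
      }
      table.close();
      conn.close();
      return out.iterator();
    });
  }
}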

Re: Spark and HBase RDD join/get

2016-01-14 Thread Kristoffer Sjögren
/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseRDDFunctions.scala > > Cheers > > On Thu, Jan 14, 2016 at 5:04 AM, Kristoffer Sjögren <sto...@gmail.com> > wrote: >> >> Hi >> >> We have a RDD that needs to be mapped with information from

Use TCP client for id lookup

2016-01-12 Thread Kristoffer Sjögren
Hi I'm trying to understand how to look up certain id fields of RDDs in an external mapping table. The table is accessed through a two-way binary TCP client where an id is provided and an entry returned. Entries cannot be listed/scanned. What's the simplest way of managing the TCP client and its
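One common pattern is to open the client inside mapPartitions so there is one connection per partition rather than per record. A minimal Java sketch follows; the host, port, and the request/response framing are hypothetical, since the thread does not describe the actual wire protocol (Spark 2.x Java API assumed).

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public class TcpLookup {
  static JavaRDD<String> resolve(JavaRDD<Long> ids) {
    return ids.mapPartitions(part -> {
      List<String> out = new ArrayList<>();
      // One socket per partition; the client is created inside the task
      // because it cannot be serialized and shipped from the driver.
      try (Socket socket = new Socket("lookup-host", 9999)) {   // hypothetical endpoint
        DataOutputStream req = new DataOutputStream(socket.getOutputStream());
        DataInputStream resp = new DataInputStream(socket.getInputStream());
        while (part.hasNext()) {
          req.writeLong(part.next());   // hypothetical request framing
          req.flush();
          out.add(resp.readUTF());      // hypothetical response framing
        }
      }
      return out.iterator();
    });
  }
}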

Java 8 lambdas

2015-08-18 Thread Kristoffer Sjögren
Hi Is there a way to execute Spark jobs with Java 8 lambdas instead of using anonymous inner classes as seen in the examples? I think I remember seeing real lambdas in the examples before, and in articles [1]. Cheers, -Kristoffer [1]
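For reference, the Java API accepts Java 8 lambdas wherever it takes one of the org.apache.spark.api.java.function interfaces, since those are single-method interfaces. A small sketch contrasting the two styles (the word-length example itself is made up).

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class LambdaExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("LambdaExample").setMaster("local[2]"));
    JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "java", "lambda"));

    // Java 8 lambda:
    JavaRDD<Integer> lengths = words.map(w -> w.length());

    // Equivalent anonymous inner class:
    JavaRDD<Integer> lengths2 = words.map(new Function<String, Integer>() {
      @Override
      public Integer call(String w) {
        return w.length();
      }
    });

    System.out.println(lengths.collect() + " " + lengths2.collect());
    sc.stop();
  }
}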

Job aborted due to stage failure: Master removed our application: FAILED

2014-08-21 Thread Kristoffer Sjögren
Hi I have trouble executing a really simple Java job on Spark 1.0.0-cdh5.1.0 that runs inside a Docker container: SparkConf sparkConf = new SparkConf().setAppName("TestApplication").setMaster("spark://localhost:7077"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); JavaRDD<String> lines =

Re: Spark and Java 8

2014-05-07 Thread Kristoffer Sjögren
under CDH5. They could use Mesos or the standalone scheduler to run them. On Tue, May 6, 2014 at 6:16 AM, Kristoffer Sjögren <sto...@gmail.com> wrote: Hi I just read an article [1] about Spark, CDH5 and Java 8 but did not get exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark

Spark and Java 8

2014-05-06 Thread Kristoffer Sjögren
Hi I just read an article [1] about Spark, CDH5 and Java 8 but did not get exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark using a separate JVM that runs on the data nodes, or is it reusing the YARN JVM runtime somehow, like Hadoop 1? CDH5 only supports Java 7 [2] as far as I