dropping unused data from a stream

2019-01-22 Thread Paul Tremblay
I will be streaming data and am trying to understand how to get rid of old data from a stream so it does not become too large. I will stream in one large table of buying data and join it to another table of different data. I need only the last 14 days from the second table. I will not need data that
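A minimal sketch of the event-time watermark mechanism in Structured Streaming that handles this kind of state eviction; the schema, column names, and paths below are illustrative placeholders, not taken from the thread, and a stream-stream join would carry a watermark on each side:

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("purchase_stream").getOrCreate()

# Hypothetical schema and path for the streamed buying data.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

purchases = (spark.readStream
             .schema(schema)
             .json("s3a://example-bucket/purchases/"))

# The watermark tells Spark it may discard state older than 14 days,
# so the join/aggregation buffers do not grow without bound.
recent = purchases.withWatermark("event_time", "14 days")
```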

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Paul Tremblay
I would like to see the full error. However, S3 can give misleading messages if you don't have the correct permissions. On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote: > Hi all, > I am using the following code for persisting data into S3 (AWS keys are > already stored
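For reference, a bare-bones sketch of writing a DataFrame to S3 from PySpark; the bucket, path, and credential handling are placeholders and not the code from the thread, which the snippet cuts off:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_write_check").getOrCreate()

# Credentials are usually better supplied through an IAM role or
# core-site.xml; the property names below are the standard s3a ones
# if they must be set in code.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

df = spark.range(10)
df.write.mode("overwrite").parquet("s3a://example-bucket/output/")
```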

splitting a huge file

2017-04-21 Thread Paul Tremblay
We are tasked with loading a big file (possibly 2TB) into a data warehouse. In order to do this efficiently, we need to split the file into smaller files. I don't believe there is a way to do this with Spark, because in order for Spark to distribute the file to the worker nodes, it first has to
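If the input is in a splittable format (plain text rather than a single gzip archive), Spark itself can do the splitting; a sketch, with placeholder paths and partition counts:

```
from pyspark import SparkContext

sc = SparkContext(appName="split_big_file")

# This only works if the input is splittable (plain text, bzip2, etc.);
# a single 2TB gzip file cannot be split and would have to be decompressed
# or re-chunked outside Spark first. saveAsTextFile then writes one part
# file per partition, which is effectively the split.
lines = sc.textFile("s3a://example-bucket/big_input.txt", minPartitions=2000)
lines.saveAsTextFile("s3a://example-bucket/split_output/")
```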

small job runs out of memory using wholeTextFiles

2017-04-07 Thread Paul Tremblay
As part of my processing, I have the following code: rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10) rdd.count() The S3 directory has about 8GB of data and 61,878 files. I am using Spark 2.1, running on EMR with 15 m3.xlarge nodes. The job fails with this error:
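For comparison, a sketch of the same read with a much higher partition count; the count below is only an illustrative guess, not a value from the thread:

```
from pyspark import SparkContext

sc = SparkContext(appName="noaa_whole_text")

# Same bucket prefix as in the post. wholeTextFiles materializes each file
# as one (path, content) pair, so a low partition count concentrates many
# files on a few executors; more partitions (or sc.textFile, when line-level
# processing is enough) keeps each task's memory footprint small.
rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", minPartitions=1000)
print(rdd.count())
```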

Re: bug with PYTHONHASHSEED

2017-04-05 Thread Paul Tremblay
Apr 4, 2017 at 7:49 AM Eike von Seggern <eike.seggern@sevenval.com> wrote: >> 2017-04-01 21:54 GMT+02:00 Paul Tremblay <paulhtremb...@gmail.com>: >> When I try to do a groupByKey() in my spark environment, I get the >> error described here

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Paul Tremblay
So that means I have to pass that bash variable to the EMR clusters when I spin them up, not afterwards. I'll give that a go. Thanks! Henry On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern <eike.segg...@sevenval.com> wrote: > 2017-04-01 21:54 GMT+02:00 Paul Tremblay <paulhtremb.
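A commonly used alternative to baking the variable into the cluster at spin-up time (not necessarily the fix this thread settled on) is to let Spark export it to the executors itself:

```
from pyspark import SparkConf, SparkContext

# spark.executorEnv.<NAME> sets an environment variable in every executor,
# so the Python workers all hash strings with the same seed.
conf = (SparkConf()
        .setAppName("hashseed_example")
        .set("spark.executorEnv.PYTHONHASHSEED", "0"))
sc = SparkContext(conf=conf)

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(rdd.groupByKey().mapValues(list).collect())
```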

Re: Alternatives for dataframe collectAsList()

2017-04-03 Thread Paul Tremblay
What do you want to do with the results of the query? Henry On Wed, Mar 29, 2017 at 12:00 PM, szep.laszlo.it wrote: > Hi, > > after I created a dataset > > Dataset df = sqlContext.sql("query"); > > I need the result values, so I call the method collectAsList() >
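Two common alternatives to collectAsList(), sketched below with a placeholder DataFrame standing in for the thread's sqlContext.sql("query"):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterate_results").getOrCreate()

# Placeholder for the original query result.
df = spark.range(1000000)

# Instead of collectAsList()/collect(), which pulls every row onto the
# driver at once, iterate the result partition by partition...
for row in df.toLocalIterator():
    pass  # process one row at a time

# ...or avoid the driver entirely and write the result to storage.
df.write.mode("overwrite").parquet("s3a://example-bucket/query_output/")
```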

Re: Read file and represent rows as Vectors

2017-04-03 Thread Paul Tremblay
So if I am understanding your problem, you have the data in CSV files, but the CSV files are gzipped? If so, Spark can read a gzipped file directly. Sorry if I didn't understand your question. Henry On Mon, Apr 3, 2017 at 5:05 AM, Old-School wrote: > I have a
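A sketch of reading gzipped CSVs directly; the path and options are placeholders:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_gzipped_csv").getOrCreate()

# Spark decompresses .gz input transparently based on the file extension,
# so gzipped CSVs can be read as-is; note that each .gz file becomes a
# single, non-splittable partition.
df = spark.read.csv("s3a://example-bucket/data/*.csv.gz",
                    header=True, inferSchema=True)
df.show(5)
```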

Re: Looking at EMR Logs

2017-04-02 Thread Paul Tremblay
for spark logs) and run the history server like: > ``` > cd /usr/local/src/spark-1.6.1-bin-hadoop2.6 > sbin/start-history-server.sh > ``` > and then open http://localhost:18080 > On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay <paulhtremb...@gmail.com> > wrote:

bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
When I try to do a groupByKey() in my spark environment, I get the error described here: http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh In order to attempt to fix the problem, I set up my ipython environment with

pyspark bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
When I try to do a groupByKey() in my spark environment, I get the error described here: http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh In order to attempt to fix the problem, I set up my ipython environment with the

Looking at EMR Logs

2017-03-30 Thread Paul Tremblay
I am looking for tips on evaluating my Spark job after it has run. I know that I can look at the history of jobs through the web UI, and that I can see the resources currently in use through a similar web UI. However, I would like to look at the logs after the job is finished to
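One piece of this, sketched below with a placeholder log location: enable Spark event logging to a durable directory so the history server (see the reply further up) can show applications after they finish.

```
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Write Spark event logs somewhere durable; the history server is then
# pointed at the same directory to browse finished jobs.
conf = (SparkConf()
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "s3a://example-bucket/spark-event-logs/"))

spark = (SparkSession.builder
         .appName("logged_job")
         .config(conf=conf)
         .getOrCreate())
```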

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Paul Tremblay
://michaelryanbell.com/processing-whole-files-spark-s3.html Jon On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com> wrote: I've actually been able to trace the problem to the files being read in. If I change to a different

Re: Turning rows into columns

2017-02-11 Thread Paul Tremblay
On Feb 4, 2017 16:25, "Paul Tremblay" <paulhtremb...@gmail.com> wrote: I am using pyspark 2.1 and am wondering how to convert a flat file, with one record per row, into a columnar format. Here is an example of the data: u'

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory? On 02/06/2017 02:35 PM, Paul Tremblay wrote: When I try to create an rdd using wholeTextFiles, I get

wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using pyspark with spark 2.1. in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/ rdd =
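A side-by-side sketch of the two calls on the same prefix from the post; neither line is the exact code from the message:

```
from pyspark import SparkContext

sc = SparkContext(appName="warc_read")

# Same Common Crawl prefix as in the post (the snippet cuts the line off).
in_path = "s3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/"

# textFile reads the gzipped WARC files line by line; wholeTextFiles holds
# each decompressed file in memory as one (path, content) pair, which is the
# usual reason the latter fails on large inputs where the former succeeds.
lines = sc.textFile(in_path)
files = sc.wholeTextFiles(in_path)
```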

Turning rows into columns

2017-02-04 Thread Paul Tremblay
I am using pyspark 2.1 and am wondering how to convert a flat file, with one record per row, into a columnar format. Here is an example of the data: u'WARC/1.0', u'WARC-Type: warcinfo', u'WARC-Date: 2016-12-08T13:00:23Z', u'WARC-Record-ID: ', u'Content-Length: 344', u'Content-Type:
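A sketch of one way to pivot such "key: value" lines into columns for a single record; the sample lines are a toy stand-in for the quoted data, which would really come from sc.textFile or wholeTextFiles and contain many records:

```
from pyspark import SparkContext

sc = SparkContext(appName="rows_to_columns")

# Toy stand-in for the WARC header lines quoted above.
lines = sc.parallelize([
    u"WARC/1.0",
    u"WARC-Type: warcinfo",
    u"WARC-Date: 2016-12-08T13:00:23Z",
    u"Content-Length: 344",
])

# Split each "key: value" line into a (key, value) pair, then gather the
# pairs of one record into a single mapping.
pairs = (lines
         .filter(lambda line: u": " in line)
         .map(lambda line: tuple(line.split(u": ", 1))))
print(pairs.collectAsMap())  # {'WARC-Type': 'warcinfo', ...}
```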