ser @spark" <user@spark.apache.org>
Subject: Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError:
yield from walk(
> FYI, there is a PR and JIRA for virtualEnv support in PySpark
>
> https://issues.apache.org/jira/browse/SPARK-13587
> https://github.com/apache/spark/pull/1359
FYI
http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext
From: Andrew Davidson
Date: Wednesday, April 4, 2018 at 5:36 PM
To: "user @spark"
Subject: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError:
Hi Cesar
I have used Brandon's approach in the past without any problem
Andy
From: Brandon Geise
Date: Thursday, April 5, 2018 at 11:23 AM
To: Cesar , "user @spark"
Subject: Re: Union of multiple data frames
> Maybe
I am having a heck of a time setting up my development environment. I used
pip to install pyspark. I also downloaded spark from apache.
My eclipse pyDev interpreter is configured as a python3 virtualenv
I have a simple unit test that loads a small dataframe. df.show() generates
the following
I am having trouble setting up my python3 virtualenv.
I created a virtualenv 'spark-2.3.0' and installed pyspark using pip, however
I am not able to import pyspark.sql.functions. I get "unresolved import" when
I try to import col() and lit()
from pyspark.sql.functions import *
I found if I
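A hedged aside: pyspark.sql.functions creates col(), lit(), and friends
dynamically at import time, which is why static analyzers such as PyDev
report them as unresolved even though they work at runtime. Importing the
names explicitly usually quiets the IDE:

from pyspark.sql.functions import col, lit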
> +----+-----+
> |   a|    b|
> +----+-----+
> |john|  red|
> |john| blue|
> |john|  red|
> |bill| blue|
> |bill|  red|
> | sam|green|
> +----+-----+
>
>
> distData.as("tbl1").join(distData.as("tbl2"), Seq("a"),
> "fullouter&quo
I was a little sloppy when I created the sample output. It's missing a few
pairs
Assume for a given row I have [a, b, c] I want to create something like the
cartesian join
From: Andrew Davidson
Date: Friday, March 30, 2018 at 5:54 PM
To: "user @spark"
I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc"))
This produces a column of type array. Now for each row I want to create
multiple pairs/tuples from the array so that I can create a contingency
table. Any idea how I can transform my data so that I can call crosstab()? The
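A hedged sketch (not from the thread; df, "xyzy", and "abc" follow the
snippet above): build every pair from the collected array with a UDF,
explode the pairs, then call crosstab() on the two pair columns.

from itertools import combinations
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# array<struct<left,right>> return type for the pair-generating UDF
pairType = ArrayType(StructType([StructField("left", StringType()),
                                 StructField("right", StringType())]))
pairsUDF = F.udf(lambda xs: list(combinations(xs, 2)), pairType)

grouped = df.groupBy("xyzy").agg(F.collect_list("abc").alias("abcs"))
pairs = grouped.select(F.explode(pairsUDF("abcs")).alias("p"))
pairs.select("p.left", "p.right").stat.crosstab("left", "right").show()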
I am working on a deep learning project. Currently we do everything on a
single machine. I am trying to figure out how we might be able to move to a
clustered spark environment.
Clearly it's possible a machine or job on the cluster might fail so I assume
that the data needs to be replicated to
I am starting a new deep learning project. Currently we do all of our work
on a single machine using a combination of Keras and TensorFlow.
https://databricks.github.io/spark-deep-learning/site/index.html looks very
promising. Any idea how performance is likely to improve as I add machines
to my
Spark-ec2 used to be part of the spark distribution. It now seems to be
split into a separate repo https://github.com/amplab/spark-ec2
It does not seem to be listed on https://spark-packages.org/
Does anyone know what the status is? There is a readme.md, however I am
unable to find any release
pyspark unable to create UDF: java.lang.RuntimeException:
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a
directory: /tmp tmp
> Do you have a file called tmp at / on HDFS?
>
>
>
>
>
> On Thu, Aug 18, 2016 at 2:57 PM -0700, "Andy Davidson"
> <a...@santacru
For unknown reason I can not create UDF when I run the attached notebook on
my cluster. I get the following error
Py4JJavaError: An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException:
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path
Hi, I am using python3, Java 8 and spark-1.6.1. I am running my code in
Jupyter notebook
The following code runs fine on my mac
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

udfRetType = ArrayType(StringType(), True)
findEmojiUDF = udf(lambda s: re.findall(emojiPattern2, s), udfRetType)
retDF = (emojiSpecialDF
# convert into a
I am still working on spark-1.6.1. I am using java8.
Any idea why
(df.select("*", functions.lower("rawTag").alias("tag²))
would raise py4j.Py4JException: Method lower([class java.lang.String]) does
not exist?
Thanks in advance
Andy
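A hedged guess at the cause: in Spark 1.6 the python wrapper forwards the
argument straight to the JVM, and the Scala side only defines lower(Column),
hence "Method lower([class java.lang.String]) does not exist". Passing a
Column instead of a name should work:

from pyspark.sql import functions
df.select("*", functions.lower(df["rawTag"]).alias("tag")).show()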
jupyter data frame problem with auto completion and accessing documentation
> Hi Andy,
>
> On 1 August 2016 at 22:46, Andy Davidson <a...@santacruzintegration.com>
> wrote:
>> I wonder if the auto completion problem has to do with the way code is
>> typically written in
Hi Freedafeng
I have been reading and writing to s3 using spark-1.6.x without any
problems. Can you post a little code example and any error messages?
Andy
From: freedafeng
Date: Tuesday, August 2, 2016 at 9:26 AM
To: "user @spark"
Subject:
I started using python3 and jupyter in a chrome browser. I seem to be having
trouble with data frame code completion. Regular python functions seems to
work correctly.
I wonder if I need to import something so the notebook knows about data
frames?
Kind regards
Andy
For lack of a better solution I am using AWS 's3 copy' to copy my files
locally and hadoop fs -put ./tmp/* to transfer them. In general put works
much better with a smaller number of big files compared to a large number of
small files
Your mileage may vary
Andy
From: Andrew Davidson
> found this worked, but lacking in a few ways so I started this project:
>>> https://github.com/EntilZha/spark-s3
>>>
>>> This takes that idea further by:
>>> 1. Rather than sc.parallelize, implement the RDD interface where each
>>> partition is d
cket", "file1",
>> "folder2").regularRDDOperationsHere or import implicits and do
>> sc.s3.textFileByPrefix
>>
>> At present, I am battle testing and benchmarking it at my current job and
>> results are promising with significant improvements to
Hi Freedafeng
Can you tell us a little more? I.e., can you paste your code and error message?
Andy
From: freedafeng
Date: Thursday, July 28, 2016 at 9:21 AM
To: "user @spark"
Subject: Re: spark 1.6.0 read s3 files error.
> The question is, what
m.
>
> Since I hadn't intended to advertise this quite yet the documentation is not
> super polished but exists here:
> http://spark-s3.entilzha.io/latest/api/#io.entilzha.spark.s3.S3Context
>
> I am completing the sonatype process for publishing artifacts on maven central
> (t
I have a relatively small data set however it is split into many small JSON
files. Each file is between maybe 4K and 400K
This is probably a very common issue for anyone using spark streaming. My
streaming app works fine, however my batch application takes several hours
to run.
All I am doing
Hi Ascot
When you run in cluster mode it means your cluster manager will cause your
driver to execute on one of the workers in your cluster.
The advantage of this is you can log on to a machine in your cluster and
submit your application and then log out. The application will continue to
run.
I have a spark streaming app that saves JSON files to s3:// . It works fine
Now I need to calculate some basic summary stats and am running into
horrible performance problems.
I want to run a test to see if reading from hdfs instead of s3 makes a
difference. I am able to quickly copy the data
I currently have to configure spark-1.x to use Java 8 and python 3.x. I
noticed that
http://spark.apache.org/releases/spark-release-2-0-0.html#removals mentions
Java 7 is deprecated.
Is the default now Java 8 ?
Thanks
Andy
Deprecations
The following features have been deprecated in Spark
Hi Freedafeng
The following works for me; df will be a data frame. fullPath is a list
of various part files stored in s3.
fullPath =
['s3n:///json/StreamingKafkaCollector/s1/2016-07-10/146817304/part-r
-0-a2121800-fa5b-44b1-a994-67795' ]
from pyspark.sql import SQLContext
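A hedged completion of the snippet (the archive truncates it here):
sqlContext.read.json() accepts a list of paths, so all the part files load
into a single data frame.

sqlContext = SQLContext(sc)
df = sqlContext.read.json(fullPath)  # fullPath is the list defined above
df.printSchema()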
Congratulations on releasing 2.0!
spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script. However
http://spark.apache.org/docs/latest/index.html has a link to the spark-ec2
github repo https://github.com/amplab/spark-ec2
Is this the right group to discuss spark-ec2?
Any idea how
Below is a very simple application. It runs very slowly. It does not look
like I am getting a lot of parallel execution. I imagine this is a very
common workflow. Periodically I want to run some standard summary statistics
across several different data sets.
Any suggestions would be greatly
Yup in cluster mode you need to figure out what machine the driver is
running on. That is the machine the UI will run on
https://issues.apache.org/jira/browse/SPARK-15829
You may also have firewall issues
Here are some notes I made about how to figure out what machine the driver
is running on
Hi Kevin
Just a heads up: at the recent Spark Summit in S.F. there was a presentation
about streaming in 2.0. They said that streaming was not going to be
production ready in 2.0.
I am not sure if the older 1.6.x version will be supported. My project will
not be able to upgrade with streaming
ead "dispatcher-event-loop-1"
java.lang.OutOfMemoryError: Java heap space
> How much heap memory do you give the driver ?
>
> On Fri, Jul 22, 2016 at 2:17 PM, Andy Davidson <a...@santacruzintegration.com>
> wrote:
>> Given I get a stack trace in my python notebook I am
Given I get a stack trace in my python notebook I am guessing the driver is
running out of memory?
My app is simple it creates a list of dataFrames from s3://, and counts each
one. I would not think this would take a lot of driver memory
I am not running my code locally. It's using 12 cores. Each
in our cluster (1.5.0) i can't get this.
>
> do you have any solution.
>
> Thanks
>
> 2016-07-21 18:44 GMT+02:00 Andy Davidson <a...@santacruzintegration.com>:
>> Hi Pseudo
>>
>> Plotting, graphing, data visualization, report generation are common ne
local:6066 (cluster mode)
>
> Thanks
> Saisai
>
> On Fri, Jul 22, 2016 at 6:44 AM, Andy Davidson <a...@santacruzintegration.com>
> wrote:
>> I have some very long lived streaming apps. They have been running for
>> several months. I wonder if something has change
I have some very long lived streaming apps. They have been running for
several months. I wonder if something has changed recently? I first started
working with spark-1.3. I am using the standalone cluster manager. The way
I would submit my app to run in cluster mode was port 6066
Looking at
Hi Pseudo
Plotting, graphing, data visualization, report generation are common needs
in scientific and enterprise computing.
Can you tell me more about your use case? What is it about the current
process / workflow that you think could be improved by pushing plotting (I
assume you mean plotting
Hi Divya
In general you will get better performance if you can minimize your use of
UDFs. Spark 2.0 / Tungsten does a lot of code generation. It will have to
treat your UDF as a black box.
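A hedged illustration of the advice, with a made-up df and column name: the
built-in upper() can be code-generated, while the equivalent python UDF is
opaque to the optimizer and also pays python serialization costs.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

upperUDF = F.udf(lambda s: s.upper(), StringType())
df.withColumn("u", upperUDF(df["name"]))  # black box to Catalyst/Tungsten
df.withColumn("u", F.upper(df["name"]))   # code-generated built-in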
Andy
From: Rishabh Bhardwaj
Date: Wednesday, July 20, 2016 at 4:22 AM
To: Rabin
Hi Everett
I always do my initial data exploration and all our product development in
my local dev env. I typically select a small data set and copy it to my
local machine
My main() has an optional command line argument '--runLocal'. Normally I
load data from either hdfs:/// or s3n://. If the
Hi Hassan
Typically I log on to my master to submit my app.
[ec2-user@ip-172-31-11-222 bin]$ echo $SPARK_ROOT
/root/spark
[ec2-user@ip-172-31-11-222 bin]$ echo $MASTER_URL
spark://ec2-54-215-11-222.us-west-1.compute.amazonaws.com:7077
[ec2-user@ip-172-31-11-222 bin]$
I created a cluster using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script.
The output shows ganglia started, however I am not able to access
http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:5080/ganglia. I
have tried using the private ip from within my data center.
I do not see anything listening
Hi Pradeep
I can not comment about experimental or production, however I recently
started a POC using direct approach.
It's been running off and on for about 2 weeks. In general it seems to work
really well. One thing that is not clear to me is how the cursor is managed.
E.g., I have my topic set
I am running spark-1.6.1 and the standalone cluster manager. I am running
into performance problems with spark streaming and added some extra metrics
to my log files. I submit my app in cluster mode (i.e. the driver runs on a
slave, not the master).
I am not able to get the driver log files while
I am running into serious performance problems with my spark 1.6 streaming
app. As it runs it gets slower and slower.
My app is simple.
* It receives fairly large and complex JSON files. (twitter data)
* Converts the RDD to DataFrame
* Splits the data frame in to maybe 20 different data sets
*
My streaming app is running into problems. It seems to slow down over time.
How can I interpret the storage memory column? I wonder if I have a GC problem?
Any idea how I can get GC stats?
Thanks
Andy
Executors (3)
* Memory: 9.4 GB Used (1533.4 MB Total)
* Disk: 0.0 B Used
Executor ID | Address | RDD
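A hedged sketch for the GC question, following the Spark tuning guide: turn
on verbose GC logging on the executors, then read the GC lines in each
executor's stdout/stderr from the application UI.

from pyspark import SparkConf
conf = (SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"))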
> -----Original Message-----
> From: Cody Koeninger [mailto:c...@koeninger.org]
> Sent: 08 July 2016 15:31
> To: Andy Davidson <a...@santacruzintegration.com>
> Cc: user @spark <user@spark.apache.org>
> Subject: Re: is dataframe.write() async? Streaming performance prob
Kafka has an interesting model that might be applicable.
You can think of kafka as enabling a queue system. Writes are called
producers, and readers are called consumers. The server is called a broker.
A "topic" is like a named queue.
Producers are independent. They can write to a "topic" at
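A hedged illustration of that model using the kafka-python client (an
assumption; the thread names no client library, and "broker:9092" and the
"tweets" topic are made up):

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("tweets", b"hello")  # producers write to a named topic
producer.flush()

consumer = KafkaConsumer("tweets", bootstrap_servers="broker:9092",
                         group_id="summary-stats")
for msg in consumer:               # consumers read independently
    print(msg.value)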
I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using kafka
direct stream approach. I am running into performance problems. My
processing time is greater than my window size. Changing window sizes, adding
cores and executor memory does not change performance. I am having a lot of
:38 PM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: strange behavior when I chain data frame transformations
> In the structure shown, tag is under element.
>
> I wonder if that was a factor.
>
> On F
I am using spark-1.6.1.
I create a data frame from a very complicated JSON file. I would assume that
the query planner would treat both versions of my transformation chain the
same way.
// org.apache.spark.sql.AnalysisException: Cannot resolve column name "tag"
among (actor, body, generator, pip,
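A hedged sketch of the fix the reply hints at: a nested field has to be
addressed through its parent, so the bare name raises the AnalysisException
above. The real nesting in this thread is unknown; "body" is illustrative.

df.select("body.tag")         # dot notation reaches nested fields
df.select(df["body"]["tag"])  # equivalent Column form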
I have a streaming app that receives very complicated JSON (twitter status).
I would like to work with it as a hash map. It would be very difficult to
define a pojo for this JSON. (I can not use twitter4j)
// map json string to map
JavaRDD<Map<String, Object>> jsonMapRDD =
Hi Tobias
I am very interested in implementing a REST-based API on top of spark. My
REST-based system would make predictions from data provided in the request
using models trained in batch. My SLA is 250 ms.
Would you mind sharing how you implemented your rest server?
I am using spark-1.6.1. I have
I started using
http://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#fp-growth
in python. It was really easy to get the frequent item sets.
Unfortunately associations are not implemented in python.
Here is my python code It works great
rawJsonRDD = jsonToPythonDictionaries(sc,
Someone recently asked me for a code example of how to write a custom
pipeline transformer in Java
Enjoy, Share
Andy
https://github.com/AEDWIP/Spark-Naive-Bayes-text-classification/blob/260a6b9b67d7da42c1d0f767417627da342c8a49/src/test/java/com/santacruzintegration/spa
Hi Ashok
In general if I was starting a new project and had not invested heavily in
hadoop (i.e. had a large staff that was trained on hadoop, had a lot of
existing projects implemented on hadoop) I would probably start using
spark. It's faster and easier to use
Your mileage may vary
Andy
+1
From: Matei Zaharia
Date: Tuesday, April 5, 2016 at 4:58 PM
To: Xiangrui Meng
Cc: Shivaram Venkataraman , Sean Owen
, Xiangrui Meng , dev
, "user @spark"
>
>
>
>
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>
>
>
> On 5 April 2016 at 23:59, Andy Davidson <a...@santacruzintegration.com> wrote:
>> In my experience my streaming I was getting tens of thousands of emp
>
>
>
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>
>
>
> On 5 April 2016 at 23:35, Andy Davidson <a...@santacruzintegration.com> wrote:
>> Hi Mich
>>
>> Yup I was surprised to find empty files. Its easy to wor
Hi Mich
Yup I was surprised to find empty files. It's easy to work around. Note I
should probably use coalesce() and not repartition()
In general I found I almost always need to repartition. I was getting
thousands of empty partitions. It was really slowing my system down.
private static void
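A hedged pyspark sketch of that workaround (the path is illustrative):
collapse the mostly-empty partitions before writing; coalesce() avoids the
full shuffle that repartition() triggers.

df.coalesce(12).write.mode("overwrite").json("hdfs:///tweets/summary")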
Hi Marco
You might consider setting up some sort of ELT pipeline. One of your stages
might be to create a file of all the FTP URLs. You could then write a spark
app that just fetches the urls and stores the data in some sort of database
or on the file system (hdfs?)
My guess would be to maybe
:617)
... 1 more
From: Jeff Zhang <zjf...@gmail.com>
Date: Tuesday, March 29, 2016 at 10:34 PM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: pyspark unable to convert dataframe column to a vector: Unabl
I have a requirement to write my results out into a series of CSV files. No
file may have more than 100 rows of data. In the past my data was not
sorted, and I was able to use repartition() or coalesce() to ensure the file
length requirement.
I realize that repartition() causes the data to be
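A hedged sketch of one approach (df and the output path are assumptions):
derive the partition count from the row count so each partition holds at
most about 100 rows. repartition() only distributes rows roughly evenly, so
treat the cap as approximate and verify the output.

import math

numParts = max(1, int(math.ceil(df.count() / 100.0)))
(df.repartition(numParts)
   .write.format("com.databricks.spark.csv")  # spark-csv package
   .save("hdfs:///out/csv"))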
I am using pyspark 1.6.1 and python3
Any idea what my bug is? Clearly the indices are being sorted?
Could it be that numDimensions = 713912692155621377 and my indices are longs,
not ints?
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import VectorUDT
#sv1
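A hedged note: MLlib vectors use 32-bit int indices, so a size like
713912692155621377 cannot be represented. One common workaround (an
assumption, not from the thread) is to hash or mod the long ids down into
an int-sized space first.

from pyspark.mllib.linalg import Vectors

numDimensions = 1 << 20  # int-sized space, chosen for illustration
idx = sorted({i % numDimensions for i in [713912692155621376]})
sv = Vectors.sparse(numDimensions, idx, [1.0] * len(idx))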
>> 713912692155621376
>>>
>>> From: Alexander Krasnukhin <the.malk...@gmail.com>
>>> Date: Monday, March 28, 2016 at 5:55 PM
>>> To: Andrew Davidson <a...@santacruzintegration.com>
>>> Cc: "user @spark" <user@spa
ct(max(col("foo"))).show()
>
> On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson <a...@santacruzintegration.com>
> wrote:
>> I am using pyspark 1.6.1 and python3.
>>
>>
>> Given:
>>
>> idDF2 = idDF.select(idDF.id, idDF.col.id <http
Hi Fanoos
I would be careful about using collect(). You need to make sure your local
computer has enough memory to hold your entire data set.
Eventually I will need to do something similar. I have not written the code
yet. My plan is to load the data into a data frame and then write a UDF that
I am using pyspark spark-1.6.1-bin-hadoop2.6 and python3. I have a data
frame with a column I need to convert to a sparse vector. I get an exception
Any idea what my bug is?
Kind regards
Andy
Py4JJavaError: An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.
:
I am using pyspark 1.6.1 and python3.
Given:
idDF2 = idDF.select(idDF.id, idDF.col.id )
idDF2.printSchema()
idDF2.show()
root
|-- id: string (nullable = true)
|-- col[id]: long (nullable = true)
+----------+---------+
|        id|  col[id]|
+----------+---------+
|1008930924|534494917|
Hi Russell
I use Jupyter python notebooks a lot. Here is how I start the server
set -x # turn debugging on
#set +x # turn debugging off
# https://github.com/databricks/spark-csv
# http://spark-packages.org/package/datastax/spark-cassandra-connector
<user@spark.apache.org>
Subject: Re: pyspark sql convert long to timestamp?
> Have a look at the from_unixtime() functions.
> https://spark.apache.org/docs/1.5.0/api/python/_modules/pyspark/sql/functions.html#from_unixtime
>
> Thanks
> Best Regards
>
I have a col in a data frame that is of type long. Any idea how I create a
column whose type is timestamp?
The long is unix epoch in ms
Thanks
Andy
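A hedged sketch ("epoch_ms" is an assumed column name): from_unixtime()
expects seconds, so divide the millisecond value by 1000, then cast the
result to timestamp.

from pyspark.sql.functions import col, from_unixtime

df2 = df.withColumn(
    "ts", from_unixtime(col("epoch_ms") / 1000).cast("timestamp"))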
Here is a nice analysis of the issue from the Cassandra mail list. (Datastax
is the Databricks for Cassandra)
Should I file a bug?
Kind regards
Andy
http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date
and this one
On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer
Hi Davies
>
> What's the type of `created`? TimestampType?
The 'created' column in cassandra is a timestamp
https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html
In the spark data frame it is a timestamp
>
> If yes, when created is compared to a string, it will be
For completeness. Clearly spark sql returned a different data set
In [4]:
rawDF.selectExpr("count(row_key) as num_samples",
"sum(count) as total",
"max(count) as max ").show()
+-----------+-----+---+
|num_samples|total|max|
I am using pyspark 1.6.0 and
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
data
The data is originally captured by a spark streaming app and written to
Cassandra. The value of the timestamp comes from
Rdd.foreachRDD(new VoidFunction2()
We are considering deploying a notebook server for use by two kinds of users
1. Interactive dashboard
   1. i.e. forms allow users to select data sets and visualizations
   2. review real time graphs of data captured by our spark streams
2. General notebooks for data scientists
My concern is
I am using python spark 1.6 and the --packages
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10
I need to convert a timestamp string into a unix epoch timestamp. The
unix_timestamp() function assumes the current time zone. However my string
data is UTC and encodes the time zone as zero.
Thanks
Andy
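A hedged workaround (not from the thread): in Spark 1.x unix_timestamp()
parses in the JVM's time zone, so one blunt fix is to run the driver and
executors in UTC; a session-level time zone setting only arrived in later
releases.

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.driver.extraJavaOptions", "-Duser.timezone=UTC")
        .set("spark.executor.extraJavaOptions", "-Duser.timezone=UTC"))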
ls files.
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
>> On Mar 15, 2016, at 11:45 AM, Andy Davidson <a...@santacruzintegration.com>
>> wrote:
>>
>> We use the spark-ec2 script to create AWS clusters as needed (we
We use the spark-ec2 script to create AWS clusters as needed (we do not use
AWS EMR)
1. will we get better performance if we copy data to HDFS before we run
instead of reading directly from S3?
2. What is a good way to move results from HDFS to S3?
It seems like there are many ways to bulk copy
> Date: Wednesday, March 9, 2016 at 3:28 PM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: trouble with NUMPY constructor in UDF
>
>> bq. epoch2numUDF = udf(foo, FloatType())
>>
>
"user @spark" <user@spark.apache.org>
Subject: Re: trouble with NUMPY constructor in UDF
> bq. epoch2numUDF = udf(foo, FloatType())
>
> Is it possible that return value from foo is not FloatType ?
>
> On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson <a...@santacruz
In my experience I would try the following
I use the standalone cluster manager. Each app gets its own performance web
page. The streaming tab is really helpful. If processing time is greater
than your mini-batch length you are going to run into performance problems.
Use the "stages" tab to
t settings separately
> spark.cassandra.connection.host
> spark.cassandra.connection.port
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/referenc
> e.md#cassandra-connection-parameters
>
> Looking at the logs, it seems your port config is not being set and it'
onnector java.io.IOException: Failed
to open native connection to Cassandra at {192.168.1.126}:9042
> Have you contacted spark-cassandra-connector related mailing list ?
>
> I wonder where the port 9042 came from.
>
> Cheers
>
> On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson
I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python notebook
that reads a data frame from Cassandra.
I connect to cassandra using an ssh tunnel running on port 9043. CQLSH works,
however I can not figure out how to configure my notebook. I have tried
various hacks; any idea what I
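A hedged sketch, using the separate host and port settings quoted earlier
in this archive (spark.cassandra.connection.host/port): point the connector
at the ssh tunnel before creating the context.

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.cassandra.connection.host", "127.0.0.1")
        .set("spark.cassandra.connection.port", "9043"))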
http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications
Gives a brief discussion about max rate and back pressure
It's not clear to me what will happen. I use an unreliable receiver. Let's say
my app is running and processing time is less than the window length. Happy
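A hedged sketch of the two knobs that guide section covers (the rate value
is illustrative): cap the receiver rate and enable back pressure so the
ingest rate adapts when processing falls behind.

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.receiver.maxRate", "1000"))  # records/sec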
One of the challenges we need to prepare for with streaming apps is bursty
data. Typically we need to estimate our worst case data load and make sure
we have enough capacity
It's not obvious what best practices are with spark streaming.
* we have implemented checkpointing as described in the
We just deployed our first streaming apps. The next step is making sure they
run reliably
We have spent a lot of time reading the various programming guides and
looking at the standalone cluster manager app performance web pages.
Looking at the streaming tab and the stages tab has been the most
would access my
> bucket this way s3n://bucketname/foldername.
>
> You can test privileges using the s3 cmd line client.
>
> Also, if you are using instance profiles you don't need to specify access and
> secret keys. No harm in specifying though.
>
>
Currently our stream apps write results to hdfs. We are running into
problems with HDFS becoming corrupted and running out of space. It seems
like a better solution might be to write directly to S3. Is this a good
idea?
We plan to continue to write our checkpoints to hdfs
Are there any issues to
I do not have any hadoop legacy code. My goal is to run spark on top of
HDFS.
Recently I have been having hdfs corruption problems. I was also never able to
access S3 even though I used --copy-aws-credentials. I noticed that by
default the spark-ec2 script uses hadoop 1.0.4. I ran help and
Hi Michael
From: Michael Armbrust
Date: Saturday, February 13, 2016 at 9:31 PM
To: Koert Kuipers
Cc: "user @spark"
Subject: Re: GroupedDataset needs a mapValues
> Instead of grouping with a lambda function, you can do it
> or applications errors, but I think it is general dev ops work not
> very specific to spark or hadoop.
>
> BR,
>
> Arkadiusz Bicz
> https://www.linkedin.com/in/arkadiuszbicz
>
> On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson
> <a...@santacruzintegration.com> wrot
"s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/
> <http://s3-us-west-1.amazonaws.com/com.pws.twitter/> json"
>
> not sure, but
> can you try to remove s3-us-west-1.amazonaws.com
> <http://s3-us-west-1.amazonaws.com/com.pws.twitter/> from path ?
>
> On 11 February
From: Mich Talebzadeh
Date: Thursday, February 11, 2016 at 2:30 PM
To: "user @spark"
Subject: Question on Spark architecture and DAG
> Hi,
>
> I have used Hive on Spark engine and of course Hive tables and its pretty
> impressive comparing Hive
I am trying to add a column with a constant value to my data frame. Any idea
what I am doing wrong?
Kind regards
Andy
DataFrame result =
String exprStr = "lit(" + time.milliseconds()+ ") as ms";
logger.warn("AEDWIP expr: {}", exprStr);
result.selectExpr("*", exprStr).show(false);
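A hedged diagnosis (not from the thread): lit() is a DataFrame-API function,
not a SQL expression, so selectExpr() cannot parse "lit(...) as ms". Either
embed the literal in the expression string or use withColumn with lit(). A
pyspark sketch of both, with ms standing in for time.milliseconds():

from pyspark.sql.functions import lit

result = df.selectExpr("*", "%d as ms" % ms)  # literal inside the SQL string
result = df.withColumn("ms", lit(ms))         # DataFrame-API equivalent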
We recently started a Spark/Spark Streaming POC. We wrote a simple streaming
app in java to collect tweets. We chose twitter because we knew we would get
a lot of data and probably lots of bursts. Good for stress testing
We spun up a couple of small clusters using the spark-ec2 script. In one
cluster
I am using spark 1.6.0 in a cluster created using the spark-ec2 script. I am
using the standalone cluster manager
My java streaming app is not able to write to s3. It appears to be some form
of permission problem.
Any idea what the problem might be?
I tried using the IAM simulator to test the