Hi!
How can I tell the IO duration for a Spark application reading from and
writing to S3 (using S3 as a filesystem via sc.textFile("s3a://..."))?
I would like to know what percentage of the overall application execution
time is spent doing IO.
Gili.
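A rough way to approximate this, assuming the S3 read can be isolated into its
own action, is to time that first materialization and compare it to the total
wall-clock time; the bucket and prefix below are placeholders:

    import time
    from pyspark import SparkContext

    sc = SparkContext(appName="io-timing-sketch")
    app_start = time.time()

    read_start = time.time()
    # Force the S3 read once and cache the result; the rest of the job then
    # works against the cached copy, so this interval is roughly "read IO".
    lines = sc.textFile("s3a://my-bucket/my-prefix/*").cache()
    lines.count()
    read_seconds = time.time() - read_start

    # ... rest of the application runs against `lines` here ...

    total_seconds = time.time() - app_start
    print("read IO fraction: %.1f%%" % (100.0 * read_seconds / total_seconds))

This only captures the first read, so it is an approximation; the per-stage
timings in the Spark UI (or its REST API) give a finer-grained breakdown.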
In the end, I switched back to the LSH implementation that I used before (
https://github.com/karlhigley/spark-neighbors ). I can run it on my dataset
now. If anyone has any suggestions, please let me know.
Thanks.
2017-02-12 9:25 GMT+07:00 nguyen duc Tuan :
> Hi Timur,
> 1) Our data is transformed to datase
Hi,
In my Spark Streaming application I'm trying to partition a data stream into
multiple substreams. I read data from a Kafka producer and process the data
received in real time. The data is taken in through a JavaInputDStream created
as a direct stream. Data is received without any loss. The need is to partit
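A rough pyspark illustration of one way to split a direct stream into
substreams by message key (the original is on the Java API; the broker, topic,
and hash-based split below are all assumptions):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="substream-sketch")
    ssc = StreamingContext(sc, 5)

    # Direct stream of (key, value) pairs from Kafka; names are placeholders.
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker:9092"})

    num_substreams = 4

    # One filtered substream per bucket of the message key's hash.
    substreams = [
        stream.filter(lambda kv, i=i: hash(kv[0]) % num_substreams == i)
        for i in range(num_substreams)
    ]

    for sub in substreams:
        sub.count().pprint()

    ssc.start()
    ssc.awaitTermination()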
I mean that if you are running a script, instead of exiting with a code it
could print something out.
Sounds like checkCode is what you want, though.
_
From: Xuchen Yao <yaoxuc...@gmail.com>
Sent: Sunday, February 12, 2017 8:33 AM
Subject: Re: Getting exit code of pi
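A minimal pyspark sketch of the checkCode suggestion (the script path below is
a placeholder): rdd.pipe() already returns whatever the external command
writes to stdout as an RDD of strings, and checkCode=True fails the task if
the command exits non-zero.

    from pyspark import SparkContext

    sc = SparkContext(appName="pipe-exit-code-sketch")
    data = sc.parallelize(["a", "b", "c"], 2)

    # Each partition's elements are fed to the command's stdin; its stdout
    # lines come back as the resulting RDD. checkCode=True raises an error
    # (failing the job) on a non-zero exit code.
    out = data.pipe("/path/to/your_script.sh", checkCode=True)
    for line in out.collect():
        print(line)

So "capturing the status output" amounts to having the script print it to
stdout and collecting the resulting RDD.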
Hello,
I suspect that your need isn't parallel execution but parallel data access.
In that case, use Alluxio or Ignite.
Or, more exotically, one Spark job writes to Kafka and the other ones read
from Kafka.
Sincerely yours, Timur
On Sun, Feb 12, 2017 at 2:30 PM, Mendelson, Assaf
wrote:
> There is
Yup, I ended up doing just that. Thank you both.
On Sun, 12 Feb 2017 at 18:33, Miguel Morales
wrote:
> You can parallelize the collection of s3 keys and then pass that to your
> map function so that files are read in parallel.
>
> Sent from my iPhone
>
> On Feb 12, 2017, at 9:41 AM, Sam Elamin wrote:
You can parallelize the collection of s3 keys and then pass that to your map
function so that files are read in parallel.
Sent from my iPhone
> On Feb 12, 2017, at 9:41 AM, Sam Elamin wrote:
>
> Thanks Ayan, but I was hoping to remove the dependency on a file and just
> use an in-memory list or dictionary
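A rough pyspark sketch of that "parallelize the keys" approach, with the
bucket name, keys, and the use of boto3 inside the workers all being
assumptions:

    import boto3
    from pyspark import SparkContext

    sc = SparkContext(appName="parallel-s3-keys-sketch")

    # In-memory list of S3 keys -- no intermediate file needed.
    keys = ["raw/2017/02/12/part-0.json", "raw/2017/02/12/part-1.json"]

    def read_keys(key_iter):
        # One S3 client per partition, reused for every key in it.
        s3 = boto3.client("s3")
        for key in key_iter:
            body = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
            yield key, body

    contents = sc.parallelize(keys, len(keys)).mapPartitions(read_keys)
    print(contents.keys().collect())

Each key is fetched by its own task, so the reads happen in parallel across
the executors.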
Thanks Ayan, but I was hoping to remove the dependency on a file and just
use an in-memory list or dictionary.
So from the reading I've done today, it seems the concept of a bespoke async
method doesn't really apply in Spark, since the cluster deals with
distributing the workload.
Am I mistaken?
Regards
Hi,
I have multiple Hive configurations (hive-site.xml), and because of that I am
not able to add any one Hive configuration to the Spark *conf* directory. I
want to supply this configuration file at the start of any *spark-submit* or
*spark-shell*. This conf file is huge, so *--conf* is not an option for me.
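Two common workarounds, sketched here with placeholder paths (whether --files
alone is enough depends on the cluster manager and deploy mode, so treat this
as a starting point rather than a verified recipe):

    # Option 1: keep each hive-site.xml in its own conf directory and point
    # SPARK_CONF_DIR at the one you want for this run
    export SPARK_CONF_DIR=/etc/spark/conf-cluster-a
    spark-shell

    # Option 2: ship a specific hive-site.xml with the job
    spark-submit --files /path/to/cluster-a/hive-site.xml \
        --class com.example.MyApp my-app.jar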
Cool that's exactly what I was looking for! Thanks!
How does one output the status into stdout? I mean, how does one capture
the status output of the pipe() command?
On Sat, Feb 11, 2017 at 9:50 AM, Felix Cheung
wrote:
> Do you want the job to fail if there is an error exit code?
>
> You could set
There are no threads within maps here. The idea is to have two jobs on two
different threads which use the same dataframe (which is cached, by the way).
This does not override Spark's parallel execution of transformations or
anything of the sort. The documentation (job scheduling) actually hints at
this option but do
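A minimal sketch of that pattern, assuming a plain local SparkSession; the
data and the work done in each thread are made up:

    import threading
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("two-jobs-one-df-sketch").getOrCreate()

    df = spark.range(0, 1000000).cache()
    df.count()  # materialize the cache once up front

    results = {}

    def job_a():
        results["a"] = df.filter("id % 2 == 0").count()

    def job_b():
        results["b"] = df.selectExpr("sum(id)").collect()[0][0]

    # Two actions on the same cached DataFrame, submitted from two threads;
    # Spark schedules them as two concurrent jobs.
    threads = [threading.Thread(target=job_a), threading.Thread(target=job_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(results)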
Cf. also https://spark.apache.org/docs/latest/job-scheduling.html
> On 12 Feb 2017, at 11:30, Jörn Franke wrote:
>
> I think you should have a look at the Spark documentation. It has something
> called a scheduler which does exactly this. In more sophisticated
> environments, YARN or Mesos do this
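A small sketch of the scheduler configuration that page describes; the pool
names here are arbitrary:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fair-scheduler-sketch")
             .config("spark.scheduler.mode", "FAIR")
             .getOrCreate())

    # Jobs inherit whatever pool is set as a thread-local property on the
    # SparkContext, so each worker thread can pick its own pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")
    spark.range(1000).count()                          # runs in pool "etl"

    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "reports")
    spark.range(1000).selectExpr("sum(id)").collect()  # runs in pool "reports"

The pools only matter when jobs are actually submitted concurrently from
different threads; run sequentially as above they simply land in the named
pools.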
I did not doubt that submitting several jobs from one application makes
sense. However, he wants to create threads within maps etc., which looks like
asking for issues (not only for running the application itself, but also for
operating it in production within a shared cluster). I would rel
DataFrame is immutable, so it should be thread safe, right?
On Sun, Feb 12, 2017 at 6:45 PM, Sean Owen wrote:
> No this use case is perfectly sensible. Yes it is thread safe.
>
>
> On Sun, Feb 12, 2017, 10:30 Jörn Franke wrote:
>
>> I think you should have a look at the spark documentation. It
You can store the list of keys (I believe you use them in the source file
paths, right?) in a file, one key per line. Then you can read the file using
sc.textFile (so you will get an RDD of file paths) and then apply your
function as a map.
r = sc.textFile(list_file).map(your_function)
HTH
On Sun, Feb
Hey folks
Really simple question here. I currently have an ETL pipeline that reads
from S3 and saves the data to an end store.
I have to read from a list of keys in S3, but I am doing a raw extract and
then saving. Only some of the extracts have a simple transformation, but
overall the code looks the sa
No this use case is perfectly sensible. Yes it is thread safe.
On Sun, Feb 12, 2017, 10:30 Jörn Franke wrote:
> I think you should have a look at the Spark documentation. It has
> something called a scheduler which does exactly this. In more sophisticated
> environments, YARN or Mesos do this for you
I think you should have a look at the Spark documentation. It has something
called a scheduler which does exactly this. In more sophisticated
environments, YARN or Mesos do this for you.
Using threads for transformations does not make sense.
> On 12 Feb 2017, at 09:50, Mendelson, Assaf wrote:
>
>
How about adding more NFS storage?
On Sun, 12 Feb 2017 at 8:14 pm, Sean Owen wrote:
> Data has to live somewhere -- how do you not add storage but store more
> data? Alluxio is not persistent storage, and S3 isn't on your premises.
>
> On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim wrote:
>
> Ha
Data has to live somewhere -- how do you not add storage but store more
data? Alluxio is not persistent storage, and S3 isn't on your premises.
On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim wrote:
> Has anyone got any advice on how to remove the reliance on HDFS for
> storing persistent data? W
You have to choose carefully whether your strategy makes sense given your
users' workloads. Hence, I am not sure your reasoning makes sense.
However, you can, for example, install OpenStack Swift as an object store and
use it as storage. HDFS in this case can be used as a temporary store and/or
I know Spark takes care of executing everything in a distributed manner;
however, Spark also supports having multiple threads on the same Spark
session/context and knows (through the fair scheduler) how to distribute the
tasks from them in a round-robin fashion.
The question is, can those two actions (with a d
I am not sure what you are trying to achieve here. Spark takes care of
executing the transformations in a distributed fashion. This means you should
not use threads for them - it does not make sense. Hence, you will not find
documentation about it.
> On 12 Feb 2017, at 09:06, Mendelson, Assaf wrote:
>
Hi,
I was wondering if a DataFrame is considered thread safe. I know the Spark
session and Spark context are thread safe (and actually have tools to manage
jobs from different threads), but the question is, can I use the same
DataFrame in both threads?
The idea would be to create a dataframe in the