How to measure IO time in Spark over S3

2017-02-12 Thread Gili Nachum
Hi! How can I tell the IO duration for a Spark application doing R/W against S3 (using S3 as a filesystem, e.g. sc.textFile("s3a://..."))? I would like to know what % of the overall app execution time is spent doing IO. Gili.
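
One crude way to get at this, assuming a PySpark shell with sc in scope and a made-up bucket path: isolate the S3 read in its own action and time it against the overall run.

    import time

    t0 = time.time()
    lines = sc.textFile("s3a://my-bucket/prefix/")  # hypothetical path
    n = lines.count()  # count() forces the full read from S3
    print("read took %.1fs" % (time.time() - t0))

This lumps deserialization and the count in with the IO; for a finer breakdown, the per-stage input metrics in the Spark UI are a better starting point.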

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-12 Thread nguyen duc Tuan
In the end, I switched back to the LSH implementation that I used before (https://github.com/karlhigley/spark-neighbors). I can run it on my dataset now. If anyone has any suggestions, please tell me. Thanks. 2017-02-12 9:25 GMT+07:00 nguyen duc Tuan : > Hi Timur, > 1) Our data
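
For anyone who wants to stay on the built-in implementation, here is a minimal sketch of random-projection LSH with toy data. Note the Scala API shipped in 2.1.0, but the PySpark wrapper for BucketedRandomProjectionLSH only arrived in 2.2, so this sketch assumes 2.2+.

    from pyspark.ml.feature import BucketedRandomProjectionLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, Vectors.dense([1.0, 1.0])),
         (1, Vectors.dense([1.0, -1.0])),
         (2, Vectors.dense([-1.0, -1.0]))],
        ["id", "features"])

    # fewer hash tables and larger buckets trade recall for memory
    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                      bucketLength=2.0, numHashTables=3)
    model = brp.fit(df)
    # all pairs within Euclidean distance 1.5 of each other
    model.approxSimilarityJoin(df, df, 1.5, distCol="dist").show()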

Repartition function duplicates data

2017-02-12 Thread F. Amara
Hi, In my Spark Streaming application I'm trying to partition a data stream into multiple substreams. I read data from a Kafka producer and process the received data in real time. The data is taken in through JavaInputDStream as a directStream. Data is received without any loss. The need is to
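
For reference, a bare-bones direct stream with a repartition step (topic, broker, and partition counts are made up; assumes the spark-streaming-kafka-0-8 package). repartition() shuffles each batch RDD, and every record lands in exactly one output partition, so the shuffle itself should not duplicate records; duplicates more commonly come from a replayed batch under at-least-once output semantics.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="substreams")
    ssc = StreamingContext(sc, 5)  # 5-second batches
    stream = KafkaUtils.createDirectStream(
        ssc, ["my-topic"], {"metadata.broker.list": "broker1:9092"})

    # reshuffle each batch into 4 partitions; records are moved, not copied
    parts = stream.repartition(4)
    parts.foreachRDD(lambda rdd: print(rdd.getNumPartitions()))

    ssc.start()
    ssc.awaitTermination()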

Re: Getting exit code of pipe()

2017-02-12 Thread Felix Cheung
I mean, if you are running a script, instead of exiting with a code it could print out something. Sounds like checkCode is what you want, though. _ From: Xuchen Yao > Sent: Sunday, February 12, 2017 8:33 AM Subject: Re:
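
A small PySpark sketch of both ideas, with made-up commands: checkCode fails the task on a nonzero exit status, and a shell wrapper can echo the status to stdout if you want to capture it instead.

    rdd = sc.parallelize(["apple", "banana"], 2)

    # raise an error if the piped command exits nonzero
    ok = rdd.pipe("grep a", checkCode=True).collect()

    # or wrap the command so its exit status is emitted on stdout
    out = rdd.pipe('sh -c "grep a; echo EXIT:$?"').collect()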

Re: is dataframe thread safe?

2017-02-12 Thread Timur Shenkao
Hello, I suspect that your need isn't parallel execution but parallel data access. In that case, use Alluxio or Ignite. Or, more exotically, have one Spark job write to Kafka and the others read from Kafka. Sincerely yours, Timur On Sun, Feb 12, 2017 at 2:30 PM, Mendelson, Assaf

Re: Etl with spark

2017-02-12 Thread Sam Elamin
Yup, I ended up doing just that. Thank you both. On Sun, 12 Feb 2017 at 18:33, Miguel Morales wrote: > You can parallelize the collection of s3 keys and then pass that to your > map function so that files are read in parallel. > > Sent from my iPhone > > On Feb 12, 2017, at

Re: Etl with spark

2017-02-12 Thread Miguel Morales
You can parallelize the collection of S3 keys and then pass that to your map function so that files are read in parallel, as sketched below. Sent from my iPhone > On Feb 12, 2017, at 9:41 AM, Sam Elamin wrote: > > thanks Ayan but i was hoping to remove the dependency on a file and just
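
As a sketch of that approach (the bucket, the keys, and the use of boto3 on the executors are all assumptions here):

    import boto3  # assumed to be available on the executors

    bucket = "my-bucket"                     # hypothetical
    keys = ["in/part-0001", "in/part-0002"]  # hypothetical keys

    def fetch(key):
        # runs on the executors: one S3 GET per key, in parallel
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
        return key, body.read().decode("utf-8")

    texts = sc.parallelize(keys, len(keys)).map(fetch)

Using one partition per key keeps each GET on its own task.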

Re: Etl with spark

2017-02-12 Thread Sam Elamin
Thanks Ayan, but I was hoping to remove the dependency on a file and just use an in-memory list or dictionary. From the reading I've done today, it seems the concept of a bespoke async method doesn't really apply in Spark, since the cluster deals with distributing the workload. Am I mistaken?

Add hive-site.xml at runtime

2017-02-12 Thread Shivam Sharma
Hi, I have multiple Hive configurations (hive-site.xml), and because of that I am not able to put any one Hive configuration in the Spark *conf* directory. I want to add this configuration file at the start of any *spark-submit* or *spark-shell*, as sketched below. This conf file is huge, so *--conf* is not an option for me.
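
Two commonly suggested ways to point a single job at one of several hive-site.xml files without touching the conf directory (paths are hypothetical, and behaviour differs by deploy mode, so treat this as a sketch):

    # ship the file with the job; on YARN it lands in the container's
    # working directory, which is on the classpath
    spark-submit --files /etc/hive-env-a/hive-site.xml my_job.py

    # or put a directory containing hive-site.xml on the driver classpath
    spark-submit --driver-class-path /etc/hive-env-a/ my_job.py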

Unsubscribe

2017-02-12 Thread Vitásek, Ladislav

Re: Getting exit code of pipe()

2017-02-12 Thread Xuchen Yao
Cool, that's exactly what I was looking for! Thanks! How does one output the status into stdout? I mean, how does one capture the status output of the pipe() command? On Sat, Feb 11, 2017 at 9:50 AM, Felix Cheung wrote: > Do you want the job to fail if there is an error

RE: is dataframe thread safe?

2017-02-12 Thread Mendelson, Assaf
There are no threads within maps here. The idea is to have two jobs on two different threads which use the same dataframe (which is cached, btw). This does not override Spark's parallel execution of transformations or anything of the sort. The documentation (job scheduling) actually hints at this option but
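
A minimal PySpark sketch of that pattern (pool names and the toy dataframe are made up; with spark.scheduler.mode=FAIR the two jobs share executors instead of queueing FIFO):

    import threading

    df = spark.range(0, 1000000).cache()
    df.count()  # materialize the cache once, before the threads start

    def run(pool, action):
        # tag this thread's jobs with a fair-scheduler pool
        sc.setLocalProperty("spark.scheduler.pool", pool)
        print("%s: %s" % (pool, action()))

    t1 = threading.Thread(target=run,
                          args=("pool_a", lambda: df.where("id % 2 = 0").count()))
    t2 = threading.Thread(target=run,
                          args=("pool_b", lambda: df.groupBy().max("id").collect()))
    t1.start(); t2.start(); t1.join(); t2.join()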

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
Cf. also https://spark.apache.org/docs/latest/job-scheduling.html > On 12 Feb 2017, at 11:30, Jörn Franke wrote: > > I think you should have a look at the spark documentation. It has something > called scheduler who does exactly this. In more sophisticated environments >
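
From that page, the relevant knobs look roughly like this; note spark.scheduler.mode has to be set before the context starts, and the pool names are arbitrary:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.scheduler.mode", "FAIR")
            .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml"))
    sc = SparkContext(conf=conf)

where fairscheduler.xml defines the pools, along the lines of:

    <allocations>
      <pool name="pool_a">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
      </pool>
    </allocations>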

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
I did not doubt that submitting several jobs from one application makes sense. However, he wants to create threads within maps etc., which looks like asking for trouble (not only for running the application itself, but also for operating it in production within a shared cluster). I would

Re: is dataframe thread safe?

2017-02-12 Thread Yan Facai
DataFrame is immutable, so it should be thread safe, right? On Sun, Feb 12, 2017 at 6:45 PM, Sean Owen wrote: > No this use case is perfectly sensible. Yes it is thread safe. > > > On Sun, Feb 12, 2017, 10:30 Jörn Franke wrote: > >> I think you should

Re: Etl with spark

2017-02-12 Thread ayan guha
You can store the list of keys (I believe you use them in the source file path, right?) in a file, one key per line. Then you can read the file using sc.textFile (so you will get an RDD of file paths) and then apply your function as a map. r = sc.textFile(list_file).map(your_function) HTH On Sun,

Etl with spark

2017-02-12 Thread Sam Elamin
Hey folks, Really simple question here. I currently have an ETL pipeline that reads from S3 and saves the data to an end store. I have to read from a list of keys in S3, but I am doing a raw extract and then saving. Only some of the extracts have a simple transformation, but overall the code looks the

Re: is dataframe thread safe?

2017-02-12 Thread Sean Owen
No this use case is perfectly sensible. Yes it is thread safe. On Sun, Feb 12, 2017, 10:30 Jörn Franke wrote: > I think you should have a look at the spark documentation. It has > something called scheduler who does exactly this. In more sophisticated > environments yarn

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
I think you should have a look at the Spark documentation. It has something called the scheduler, which does exactly this. In more sophisticated environments, YARN or Mesos do this for you. Using threads for transformations does not make sense. > On 12 Feb 2017, at 09:50, Mendelson, Assaf

Re: Remove dependence on HDFS

2017-02-12 Thread ayan guha
How about adding more NFS storage? On Sun, 12 Feb 2017 at 8:14 pm, Sean Owen wrote: > Data has to live somewhere -- how do you not add storage but store more > data? Alluxio is not persistent storage, and S3 isn't on your premises. > > On Sun, Feb 12, 2017 at 4:29 AM

Re: Remove dependence on HDFS

2017-02-12 Thread Sean Owen
Data has to live somewhere -- how do you not add storage but store more data? Alluxio is not persistent storage, and S3 isn't on your premises. On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim wrote: > Has anyone got some advice on how to remove the reliance on HDFS for >

Re: Remove dependence on HDFS

2017-02-12 Thread Jörn Franke
You have to carefully choose whether your strategy makes sense given your users' workloads. Hence, I am not sure your reasoning makes sense. However, you can, for example, install OpenStack Swift as an object store and use this as storage, as sketched below. HDFS in this case can be used as a temporary store
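
A rough sketch of the Swift route, assuming the hadoop-openstack connector is on the classpath; the service name, endpoint, and credentials are placeholders, and the property names follow the hadoop-openstack documentation:

    hconf = sc._jsc.hadoopConfiguration()  # private PySpark handle to the Hadoop conf
    hconf.set("fs.swift.impl",
              "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
    hconf.set("fs.swift.service.myswift.auth.url",
              "https://keystone.example.com/v2.0/tokens")
    hconf.set("fs.swift.service.myswift.tenant", "my-tenant")
    hconf.set("fs.swift.service.myswift.username", "user")
    hconf.set("fs.swift.service.myswift.password", "secret")

    rdd = sc.textFile("swift://my-container.myswift/path/to/data")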

RE: is dataframe thread safe?

2017-02-12 Thread Mendelson, Assaf
I know Spark takes care of executing everything in a distributed manner; however, Spark also supports having multiple threads on the same spark session/context and knows (through the fair scheduler) to distribute the tasks from them in a round robin. The question is, can those two actions (with a

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
I am not sure what you are trying to achieve here. Spark takes care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you will not find documentation about it. > On 12 Feb 2017, at 09:06, Mendelson, Assaf

is dataframe thread safe?

2017-02-12 Thread Mendelson, Assaf
Hi, I was wondering if a dataframe is considered thread safe. I know the Spark session and Spark context are thread safe (and actually have tools to manage jobs from different threads), but the question is, can I use the same dataframe in both threads? The idea would be to create a dataframe in