Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
I think it does, because the user doesn't see their application logic and flow exactly the way Spark internals do. Of course we follow general guidelines for performance, but we shouldn't really have to care how exactly Spark decides to execute the DAG. The Spark scheduler or core can keep changing over time to optimize

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Takeshi Yamamuro
Hi, How about this? -- val func = udf((i: Int) => Tuple2(i, i)) val df = Seq((1, 0), (2, 5)).toDF("a", "b") df.select(func($"a").as("r")).select($"r._1", $"r._2") // maropu On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers wrote: > hello all, > > i have a single udf that
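A slightly fuller sketch of this approach (Spark 1.6-era API; assumes a spark-shell session with sqlContext and its implicits in scope, and illustrative column names):

```scala
// The UDF returns a Tuple2, which Spark exposes as a struct column; its two fields can
// then be pulled out as separate output columns.
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val func = udf((i: Int) => (i, i))
val df = Seq((1, 0), (2, 5)).toDF("a", "b")
df.select(func($"a").as("r"))
  .select($"r._1".as("x"), $"r._2".as("y"))
  .show()
```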

Re: Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-05-25 Thread Selvam Raman
XGBoost4J can integrate with Spark from version 1.6. Currently I am using Spark 1.5.2. Can I use XGBoost instead of XGBoost4J? Will both provide the same result? Thanks, Selvam R +91-97877-87724 On Mar 15, 2016 9:23 PM, "Nan Zhu" wrote: > Dear Spark Users and

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
But when you talk about optimizing the DAG, it really doesn't make sense to also talk about transformation steps as separate entities. The DAGScheduler knows about Jobs, Stages, TaskSets and Tasks. The TaskScheduler knows about TaskSets and Tasks. Neither of them understands the transformation

Kafka connection logs in Spark

2016-05-25 Thread Mail.com
Hi All, I am connecting Spark 1.6 streaming to Kafka 0.8.2 with Kerberos. I ran spark streaming in debug mode, but do not see any log saying it connected to Kafka or topic etc. How could I enable that. My spark streaming job runs but no messages are fetched from the RDD. Please suggest.
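One way to surface the consumer's connection chatter is to raise the log level for the Kafka packages at runtime; a minimal sketch using the log4j 1.x API that Spark 1.6 ships with (the logger names below are assumptions based on the 0.8.x consumer and the Spark Kafka integration):

```scala
// Raise log levels so broker/topic connection activity shows up in the driver/executor logs.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("kafka").setLevel(Level.DEBUG)                              // assumed logger name
Logger.getLogger("org.apache.spark.streaming.kafka").setLevel(Level.DEBUG)  // assumed logger name
```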

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
Hi Mark, I might have said stage instead of step in my last statement: "UI just says Collect failed but in fact it could be any stage in that lazy chain of evaluation." Anyway, even you agree that this visibility of the underlying steps won't be available, which does pose difficulties in terms of

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Mark Hamstra
You appear to be misunderstanding the nature of a Stage. Individual transformation steps such as `map` do not define the boundaries of Stages. Rather, a sequence of transformations in which there is only a NarrowDependency between each of the transformations will be pipelined into a single Stage.
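As an illustration of that point (a sketch; the input path is made up): every map-like step below shares a Stage, and only the shuffle dependency introduced by reduceByKey starts a new one.

```scala
// Narrow dependencies are pipelined into one Stage; the shuffle is the Stage boundary.
val pairs = sc.textFile("hdfs:///path/to/input")   // hypothetical input path
  .flatMap(_.split("\\s+"))                        // narrow -> same Stage
  .map(word => (word, 1))                          // narrow -> same Stage

val counts = pairs.reduceByKey(_ + _)              // wide (shuffle) -> a new Stage starts here
  .map { case (w, n) => s"$w\t$n" }                // pipelined into the post-shuffle Stage

counts.count()                                     // the action submits a Job with 2 Stages
```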

User impersonation with Kerberos and Delegation tokens

2016-05-25 Thread Sudarshan Rangarajan
Hi there, I'm using SparkLauncher API from Spark v1.6.1, to submit a Spark job to YARN. The service from where this API will be invoked will need to talk to other services on our cluster via Kerberos (ex. HDFS, YARN etc.). Also, my service expects to impersonate its then logged-in user during job

Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
It's great that the Spark scheduler does optimized DAG processing and only does lazy evaluation when some action is performed or a shuffle dependency is encountered. Sometimes it goes even further past the shuffle dependency before executing anything, e.g. if there are map steps after the shuffle then it doesn't stop at the shuffle

Re: never understand

2016-05-25 Thread Andrew Ehrlich
- Try doing less in each transformation - Try using different data structures within the transformations - Try not caching anything to free up more memory On Wed, May 25, 2016 at 1:32 AM, pseudo oduesp wrote: > hi guys , > -i get this errors with pyspark 1.5.0 under

Re: Preference and confidence in ALS implicit preferences output?

2016-05-25 Thread Sean Owen
This isn't specific to Spark, but there is not a direct relation. An input preference is a count-like value, which is converted into confidence values via the 1 + alpha*value formula. But the matrix that is factored is the 0/1 matrix mentioned in the paper, and the resulting 'predictions' are
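In code form, the mapping described here (from the implicit-feedback ALS paper) looks roughly like this; alpha is the confidence parameter:

```scala
// Sketch only: the raw count-like rating r is never predicted directly. It is turned into a
// 0/1 preference p and a confidence weight c, and ALS factors the preference matrix weighted
// by confidence, so the model's "prediction" approximates p (roughly in [0, 1]), not r.
def preference(r: Double): Double = if (r > 0) 1.0 else 0.0
def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * r
```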

Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
oh yes, this was by accident, it should have gone to dev On Wed, May 25, 2016 at 4:20 PM, Reynold Xin wrote: > Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533 > > @Koert - Please keep API feedback coming. One thing - in the future, can > you send api

Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533 @Koert - Please keep API feedback coming. One thing - in the future, can you send api feedbacks to the dev@ list instead of user@? On Wed, May 25, 2016 at 1:05 PM, Cheng Lian wrote: > Agree, since

Re: Pros and Cons

2016-05-25 Thread Reynold Xin
On Wed, May 25, 2016 at 9:52 AM, Jörn Franke wrote: > Spark is more for machine learning working iteratively over the whole same > dataset in memory. Additionally it has streaming and graph processing > capabilities that can be used together. > Hi Jörn, The first part is

unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Koert Kuipers
hello all, i have a single udf that creates 2 outputs (so a tuple 2). i would like to add these 2 columns to my dataframe. my current solution is along these lines: df .withColumn("_temp_", udf(inputColumns)) .withColumn("x", col("_temp_")("_1")) .withColumn("y", col("_temp_")("_2"))

Re: feedback on dataset api explode

2016-05-25 Thread Cheng Lian
Agree, since they can be easily replaced by .flatMap (to do explosion) and .select (to rename output columns) Cheng On 5/25/16 12:30 PM, Reynold Xin wrote: Based on this discussion I'm thinking we should deprecate the two explode functions. On Wednesday, May 25, 2016, Koert Kuipers

Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
Based on this discussion I'm thinking we should deprecate the two explode functions. On Wednesday, May 25, 2016, Koert Kuipers wrote: > wenchen, > that definition of explode seems identical to flatMap, so you dont need it > either? > > michael, > i didn't know about the

Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
wenchen, that definition of explode seems identical to flatMap, so you dont need it either? michael, i didn't know about the column expression version of explode, that makes sense. i will experiment with that instead. On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan wrote:

Re: feedback on dataset api explode

2016-05-25 Thread Michael Armbrust
These APIs predate Datasets / encoders, so that is why they are Row instead of objects. We should probably rethink that. Honestly, I usually end up using the column expression version of explode now that it exists (i.e. explode($"arrayCol").as("Item")). It would be great to understand more why
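For reference, the column-expression form mentioned here looks roughly like this in 1.6 (column names are illustrative; assumes sqlContext and its implicits are in scope):

```scala
// explode() as a column expression: each element of the array column becomes its own row.
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "arrayCol")
df.select($"id", explode($"arrayCol").as("item")).show()
```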

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Ah right i see. Thank you very much. On May 25, 2016 11:11 AM, "Cody Koeninger" wrote: > There's an overloaded createDirectStream method that takes a map from > topicpartition to offset for the starting point of the stream. > > On Wed, May 25, 2016 at 9:59 AM, trung kien

Re: Not able to write output to local filsystem from Standalone mode.

2016-05-25 Thread Mathieu Longtin
Experience. I don't use Mesos or Yarn or Hadoop, so I don't know. On Wed, May 25, 2016 at 2:51 AM Jacek Laskowski wrote: > Hi Mathieu, > > Thanks a lot for the answer! I did *not* know it's the driver to > create the directory. > > You said "standalone mode", is this the case

The 7th and Largest Spark Summit is less than 2 weeks away!

2016-05-25 Thread Scott walent
With every Spark Summit, an Apache Spark Community event, increasing numbers of users and developers attend. This is the seventh Summit, and whether you believe that “Seven” is the world’s most popular number, we are offering a special promo code for all Apache Spark users and developers on this

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Sonal Goyal
You can look at ways to group records from both rdds together instead of doing Cartesian. Say generate pair rdd from each with first letter as key. Then do a partition and a join. On May 25, 2016 8:04 PM, "Priya Ch" wrote: > Hi, > RDD A is of size 30MB and RDD B
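A rough sketch of that blocking idea (rddA, rddB, and the similarity() scorer are assumed names; assumes non-empty strings):

```scala
// Key both RDDs by a cheap blocking key (the first letter), so candidate pairs are only
// generated within a block instead of across the full cartesian product.
val keyedA = rddA.map(s => (s.head.toLower, s))
val keyedB = rddB.map(s => (s.head.toLower, s))

val matches = keyedA.join(keyedB)                       // (key, (stringFromA, stringFromB))
  .values
  .filter { case (a, b) => similarity(a, b) > 0.85 }    // similarity() is the user's fuzzy scorer
```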

GraphFrame graph partitioning

2016-05-25 Thread rohit13k
How to do graph partition in GraphFrames similar to the partitionBy feature in GraphX? Can we use the Dataframe's repartition feature in 1.6 to provide a graph partitioning in graphFrames? -- View this message in context:

Re: Pros and Cons

2016-05-25 Thread Jörn Franke
Hive has a little bit more emphasis on the case where the data that is queried is much bigger than available memory, or when you need to query many different small data subsets, or, recently, interactive queries (LLAP etc.). Spark is more for machine learning working iteratively over the whole

Preference and confidence in ALS implicit preferences output?

2016-05-25 Thread edezhath
The original paper that the implicit preferences version of ALS is based on, mentions a "preference" and "confidence" for each user-item pair. But spark.ml.recommender.ALS only outputs a "prediction". How is this related to preference and confidence? -- View this message in context:

sparkApp on standalone/local mode with multithreading

2016-05-25 Thread sujeet jog
I had a few questions w.r.t. Spark deployment and the way I want to use it; it would be helpful if you could answer a few. I plan to use Spark on an embedded switch, which has a limited set of resources, like say 1 or 2 dedicated cores and 1.5GB of memory. I want to model network traffic with time series

Re: Accumulators displayed in SparkUI in 1.4.1?

2016-05-25 Thread Jacek Laskowski
On 25 May 2016 6:00 p.m., "Daniel Barclay" wrote: > > Was the feature of displaying accumulators in the Spark UI implemented in Spark 1.4.1, or was that added later? Dunno, but only *named* *accumulators* are displayed in Spark’s webUI (under Stages tab for a given
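For example (a self-contained sketch; the accumulator name and the counted condition are arbitrary):

```scala
// Only accumulators created with a name appear in the web UI, under the Stages tab.
val multiplesOfTen = sc.accumulator(0L, "multiples of ten")   // named => visible in the UI
sc.parallelize(1 to 100).foreach { n =>
  if (n % 10 == 0) multiplesOfTen += 1L
}
```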

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
There's an overloaded createDirectStream method that takes a map from topicpartition to offset for the starting point of the stream. On Wed, May 25, 2016 at 9:59 AM, trung kien wrote: > Thank Cody. > > I can build the mapping from time ->offset. However how can i pass this >
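A sketch of that overload for the Spark 1.6 / Kafka 0.8 direct API (topic names and offset values are illustrative; ssc and kafkaParams are assumed to be in scope):

```scala
// Supply explicit starting offsets per TopicAndPartition, e.g. looked up from your own
// time -> offset index, plus a messageHandler to turn each record into the stream's type.
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val fromOffsets = Map(
  TopicAndPartition("mytopic", 0) -> 123456L,
  TopicAndPartition("mytopic", 1) -> 123789L
)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
)
```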

Re: Pros and Cons

2016-05-25 Thread Mich Talebzadeh
Can you be a bit more specific about how you are going to use Spark? For example, as a powerful query tool, analytics, or data migration. Spark SQL and spark-shell provide a subset of Hive SQL (depending on which version of Hive and Spark you have in mind). As a query tool Spark is very powerful as it

Accumulators displayed in SparkUI in 1.4.1?

2016-05-25 Thread Daniel Barclay
Was the feature of displaying accumulators in the Spark UI implemented in Spark 1.4.1, or was that added later? Thanks, Daniel - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail:

feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
we currently have 2 explode definitions in Dataset: def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame 1) the separation of the functions

Pros and Cons

2016-05-25 Thread Aakash Basu
Hi, I’m new to the Spark ecosystem and need to understand the pros and cons of fetching data using SparkSQL vs Hive in Spark vs the Spark API. PLEASE HELP! Thanks, Aakash Basu.

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso Isidoro Roman
Hi Matthias and Cody, thanks for the answer. This is the code that is raising the runtime exception: try{ messages.foreachRDD( rdd =>{ val count = rdd.count() if (count > 0){ //someMessages should be AmazonRating... val someMessages = rdd.take(count.toInt)

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Thanks Cody. I can build the mapping from time -> offset. However, how can I pass this offset to the Spark Streaming job (using the Direct Approach)? On May 25, 2016 9:42 AM, "Cody Koeninger" wrote: > Kafka does not yet have meaningful time indexing, there's a kafka

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Matthias Niehoff
Hi, you register some output actions (in this case foreachRDD) after starting the streaming context. StreamingContext.start() has to be called after all! output actions. 2016-05-25 15:59 GMT+02:00 Alonso : > Hi, i am receiving this exception when direct spark streaming
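In other words (a minimal ordering sketch, using a socket source only for brevity):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("ordering-example"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

// Register every output action BEFORE start(); nothing may be added afterwards.
lines.foreachRDD { rdd => println(s"batch size: ${rdd.count()}") }

ssc.start()             // only after all output actions are registered
ssc.awaitTermination()
```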

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
Kafka does not yet have meaningful time indexing, there's a kafka improvement proposal for it but it has gotten pushed back to at least 0.10.1 If you want to do this kind of thing, you will need to maintain your own index from time to offset. On Wed, May 25, 2016 at 8:15 AM, trung kien

Re: Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Cody Koeninger
I'd fix the kafka version on the executor classpath (should be 0.8.2.1) before trying anything else, even if it may be unrelated to the actual error. Definitely don't upgrade your brokers to 0.9 On Wed, May 25, 2016 at 2:30 AM, Scott W wrote: > I'm running into below error

Re: about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Cody Koeninger
Am I reading this correctly that you're calling messages.foreachRDD inside of the messages.foreachRDD block? Don't do that. On Wed, May 25, 2016 at 8:59 AM, Alonso wrote: > Hi, i am receiving this exception when direct spark streaming process > tries to pull data from kafka

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi, RDD A is of size 30MB and RDD B is of size 8 MB. Upon matching, we would like to filter out the strings that have greater than 85% match and generate a score for it which is used in the subsequent calculations. I tried generating pair rdd from both rdds A and B with same key for all the

Re: job build cost more and more time

2016-05-25 Thread nguyen duc tuan
Take a look in here: http://stackoverflow.com/questions/33424445/is-there-a-way-to-checkpoint-apache-spark-dataframes So all you have to do to create a checkpoint for a dataframe is as follows: df.rdd.checkpoint df.rdd.count // or any action 2016-05-25 8:43 GMT+07:00 naliazheli <754405...@qq.com>: >
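Spelled out a bit more (Spark 1.6; df is the DataFrame in question, and the checkpoint directory is illustrative):

```scala
// Checkpoint the DataFrame's underlying RDD; an action is needed to actually materialize it.
sc.setCheckpointDir("hdfs:///tmp/df-checkpoints")   // hypothetical path
df.rdd.checkpoint()
df.rdd.count()                                      // or any other action
```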

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Solr or Elastic search provide much more functionality and are faster in this context. The decision for or against them depends on your current and future use cases. Your current use case is still very abstract so in order to get a more proper recommendation you need to provide more details

about an exception when receiving data from kafka topic using Direct mode of Spark Streaming

2016-05-25 Thread Alonso
Hi, i am receiving this exception when direct spark streaming process tries to pull data from kafka topic: 16/05/25 11:30:30 INFO CheckpointWriter: Checkpoint for time 146416863 ms saved to file 'file:/Users/aironman/my-recommendation-spark-engine/checkpoint/checkpoint-146416863', took

Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Hi all, Is there any way to re-compute using the Spark Streaming - Kafka Direct Approach from a specific time? In some cases, I want to re-compute again from a specific time (e.g. the beginning of the day). Is that possible? -- Thanks Kien

StackOverflow in Spark

2016-05-25 Thread Michel Hubert
Hi, I have a Spark application which generates StackOverflowError exceptions after 30+ min. Anyone have any ideas? It seems like a problem with deserialization of checkpoint data? 16/05/25 10:48:51 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 55449.0 (TID 5584,

Re: Facing issues while reading parquet file in spark 1.2.1

2016-05-25 Thread Takeshi Yamamuro
Hi, You need to describe more so that others can easily understand: what version of Spark are you using, and what query? // maropu On Wed, May 25, 2016 at 8:27 PM, vaibhav srivastava wrote: > Hi All, > > I am facing below stack traces while reading data from parquet

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Why do I need to deploy Solr for text analytics... I have files placed in HDFS. I just need to look for matches against each string in both files and generate those records whose match is > 85%. We are trying fuzzy match logic. How can I use map/reduce operations across 2 rdds? Thanks, Padma Ch On

Facing issues while reading parquet file in spark 1.2.1

2016-05-25 Thread vaibhav srivastava
Hi All, I am facing below stack traces while reading data from parquet file Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:247) at

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Alternatively depending on the exact use case you may employ solr on Hadoop for text analytics > On 25 May 2016, at 12:57, Priya Ch wrote: > > Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of > strings as

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
No this is not needed, look at the map / reduce operations and the standard spark word count > On 25 May 2016, at 12:57, Priya Ch wrote: > > Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of > strings as {"padma","hihi","chch","priya"}.

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Let's say I have RDD A of strings as {"hi","bye","ch"} and another RDD B of strings as {"padma","hihi","chch","priya"}. For every string in RDD A I need to check the matches found in RDD B; for example, for string "hi" I have to check the matches against all strings in RDD B, which means I need to generate

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
What is the use case of this ? A Cartesian product is by definition slow in any system. Why do you need this? How long does your application take now? > On 25 May 2016, at 12:42, Priya Ch wrote: > > I tried >

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
I tried dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even this is taking too much time. Thanks, Padma Ch On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro wrote: > Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as >

Re: Using Java in Spark shell

2016-05-25 Thread Keith
There is no java shell in spark. > On May 25, 2016, at 1:11 AM, Ashok Kumar wrote: > > Hello, > > A newbie question. > > Is it possible to use java code directly in spark shell without using maven > to build a jar file? > > How can I switch from scala to java

Re: Using Java in Spark shell

2016-05-25 Thread Ted Yu
I found this: :javap disassemble a file or class name But no direct interpretation of Java code. On Tue, May 24, 2016 at 10:11 PM, Ashok Kumar wrote: > Hello, > > A newbie question. > > Is it possible to use java code directly in spark shell

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread spark.raj
Thank you for your help Mich. Thanks, Rajesh. On Wednesday, May 25, 2016 3:14 PM, Mich Talebzadeh wrote: You may have some memory issues OOM etc that terminated the process. Dr Mich Talebzadeh LinkedIn

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as parquet, orc, ...? // maropu On Wed, May 25, 2016 at 7:10 PM, Priya Ch wrote: > Hi , Yes I have joined using DataFrame join. Now to save this into hdfs .I > am converting the joined dataframe
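For example (a sketch; joinedDF and the output path are assumed names):

```scala
// Write the joined DataFrame directly in a columnar format instead of going through
// DataFrame -> RDD -> saveAsTextFile.
joinedDF.write.parquet("hdfs:///output/joined_parquet")
```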

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi, yes I have joined using DataFrame join. Now, to save this into HDFS, I am converting the joined dataframe to an rdd (dataframe.rdd) and using saveAsTextFile to try to save it. However, this is also taking too much time. Thanks, Padma Ch On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Mich Talebzadeh
You may have some memory issues OOM etc that terminated the process. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread spark.raj
Hi Friends, In the YARN log files of the nodemanager I can see the error below. Can I know why I am getting this error? ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM Thanks, Rajesh. On Wednesday, May 25, 2016 1:08 PM,

Re: Is it a bug?

2016-05-25 Thread Zheng Wendell
Any update? On Sun, May 8, 2016 at 3:17 PM, Ted Yu wrote: > I don't think so. > RDD is immutable. > > On May 8, 2016, at 2:14 AM, Sisyphuss wrote:

Re: How does Spark set task indexes?

2016-05-25 Thread Adrien Mogenet
Yes I've noticed this one and its related cousin, but not sure this is the same issue there; our job "properly" ends after 6 attempts. We'll try with disabled speculative mode anyway! On 25 May 2016 at 00:13, Ted Yu wrote: > Have you taken a look at SPARK-14915 ? > > On

never understand

2016-05-25 Thread pseudo oduesp
hi guys, I get these errors with pyspark 1.5.0 under Cloudera CDH 5.5 (yarn). I use yarn to deploy jobs on the cluster. I use hive context and parquet files to save my data. The container limit is 16 GB; the executor memory I tested before is 12 GB. I tried to increase the number of

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Hi, It seems you'd be better off using DataFrame#join instead of RDD.cartesian, because the latter always needs shuffle operations, which have a lot of overheads such as reflection, serialization, ... In your case, since the smaller table is 7MB, DataFrame#join uses a broadcast strategy. This is a little
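A sketch of that suggestion (dfA, dfB, and the join key are assumed names):

```scala
// An explicit broadcast hint makes sure the small (~7MB) side is shipped to every executor,
// avoiding a shuffle of the larger side.
import org.apache.spark.sql.functions.broadcast

val joined = dfA.join(broadcast(dfB), "key")
```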

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Mich Talebzadeh
Yes check the yarn log files both resourcemanager and nodemanager. Also ensure that you have set up work directories consistently, especially yarn.nodemanager.local-dirs HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Scott W
I'm running into the below error while trying to consume messages from Kafka through Spark Streaming (Kafka direct API). This used to work OK when using the Spark standalone cluster manager. We're just switching to Cloudera 5.7 using YARN to manage the Spark cluster and started to see the below error.

Re: run multiple spark jobs yarn-client mode

2016-05-25 Thread Jeff Zhang
Could you check the yarn app logs ? On Wed, May 25, 2016 at 3:23 PM, wrote: > Hi, > > I am running spark streaming job on yarn-client mode. If run muliple jobs, > some of the jobs failing and giving below error message. Is there any > configuration missing? > >

run multiple spark jobs yarn-client mode

2016-05-25 Thread spark.raj
Hi, I am running a spark streaming job in yarn-client mode. If I run multiple jobs, some of the jobs fail and give the below error message. Is there any configuration missing? ERROR apache.spark.util.Utils - Uncaught exception in thread main java.lang.NullPointerException     at

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Mich Talebzadeh
It is basically a Cartesian join like RDBMS Example: SELECT * FROM FinancialCodes, FinancialData The results of this query matches every row in the FinancialCodes table with every row in the FinancialData table. Each row consists of all columns from the FinancialCodes table followed by all

python application cluster mode in standalone spark cluster

2016-05-25 Thread Jan Sourek
As the official documentation states, 'Currently only YARN supports cluster mode for Python applications.' I would like to know if work is being done or planned to support cluster mode for Python applications on standalone Spark clusters? -- View this message in context:

Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi All, I have two RDDs A and B, where A is of size 30 MB and B is of size 7 MB; A.cartesian(B) is taking too much time. Is there any bottleneck in the cartesian operation? I am using Spark version 1.6.0. Regards, Padma Ch

Re: Not able to write output to local filsystem from Standalone mode.

2016-05-25 Thread Jacek Laskowski
Hi Mathieu, Thanks a lot for the answer! I did *not* know it's the driver to create the directory. You said "standalone mode", is this the case for the other modes - yarn and mesos? p.s. Did you find it in the code or...just experienced before? #curious Pozdrawiam, Jacek Laskowski

Re: why spark 1.6 use Netty instead of Akka?

2016-05-25 Thread Chaoqiang
Todd, Jakob, Thank you for your replies. But there are still some questions I can't understand, such as the following: 1. What is the effect of Akka's dependencies on Spark? 2. What are the detailed conflicts with projects using both Spark and Akka? 3. Which portion of Akka's features was used in Spark,