RE: Do we really need mesos or yarn? or is standalone sufficent?

2016-12-16 Thread Saif.A.Ellafi
In my experience, Standalone works very well in a small cluster where nothing else is running. In a bigger cluster or with shared resources, YARN wins, despite the overhead of spawning containers as opposed to an already-running background worker. Best is if you try both, if standalone is good

Cluster deploy mode driver location

2016-11-21 Thread Saif.A.Ellafi
Hello there, I have a Spark program in 1.6.1; however, when I submit it to the cluster, it randomly picks the driver node. I know there is a driver specification option, but along with it, it is mandatory to define many other options I am not familiar with. The trouble is, the .jars I am launching need

RE: Applying a limit after orderBy of big dataframe hangs spark

2016-08-05 Thread Saif.A.Ellafi
Hi thanks for the assistance, 1. Standalone 2. df.orderBy(field).limit(5000).write.parquet(...) From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, August 05, 2016 4:33 PM To: Ellafi, Saif A. Cc: user @spark Subject: Re: Applying a limit after orderBy of big

Applying a limit after orderBy of big dataframe hangs spark

2016-08-05 Thread Saif.A.Ellafi
Hi all, I am working with a 1.5 billion row dataframe in a small cluster and trying to apply an orderBy operation on one of the Long type columns. If I limit that output to some number, say 5 million, then trying to count, persist or store the dataframe makes Spark crash, losing executors
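
A minimal sketch of the pattern under discussion, plus one alternative that avoids a full global sort, assuming a Long column named "amount" (the name is hypothetical):

  import org.apache.spark.sql.Row
  // sort-then-limit, as in the follow-up message
  df.orderBy("amount").limit(5000).write.parquet("/tmp/top_rows")
  // alternative: take the N smallest rows by that column without sorting the whole dataframe
  val idx = df.columns.indexOf("amount")
  val top: Array[Row] = df.rdd.takeOrdered(5000)(Ordering.by((r: Row) => r.getLong(idx)))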

multiple SPARK_LOCAL_DIRS causing strange behavior in parallelism

2016-07-29 Thread Saif.A.Ellafi
Hi all, I was playing around with SPARK_LOCAL_DIRS in spark-env in order to add additional shuffle storage. But since I did this, I am getting "too many open files" errors if the total executor cores is high. I am also getting low parallelism, by monitoring the running tasks on some

RE: Strategies for properly load-balanced partitioning

2016-06-03 Thread Saif.A.Ellafi
Appreciate the follow-up. I am not entirely sure how or why my question is related to bucketization capabilities. It indeed sounds like a powerful feature to avoid shuffling, but in my case, I am referring to the straightforward process of reading data and writing to parquet. If bucket tables

Strategies for properly load-balanced partitioning

2016-06-03 Thread Saif.A.Ellafi
Hello everyone! I was noticing that, when reading parquet files or actually any kind of source data frame data (spark-csv, etc), the default partitioning is not fair. Action tasks usually run very fast on some partitions and very slow on others, and frequently, even fast on all but the last

spark-submit not adding application jar to classpath

2016-04-18 Thread Saif.A.Ellafi
Hi, I am submitting a jar file to spark-submit, which has some content inside src/main/resources. I am unable to access those resources, since the application jar is not being added to the classpath. This works fine if I also include the application jar in the --driver-class-path entry. Is

RE: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Saif.A.Ellafi
Appreciated Michael, but this doesn’t help my case; the filter string is being submitted from outside my program. Is there any other alternative? Some literal string parser or anything I can do beforehand? Saif From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, April 13, 2016

Strange bug: Filter problem with parenthesis

2016-04-13 Thread Saif.A.Ellafi
Hi, I am debugging a program, and for some reason, a line calling the following is failing: df.filter("sum(OpenAccounts) > 5").show It says it cannot find the column OpenAccounts, as if it were applying the sum() function and looking for a column by that name, which does not exist. This
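
For context, sum() is an aggregate expression, so it normally only resolves after a grouping step. A minimal sketch of the aggregate-then-filter pattern (the grouping column "customer" is hypothetical):

  import org.apache.spark.sql.functions.{col, sum}
  df.groupBy("customer")
    .agg(sum("OpenAccounts").as("totalOpen"))
    .filter(col("totalOpen") > 5)
    .show()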

Can we load csv partitioned data into one DF?

2016-02-22 Thread Saif.A.Ellafi
Hello all, I am facing a silly data question. If I have 100+ csv files which are part of the same dataset, but each csv is, for example, one year of a timeframe column (i.e. partitioned by year), what would you suggest instead of loading all those files and joining them? The final target would be
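
A minimal sketch, assuming the files share a schema and spark-csv is on the classpath (the path and header option are hypothetical); since the files belong to one dataset, they can be read in a single pass (or read individually and combined with unionAll) rather than joined:

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/data/myset/year=*/*.csv")   // glob picks up every yearly file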

jobs much slower in cluster mode vs local

2016-01-15 Thread Saif.A.Ellafi
Hello, In general, I am able to run spark-submit jobs in local mode, on a 32-core node with plenty of RAM. The performance is significantly faster in local mode than when using a cluster of spark workers. How can this be explained, and what measures can one take in order to

RE: jobs much slower in cluster mode vs local

2016-01-15 Thread Saif.A.Ellafi
Thank you, this looks useful indeed for what I have in mind. Saif From: Jiří Syrový [mailto:syrovy.j...@gmail.com] Sent: Friday, January 15, 2016 12:06 PM To: Ellafi, Saif A. Cc: user@spark.apache.org Subject: Re: jobs much slower in cluster mode vs local Hi, you can try to use spark job

Big data job only finishes with Legacy memory management

2016-01-12 Thread Saif.A.Ellafi
Hello, I am tinkering with Spark 1.6. I have this 1.5 billion row dataset, to which I apply several window functions such as lag, first, etc. The job is quite expensive; I am running a small cluster with executors running with 70GB of RAM. Using the new memory management system, the job fails around

RE: Is Spark 1.6 released?

2016-01-04 Thread Saif.A.Ellafi
Where can I read more about the Dataset API at the user level? I am failing to find an API doc or to understand when to use DataFrame or Dataset, advantages, etc. Thanks, Saif -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Monday, January 04, 2016 2:01 PM To:

Limit of application submission to cluster

2015-12-18 Thread Saif.A.Ellafi
Hello everyone, I am testing some parallel program submission to a standalone cluster. Everything works alright; the problem is, for some reason, I can't submit more than 3 programs to the cluster. The fourth one, whether legacy or REST, simply hangs until one of the first three completes. I

ML - LinearRegression: is this a bug ????

2015-12-02 Thread Saif.A.Ellafi
Data:
+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|0.13271745268556925|[-0.2006809895664...|
|0.23956421080605234|[-0.0938342314459...|
|0.47464690691431843|[0.14124846466227...|
|

RE: Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Saif.A.Ellafi
Hi, just a couple of cents: are your join columns StringType (the id field)? I have recently reported a bug where I was getting inconsistent results when filtering String fields in group operations. Saif From: Colin Alstad [mailto:colin.als...@pokitdok.com] Sent: Wednesday, October 28, 2015 12:39 PM

Kryo makes String data invalid

2015-10-26 Thread Saif.A.Ellafi
Hi all, I have a parquet file, which I am loading in a shell. When I launch the shell with --driver-java-options="-Dspark.serializer=...kryo", it makes a couple of fields look like: 03-?? ??-?? ??-??? when calling data.first. I will confirm briefly, but I am utterly sure it happens

Results change in group by operation

2015-10-26 Thread Saif.A.Ellafi
Hello Everyone, I need urgent help with a data consistency issue I am having. Standalone cluster of five servers. sqlContext is an instance of HiveContext (the default in spark-shell). No special options other than driver memory and executor memory. Parquet partitions are 512 where there are 160

Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Hello everyone, I am doing some analytics experiments on a 4-server standalone cluster in a spark shell, mostly involving a huge database with groupBy and aggregations. I am picking 6 groupBy columns and returning various aggregated results in a dataframe. The groupBy fields are of two types,

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Thanks. Sorry, I cannot share the data and I am not sure how significant it would be for you. I am reproducing the issue on a smaller piece of the content to see whether I can find a reason for the inconsistency. val res2 = data.filter($"closed" === $"ever_closed").groupBy("product", "band ", "aget",

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Never mind my last email; res2 is filtered, so my test does not make sense. The issue is not reproduced there. I have the problem somewhere else. From: Ellafi, Saif A. Sent: Thursday, October 22, 2015 12:57 PM To: 'Xiao Li' Cc: user Subject: RE: Spark groupby and agg inconsistent and missing data

How to speed up reading from file?

2015-10-16 Thread Saif.A.Ellafi
Hello, Is there an optimal number of partitions per number of rows when writing to disk, so we can later re-read it from the source in a distributed way? Any thoughts? Thanks Saif

thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Saif.A.Ellafi
Hi, Is it possible to load a spark-shell, in which we do any number of operations in a dataframe, then register it as a temporary table and get to see it through thriftserver? ps. or even better, submit a full job and store the dataframe in thriftserver in-memory before the job completes. I

Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Has anyone tried the shuffle service in standalone cluster mode? I want to enable it for dynamic allocation, but my jobs never start when I submit them. This happens with all my jobs. 15/10/13 08:29:45 INFO DAGScheduler: Job 0 failed: json at DataLoader.scala:86, took 16.318615 s Exception in thread "main"

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Hi, thanks Executors are simply failing to connect to a shuffle server: 15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager 15/10/13 08:29:34 INFO BlockManager: Registering executor with local external shuffle service. 15/10/13 08:29:34 ERROR BlockManager: Failed to connect to

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
I believe the confusion here is self-answered. The thing is that in the documentation, the spark shuffle service is described only under YARN, while here we are talking about a standalone cluster. The proper question is: how does one launch a shuffle service for standalone? Saif From:

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Thanks, I missed that one. From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, October 13, 2015 2:36 PM To: Ellafi, Saif A. Cc: user@spark.apache.org Subject: Re: Spark shuffle service does not work in stand alone You have to manually start the shuffle service if you're not running

Best storage format for intermediate process

2015-10-09 Thread Saif.A.Ellafi
Hi all, I am in the process of learning big data. Right now, I am bringing huge databases into Spark through JDBC (a 250 million row table can take around 3 hours), and then re-saving them as JSON, which is fast, simple, distributed, fail-safe and stores data types, although without any

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Where can we find other available functions such as lit() ? I can’t find lit in the api. Thanks From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, October 09, 2015 4:04 PM To: unk1102 Cc: user Subject: Re: How to calculate percentile of a column of DataFrame? You can use

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Yes, but I mean, this is rather curious. How does def lit(literal: Any) become a percentile function in lit(25)? Thanks for the clarification. Saif From: Umesh Kacha [mailto:umesh.ka...@gmail.com] Sent: Friday, October 09, 2015 4:10 PM To: Ellafi, Saif A. Cc: Michael Armbrust; user Subject: Re: How to

RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi all, would this be a bug?? val ws = Window.partitionBy("clrty_id").orderBy("filemonth_dtt") val nm = "repeatMe" df.select(df.col("*"), rowNumber().over(ws).cast("int").as(nm))

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi, thanks for looking into it. v1.5.1. I am really worried. I don't have Hive/Hadoop for real in the environment. Saif From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, October 08, 2015 2:57 PM To: Ellafi, Saif A. Cc: user Subject: Re: RowNumber in HiveContext returns null or

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
It turns out this does not happen in local[32] mode. It only happens when submitting to the standalone cluster. I don’t have YARN/Mesos to compare. Will keep diagnosing. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October 08, 2015 3:01 PM To:

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Repartitioning and setting default parallelism to 1, in cluster mode, is still broken. So the problem is not the parallelism, but the cluster mode itself. Something is wrong with HiveContext + cluster mode. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October

RE: Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
It can be large, yes. But that still does not resolve the question of why it works in a smaller environment, i.e. local[32], or in cluster mode when using SQLContext instead of HiveContext. The process, in general, is a RowNumber() HiveQL operation, which is why I need HiveContext. I have the

Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
When running stand-alone cluster mode job, the process hangs up randomly during a DataFrame flatMap or explode operation, in HiveContext: -->> df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r) This does not happen either with SQLContext in cluster, or Hive/SQL in local mode, where it

Help with big data operation performance

2015-10-06 Thread Saif.A.Ellafi
Hi all, In a stand-alone cluster operation, with more than 80 GB of RAM in each node, I am trying to: 1. load a partitioned json dataframe which weighs around 100GB as input 2. apply transformations such as casting some column types 3. get some percentiles, which involves sort by,
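
A minimal sketch of step 2, casting a column type in place (the column name "balance" and the target type are hypothetical):

  val casted = df.withColumn("balance", df("balance").cast("double"))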

Please help: Processes with HiveContext slower in cluster

2015-10-05 Thread Saif.A.Ellafi
Hi, I have a HiveContext job which takes less than 1 minute to complete in local mode with 16 cores. However, when I launch it on the standalone cluster, it takes forever and probably can't even finish. This happens even when the only node up in the cluster is the same one on which I run it locally. How could I

How to change verbosity level and redirect verbosity to file?

2015-10-05 Thread Saif.A.Ellafi
Hi, I would like to read the full spark-submit log once a job is completed; since the output does not go to stdout, how could I redirect Spark's output to a file in linux? Thanks, Saif

RE: Accumulator of rows?

2015-10-02 Thread Saif.A.Ellafi
Thank you, exactly what I was looking for. I have read of it before but never associated. Saif From: Adrian Tanase [mailto:atan...@adobe.com] Sent: Friday, October 02, 2015 8:24 AM To: Ellafi, Saif A.; user@spark.apache.org Subject: Re: Accumulator of rows? Have you seen window functions?

from groupBy return a DataFrame without aggregation?

2015-10-02 Thread Saif.A.Ellafi
Given ID, DATE, I need all dates sorted per ID; what is the easiest way? I got this, but I don't like it: val l = zz.groupBy("id", " dt").agg($"dt".as("dummy")).sort($"dt".asc) Saif
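
A minimal sketch of one way to get the same result without a dummy aggregation, assuming the goal is simply every (id, dt) pair ordered within each id (column names as in the snippet above):

  val sorted = zz.select("id", "dt").distinct().sort($"id".asc, $"dt".asc)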

RE: how to broadcast huge lookup table?

2015-10-02 Thread Saif.A.Ellafi
Hi, thank you I would prefer to leave writing-to-disk as a last resort. Is it a last resort? Saif From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, October 02, 2015 3:54 PM To: Ellafi, Saif A. Cc: user Subject: Re: how to broadcast huge lookup table? Have you considered using external

how to broadcast huge lookup table?

2015-10-02 Thread Saif.A.Ellafi
I tried broadcasting a key-value RDD, but then I cannot perform any RDD actions inside a map/foreach function of another RDD. Any tips? If I go through plain Scala collections I end up with huge memory bottlenecks. Saif
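
One common pattern, sketched here under the assumption that the lookup table fits in driver memory (names and value types are hypothetical): collect the lookup RDD to a Map on the driver, broadcast that Map, and read it inside the closure instead of referencing the other RDD:

  val lookupMap = lookupRdd.collectAsMap()    // RDD[(String, Double)] -> Map on the driver
  val bcLookup = sc.broadcast(lookupMap)      // one read-only copy per executor
  val enriched = bigRdd.map { case (k, v) => (k, v, bcLookup.value.getOrElse(k, 0.0)) }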

Accumulator of rows?

2015-10-01 Thread Saif.A.Ellafi
Hi all, I need to repeat a couple of rows from a dataframe n times each. To do so, I plan to create a new DataFrame, but I am unable to find a way to accumulate "Rows" somewhere; as this might get huge, I can't accumulate them into a mutable Array, I think? Thanks, Saif
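
A minimal sketch of one way to do this without accumulating rows on the driver, essentially the same flatMap idea that appears in the "Spark standalone hangup during shuffle flatMap or explode in cluster" thread above (the repetition-count column "times" is hypothetical):

  val idx = df.columns.indexOf("times")
  val repeated = df.rdd.flatMap(r => Seq.fill(r.getInt(idx))(r))
  val repeatedDF = sqlContext.createDataFrame(repeated, df.schema)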

What is the best way to submit multiple tasks?

2015-09-30 Thread Saif.A.Ellafi
Hi all, I have a process where I do some calculations on each one of the columns of a dataframe. Intrinsically, I run across each column with a for loop. On the other hand, each process itself is not entirely distributable. To speed things up, I would like to submit a spark program for
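
One hedged option: a single SparkContext accepts jobs from multiple driver threads, so the per-column work can be submitted concurrently, for example with a parallel collection. A minimal sketch (computeStats is a hypothetical per-column function):

  df.columns.par.foreach { c =>
    val stats = computeStats(df.select(c))   // each call submits its own Spark job
    println(s"$c -> $stats")
  }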

Can't perform full outer join

2015-09-29 Thread Saif.A.Ellafi
Hi all, So I have two dataframes, with two columns: DATE and VALUE. Performing this: data = data.join(cur_data, data("DATE") === cur_data("DATE"), "outer") returns Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'DATE' is ambiguous, could be: DATE#0, DATE#3.; But
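
A minimal sketch of one common way around the ambiguity, aliasing each side so every reference to DATE can be qualified (the aliases "a" and "b" are arbitrary):

  val joined = data.as("a").join(cur_data.as("b"), $"a.DATE" === $"b.DATE", "outer")
  // afterwards, refer to the columns explicitly, e.g. $"a.DATE", $"a.VALUE", $"b.VALUE"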

Where can I learn how to write udf?

2015-09-14 Thread Saif.A.Ellafi
Hi all, I am failing to find a proper guide or tutorial on how to write proper udf functions in scala. Appreciate the effort-saving, Saif
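
A minimal example of defining and applying a DataFrame UDF, as a starting point (the column names are hypothetical):

  import org.apache.spark.sql.functions.udf
  val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
  val out = df.withColumn("name_upper", toUpper(df("name")))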

Compress JSON dataframes

2015-09-08 Thread Saif.A.Ellafi
Hi, I am trying to figure out a way to compress df.write.json() but have not been successful, even changing spark.io.compression. Any thoughts? Saif

RE: What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Thank you, so I was inversely confused. At first I thought MLLIB was the future, but based on what you say, MLLIB will be the past. Interesting. This means that if I go forward using the pipelines system, I shouldn't become obsolete. Any more insights welcome, Saif -Original Message-

What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Hi all, I am a little bit confused, having been introduced to Spark only recently. What is going on with ML? Is it going to be deprecated? Or are all of its features valid and being built upon? It has a set of features and ML constructors which I would like to use, but I need to confirm whether the future of

Best way to filter null on any column?

2015-08-27 Thread Saif.A.Ellafi
Hi all, What would be a good way to filter rows in a dataframe where any value in the row is null? I wouldn't want to go through each column manually. Thanks, Saif
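
A short sketch using the DataFrame null-handling functions, which cover exactly this case:

  val clean = df.na.drop()   // drops every row that has a null in any column
  // or, to look at the offending rows instead of dropping them:
  val withNulls = df.filter(df.columns.map(c => df(c).isNull).reduce(_ || _))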

RE: Help! Stuck using withColumn

2015-08-27 Thread Saif.A.Ellafi
Hello, thank you for the response. I found a blog where a guy explains that it is not possible to join columns from different data frames. I was trying to modify one column’s information, so I selected it and then tried to replace the original dataframe column. I found another way. Thanks Saif

Is there a way to store RDD and load it with its original format?

2015-08-27 Thread Saif.A.Ellafi
Hi, Any way to store/load RDDs keeping their original object instead of string? I am having trouble with parquet (there is always some error at class conversion), and don't use hadoop. Looking for alternatives. Thanks in advance Saif

Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
This simple comand call: val final_df = data.select(month_balance).withColumn(month_date, data.col(month_date_curr)) Is throwing: org.apache.spark.sql.AnalysisException: resolved attribute(s) month_date_curr#324 missing from month_balance#234 in operator !Project [month_balance#234,

RE: Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
I can reproduce this even more simply with the following: val gf = sc.parallelize(Array(3,6,4,7,3,4,5,5,31,4,5,2)).toDF("ASD") val ff = sc.parallelize(Array(4,6,2,3,5,1,4,6,23,6,4,7)).toDF("GFD") gf.withColumn("DSA", ff.col("GFD")) org.apache.spark.sql.AnalysisException: resolved attribute(s) GFD#421

Scala: Overload method by its class type

2015-08-25 Thread Saif.A.Ellafi
Hi all, I have SomeClass[TYPE] { def some_method(args: fixed_type_args): TYPE } and at runtime I create instances of this class with different AnyVal + String types, but the return type of some_method varies. I know I could do this with an implicit object IF some_method received a type, but

Want to install lz4 compression

2015-08-21 Thread Saif.A.Ellafi
Hi all, I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not among Hadoop's native libraries, so I am not able to use it. Can anyone suggest how to proceed? Hopefully I won't have to recompile Hadoop. I tried changing the --driver-library-path to point directly to the lz4 stand

Set custom worker id?

2015-08-21 Thread Saif.A.Ellafi
Hi, Is it possible in standalone mode to set up custom worker ID names, to avoid the worker-19248891237482379-ip..-port ?? Thanks, Saif

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
Hi, thank you all for the assistance. It is odd: it works when creating a new java.math.BigDecimal object, but not if I work directly with a Scala value: 5 match { case x: java.math.BigDecimal => 2 } console:23: error: scrutinee is incompatible with pattern type; found : java.math.BigDecimal required:

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
It is okay. this methodology works very well for mapping objects of my Seq[Any]. It is indeed very cool :-) Saif From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, August 19, 2015 10:47 AM To: Ellafi, Saif A. Cc: sujitatgt...@gmail.com; wrbri...@gmail.com; user@spark.apache.org Subject:

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, thank you for the further assistance. You can reproduce this by simply running 5 match { case java.math.BigDecimal => 2 } In my personal case, I am applying a map action to a Seq[Any], so the elements inside are of type Any, to which I need to apply a proper .asInstanceOf[WhoYouShouldBe]. Saif

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, Can you please elaborate? I am confused :-) Saif -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, August 18, 2015 5:15 PM To: Ellafi, Saif A. Cc: wrbri...@gmail.com; user@spark.apache.org Subject: Re: Scala: How to match a java object On Tue,

Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi all, I am trying to run a spark job in which I receive java.math.BigDecimal objects, instead of the scala equivalents, and I am trying to convert them into Doubles. If I try to match-case on this object's class, I get: error: object java.math.BigDecimal is not a value How could I get around
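
For reference, the error comes from matching on the class name itself as if it were a value; a typed pattern binds the instance instead. A minimal sketch of the conversion being described:

  def toDouble(x: Any): Double = x match {
    case bd: java.math.BigDecimal => bd.doubleValue
    case d: Double                => d
    case other                    => other.toString.toDouble
  }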

Cannot cast to Tuple when running in cluster mode

2015-08-14 Thread Saif.A.Ellafi
Hi All, I have a working program in which I create two big Tuple2s out of the data. This seems to work locally, but when I switch over to cluster standalone mode, I get this error at the very beginning: 15/08/14 10:22:25 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 10, 162.101.194.44):

Help with persist: Data is requested again

2015-08-14 Thread Saif.A.Ellafi
Hello all, I am writing a program which reads from a database. I run a couple of computations, but at the end I have a while loop in which I make a modification to the persisted data, e.g.: val data = PairRDD... persist() var i = 0 while (i < 10) { val data_mod = data.map(_._1 + 1, _._2)

Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Hi all, I don't have any hadoop fs installed on my environment, but I would like to store dataframes in parquet files. I am failing to do so, if possible, anyone have any pointers? Thank you, Saif

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I am launching my spark-shell with spark-1.4.1-bin-hadoop2.6/bin/spark-shell 15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF scala> data.write.parquet("/var/data/Saif/pq")

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From: Ellafi, Saif A. Sent: Tuesday, August 11, 2015 12:01 PM To: Ellafi, Saif A.; deanwamp...@gmail.com Cc: user@spark.apache.org Subject: RE: Parquet without hadoop: Possible? Sorry,

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Sorry, I provided bad information. This example worked fine with reduced parallelism. It seems my problem has to do with something specific to the real data frame at reading time. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Tuesday, August 11, 2015

Does print/event logging affect performance?

2015-08-11 Thread Saif.A.Ellafi
Hi all, silly question. Does logging info messages, whether printed or written to file, or event logging, have any impact on the general performance of Spark? Saif

Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hi, A silly question here. The driver web UI dies when the spark-submit program finishes. I would like some time to analyze the job after the program ends, but the page does not refresh itself; when I hit F5 I lose all the info. Thanks, Saif

RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hello, thank you, but that port is unreachable for me. Can you please share where can I find that port equivalent in my environment? Thank you Saif From: François Pelletier [mailto:newslett...@francoispelletier.org] Sent: Friday, August 07, 2015 4:38 PM To: user@spark.apache.org Subject: Re:

Possible bug: JDBC with Speculative mode launches orphan queries

2015-08-07 Thread Saif.A.Ellafi
Hello, When enabling speculation, my first job launches a partitioned JDBC dataframe query, in which some partitions take longer than others to respond. This triggers speculation and launches the query again on new nodes. When one of those nodes finishes the query, the speculative one remains

RE: Too many open files

2015-07-29 Thread Saif.A.Ellafi
Thank you both, I will take a look, but 1. For high-shuffle tasks, is it right for the system to have the limits and thresholds that high? I hope there are no bad consequences. 2. I will try to work around admin access and see if I can get anywhere with only user rights From: Ted Yu

Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Hi all, I am experimenting with and learning about performance on big tasks locally, with a 32-core node and more than 64GB of RAM; data is loaded from a database through a JDBC driver, and heavy computations are launched against it. I am presented with two questions: 1. My RDD is poorly distributed. I

RE: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Thank you for your response, Zhen. I am using a vendor-specific JDBC driver JAR file (honestly I don’t know where it came from). Its API is NOT like JdbcRDD; instead, it is more like the jdbc method of DataFrameReader

CPU Parallelization not being used (local mode)

2015-07-27 Thread Saif.A.Ellafi
Hi all, I would like some insight. I am currently computing huge databases, and playing with monitoring and tuning. When monitoring the multiple cores I have, I see that even when RDDs are parallelized, computation on the RDD jumps from core to core sporadically (I guess, depending on where

[MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Saif.A.Ellafi
I tried with an RDD[DenseVector], but the RDD type parameter is invariant (not +T), so an RDD[DenseVector] is not an RDD[Vector], and I can't use the RDD input method of correlation. Thanks, Saif
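
A minimal sketch of the usual workaround: upcast each element to the Vector interface with a map, so the RDD matches corr's signature (denseRDD is the assumed input RDD[DenseVector]):

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.stat.Statistics
  val vecRDD: org.apache.spark.rdd.RDD[Vector] = denseRDD.map(v => v: Vector)
  val corrMatrix = Statistics.corr(vecRDD, "pearson")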

RE: Select all columns except some

2015-07-17 Thread Saif.A.Ellafi
Hello, thank you for your time. Seq[String] works perfectly fine. I also ran a for loop through all the elements to see if access to any value was broken, but no, they are alright. For now, I solved it properly by calling this. Sadly, it takes a lot of time, but it works: var data_sas =

Select all columns except some

2015-07-16 Thread Saif.A.Ellafi
Hi, In a hundred-column dataframe, I wish to either select all of them except some, or drop the ones I don't want. I am failing at this simple task. I tried two ways: val clean_cols = df.columns.filterNot(col_name => col_name.startsWith("STATE_")).mkString(", ") df.select(clean_cols) But this throws
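
If the mkString variant above is what was run, select receives the whole comma-joined string as a single (non-existent) column name. A minimal sketch of passing the surviving names as individual columns instead:

  import org.apache.spark.sql.functions.col
  val keep = df.columns.filterNot(_.startsWith("STATE_"))
  val slim = df.select(keep.map(col): _*)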

RE: java.io.InvalidClassException

2015-07-13 Thread Saif.A.Ellafi
Thank you, extending Serializable solved the issue. I am left with more questions than answers though :-). Regards, Saif From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Monday, July 13, 2015 2:49 PM To: Ellafi, Saif A. Cc: user@spark.apache.org; Liu, Weicheng Subject: Re:

RE: java.io.InvalidClassException

2015-07-13 Thread Saif.A.Ellafi
Thank you very much for your time, here is how I designed the case classes; as far as I know, they apply properly. P.S.: By the way, what do you mean by “The programming guide”? abstract class Validator { // positions to access with Row.getInt(x) val shortsale_in_pos = 10 val

java.io.InvalidClassException

2015-07-13 Thread Saif.A.Ellafi
Hi, For an experiment I am doing, I am trying to do the following: 1. Created an abstract class Validator, and created case objects extending Validator with a validate(row: Row): Boolean method. 2. Added all the case objects to a list. 3. Each validate takes a Row into account, returns itself if validate

MLLIB RDD segmentation for logistic regression

2015-07-13 Thread Saif.A.Ellafi
Hello all, I have one big RDD, in which there is a column of groups A1, A2, B1, B2, B3, C1, D1, ..., XY. Out of it, I am using map() to transform into an RDD[LabeledPoint] with dense vectors for later use in Logistic Regression, which takes RDD[LabeledPoint]. I would like to run a logistic
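
A minimal sketch of one straightforward (not necessarily the fastest) way to fit one model per group: keep the group key alongside each point, then filter and train per group (the RDD shape and group keys are assumptions):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.regression.LabeledPoint
  // byGroup: RDD[(String, LabeledPoint)]
  val groups = byGroup.keys.distinct().collect()
  val models = groups.map { g =>
    val part = byGroup.filter(_._1 == g).values.cache()
    g -> new LogisticRegressionWithLBFGS().run(part)
  }.toMap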

How to deal with null values on LabeledPoint

2015-07-07 Thread Saif.A.Ellafi
Hello, reading from spark-csv, I got some lines with missing data (not invalid). I am applying map() to create a LabeledPoint with a dense vector, using map(row => row.getDouble(col_index)). Up to this point: res173: org.apache.spark.mllib.regression.LabeledPoint =
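
A minimal sketch of one way to keep nulls out of the LabeledPoints: test Row.isNullAt before reading the values (the label/feature column indices are hypothetical):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint
  val featureCols = Array(1, 2, 3)
  val labeled = df.rdd
    .filter(r => !r.isNullAt(0) && featureCols.forall(i => !r.isNullAt(i)))
    .map(r => LabeledPoint(r.getDouble(0), Vectors.dense(featureCols.map(r.getDouble))))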

Spark-csv into labeled points with null values

2015-07-03 Thread Saif.A.Ellafi
Hello all, I am learning Scala Spark and going through some applications with data I have. Please allow me to ask a couple of questions: spark-csv: The data I have ain't malformed, but there are empty values in some rows, properly comma-separated and not caught by DROPMALFORMED mode. These