setJobGroup inside mapped futures

2017-06-28 Thread Saif.A.Ellafi
Hi, I have a question here! I am setting a setJobGroup to be cancelled within a Future. The thing is, this future is affected by several map, flatMap, for comprehensions and even andThen side effects. I am unable to cancel the job properly within a group, and I am unsure whether this is
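A minimal sketch of the pattern under discussion, assuming a spark-shell style session where sc is in scope; the group id and the job body are invented for illustration:

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    val groupId = "cancellable-group"   // hypothetical id
    val f = Future {
      // setJobGroup is kept in thread-local properties, so it has to run on the
      // same thread that triggers the action, i.e. inside the Future body itself
      sc.setJobGroup(groupId, "long running job", interruptOnCancel = true)
      sc.parallelize(1 to 1000000).map(_ * 2).count()
    }

    // from any other thread, e.g. a timeout handler:
    sc.cancelJobGroup(groupId)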

RE: Do we really need Mesos or YARN? Or is standalone sufficient?

2016-12-16 Thread Saif.A.Ellafi
In my experience, standalone works very well in a small cluster where there isn’t anything else running. In a bigger cluster or with shared resources, YARN takes the win, outweighing the overhead of spawning containers as opposed to an always-running background worker. Best is if you try both; if standalone is good e

Spark with YARN

2016-12-13 Thread Saif.A.Ellafi
Hello all, I am trying to set up a Spark interpreter that hits YARN on Zeppelin. We have both HDP and standalone Spark running. Standalone Spark works fine, but it fails if the interpreter is set to yarn master (YARN log): Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorL

Help sqoop import parquet hive

2016-12-08 Thread Saif.A.Ellafi
Hello all, I am currently struggling a lot to ingest data from Teradata into HDFS Hive in Parquet format. 1. I was expecting sqoop to create the tables automatically, but then I get an error: Import Hive table's column schema is missing. 2. Instead of import, I troubleshot and ju

Sqoop import from Teradata

2016-12-07 Thread Saif.A.Ellafi
Hello, One of the Teradata databases I am trying to import from does not allow setting the transaction level; I get java.sql.SQLException: [AsterData][JDBC](11975) Unsupported transaction isolation level: 2. I know I can try to change the level here, but the server database simply does not a

Cluster deploy mode driver location

2016-11-21 Thread Saif.A.Ellafi
Hello there, I have a Spark program in 1.6.1; however, when I submit it to the cluster, it randomly picks the driver node. I know there is a driver specification option, but along with it I am required to define many other options I am not familiar with. The trouble is, the .jars I am launching need

RE: Applying a limit after orderBy of big dataframe hangs spark

2016-08-05 Thread Saif.A.Ellafi
Hi thanks for the assistance, 1. Standalone 2. df.orderBy(field).limit(5000).write.parquet(...) From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, August 05, 2016 4:33 PM To: Ellafi, Saif A. Cc: user @spark Subject: Re: Applying a limit after orderBy of big dataf

Applying a limit after orderBy of big dataframe hangs spark

2016-08-05 Thread Saif.A.Ellafi
Hi all, I am working with a 1.5 billion row dataframe in a small cluster and trying to apply an orderBy operation on one of the Long-typed columns. If I limit that output to some number, say 5 million, then trying to count, persist or store the dataframe makes Spark crash, losing executors a

multiple SPARK_LOCAL_DIRS causing strange behavior in parallelism

2016-07-29 Thread Saif.A.Ellafi
Hi all, I have been playing around with SPARK_LOCAL_DIRS in spark-env in order to add additional shuffle storage. But since I did this, I am getting a "too many open files" error if the total executor cores is high. I am also seeing low parallelism when monitoring the running tasks on some big

RE: Strategies for properly load-balanced partitioning

2016-06-03 Thread Saif.A.Ellafi
Appreciate the follow-up. I am not entirely sure how or why my question is related to bucketization capabilities. It indeed sounds like a powerful feature to avoid shuffling, but in my case, I am referring to straightforward processes of reading data and writing to parquet. If bucket tables a

Strategies for properly load-balanced partitioning

2016-06-03 Thread Saif.A.Ellafi
Hello everyone! I was noticing that, when reading parquet files or actually any kind of source dataframe data (spark-csv, etc), the default partitioning is not fair. Action tasks usually run very fast on some partitions and very slowly on others, and frequently fast on all but the last partiti

spark-submit not adding application jar to classpath

2016-04-18 Thread Saif.A.Ellafi
Hi, I am submitting a jar file to spark-submit which has some content inside src/main/resources. I am unable to access those resources, since the application jar is not being added to the classpath. This works fine if I also include the application jar in the --driver-class-path entry. Is this

RE: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Saif.A.Ellafi
Appreciated Michael, but this doesn’t help my case, the filter string is being submitted from outside my program, is there any other alternative? some literal string parser or anything I can do before? Saif From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, April 13, 2016 6

Strange bug: Filter problem with parenthesis

2016-04-13 Thread Saif.A.Ellafi
Hi, I am debugging a program, and for some reason, a line calling the following is failing: df.filter("sum(OpenAccounts) > 5").show It says it cannot find the column OpenAccounts, as if it were applying the sum() function and looking for a column with that name, which does not exist. This work
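The reply with the suggested fix is not quoted here; one hedged workaround, assuming the DataFrame really has a column literally named sum(OpenAccounts) left over from an aggregation, is to escape the name or bypass the SQL string parser:

    // backticks keep the parser from reading sum(...) as a function call
    df.filter("`sum(OpenAccounts)` > 5").show()

    // or resolve the column by name through the Column API instead of a SQL string
    df.filter(df("sum(OpenAccounts)") > 5).show()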

Can we load csv partitioned data into one DF?

2016-02-22 Thread Saif.A.Ellafi
Hello all, I am facing a silly data question. If I have +100 csv files which are part of the same data, but each csv is for example, a year on a timeframe column (i.e. partitioned by year), what would you suggest instead of loading all those files and joining them? Final target would be parquet.
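One common approach, sketched under the assumption that the spark-csv package (Spark 1.x) is on the classpath and that the per-year files share a schema; the paths are hypothetical:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/csv/*.csv")           // a glob reads all the yearly files as one DataFrame

    df.write.parquet("/data/parquet/all_years")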

RE: jobs much slower in cluster mode vs local

2016-01-15 Thread Saif.A.Ellafi
Thank you, this looks useful indeed for what I have in mind. Saif From: Jiří Syrový [mailto:syrovy.j...@gmail.com] Sent: Friday, January 15, 2016 12:06 PM To: Ellafi, Saif A. Cc: user@spark.apache.org Subject: Re: jobs much slower in cluster mode vs local Hi, you can try to use spark job server

jobs much slower in cluster mode vs local

2016-01-15 Thread Saif.A.Ellafi
Hello, In general, I am usually able to run spark-submit jobs in local mode, on a 32-core node with plenty of RAM. The performance is significantly faster in local mode than when using a cluster of Spark workers. How can this be explained, and what measures can one take in order to impro

Big data job only finishes with Legacy memory management

2016-01-12 Thread Saif.A.Ellafi
Hello, I am tinkering with Spark 1.6. I have a 1.5 billion row dataset, to which I apply several window functions such as lag, first, etc. The job is quite expensive; I am running a small cluster with executors running with 70GB of RAM. Using the new memory management system, the job fails around
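For reference, switching Spark 1.6 back to the pre-1.6 memory manager is a configuration flag; a small sketch, with the legacy fractions shown only as example values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.useLegacyMode", "true")      // fall back to the 1.5-style static manager
      .set("spark.storage.memoryFraction", "0.6")     // legacy knobs apply again
      .set("spark.shuffle.memoryFraction", "0.2")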

RE: Is Spark 1.6 released?

2016-01-04 Thread Saif.A.Ellafi
Where can I read more about the dataset api on a user layer? I am failing to get an API doc or understand when to use DataFrame or DataSet, advantages, etc. Thanks, Saif -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Monday, January 04, 2016 2:01 PM To: u

Limit of application submission to cluster

2015-12-18 Thread Saif.A.Ellafi
Hello everyone, I am testing some parallel program submission to a stand alone cluster. Everything works alright, the problem is, for some reason, I can't submit more than 3 programs to the cluster. The fourth one, whether legacy or REST, simply hangs until one of the first three completes. I a

ML - LinearRegression: is this a bug ????

2015-12-02 Thread Saif.A.Ellafi
Data: +---++ | label|features| +---++ |0.13271745268556925|[-0.2006809895664...| |0.23956421080605234|[-0.0938342314459...| |0.47464690691431843|[0.14124846466227...| | 0.0941426858669834|[-0.2392557563850...

Strategy for large amount of small tasks

2015-12-02 Thread Saif.A.Ellafi
Hello all, I am running a cluster of 3-4 nodes under YARN. I have a small dataset (~500k rows) but a huge amount of internal tasks, for example looping over different segments of the data and running many computations inside each. It looks like strategies such as disabling serialization, and increasing the am

RE: Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Saif.A.Ellafi
Hi, just a couple of cents. Are your joining columns StringTypes (id field)? I have recently reported a bug involving inconsistent results when filtering String fields in group operations. Saif From: Colin Alstad [mailto:colin.als...@pokitdok.com] Sent: Wednesday, October 28, 2015 12:39 PM To:

Results change in group by operation

2015-10-26 Thread Saif.A.Ellafi
Hello everyone, I need urgent help with a data consistency issue I am having. Standalone cluster of five servers. sqlContext is an instance of HiveContext (the default in spark-shell). No special options other than driver memory and executor memory. Parquet partitions are 512 where there are 160 core

Kryo makes String data invalid

2015-10-26 Thread Saif.A.Ellafi
Hi all, I have a parquet file which I am loading in a shell. When I launch the shell with --driver-java-options="-Dspark.serializer=...kryo", a couple of fields look like: 03-?? ??-?? ??-??? when calling > data.first I will confirm shortly, but I am quite sure it happens only

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Never mind my last email. res2 is filtered, so my test does not make sense. The issue is not reproduced there. I have the problem somewhere else. From: Ellafi, Saif A. Sent: Thursday, October 22, 2015 12:57 PM To: 'Xiao Li' Cc: user Subject: RE: Spark groupby and agg inconsistent and missing data T

RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Thanks, sorry I cannot share the data, and I am not sure how significant it would be for you. I am reproducing the issue on a smaller piece of the content to see whether I can find a reason for the inconsistency. val res2 = data.filter($"closed" === $"ever_closed").groupBy("product", "band ", "aget",

Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Hello everyone, I am doing some analytics experiments on a 4-server standalone cluster in a spark-shell, mostly involving a huge database with groupBy and aggregations. I am picking 6 groupBy columns and returning various aggregated results in a dataframe. The groupBy fields are of two types, m

How to speed up reading from file?

2015-10-16 Thread Saif.A.Ellafi
Hello, Is there an optimal number of partitions per number of rows, when writing into disk, so we can re-read later from source in a distributed way? Any thoughts? Thanks Saif

thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Saif.A.Ellafi
Hi, Is it possible to load a spark-shell, in which we do any number of operations in a dataframe, then register it as a temporary table and get to see it through thriftserver? ps. or even better, submit a full job and store the dataframe in thriftserver in-memory before the job completes. I ha
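This is possible with HiveThriftServer2.startWithContext; a rough sketch assuming a spark-shell whose sqlContext is a HiveContext, with the source path and table name invented:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hc = sqlContext.asInstanceOf[HiveContext]
    val df = hc.read.parquet("/some/path")        // hypothetical source
    df.registerTempTable("my_temp_table")

    // expose this shell's in-memory temporary tables over JDBC/ODBC
    HiveThriftServer2.startWithContext(hc)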

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Thanks, I missed that one. From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, October 13, 2015 2:36 PM To: Ellafi, Saif A. Cc: user@spark.apache.org Subject: Re: Spark shuffle service does not work in stand alone You have to manually start the shuffle service if you're not running Y

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
I believe the confusion here is self-answered. The thing is that in the documentation, the Spark shuffle service runs only under YARN, while here we are speaking about a standalone cluster. The proper question is: how do you launch a shuffle service for standalone? Saif From: saif.a.ell...@wellsf
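A sketch of what that usually involves, offered as an assumption rather than the thread's confirmed answer: the application asks for the service, and each standalone worker machine must actually be running one (for instance via the sbin/start-shuffle-service.sh script shipped with Spark):

    import org.apache.spark.SparkConf

    // application side
    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.enabled", "true")
    // cluster side: in standalone mode the external shuffle service is not started for you;
    // launch it on every worker node before submitting jobs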

RE: Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Hi, thanks Executors are simply failing to connect to a shuffle server: 15/10/13 08:29:34 INFO BlockManagerMaster: Registered BlockManager 15/10/13 08:29:34 INFO BlockManager: Registering executor with local external shuffle service. 15/10/13 08:29:34 ERROR BlockManager: Failed to connect to ext

Spark shuffle service does not work in stand alone

2015-10-13 Thread Saif.A.Ellafi
Has anyone tried the shuffle service in standalone cluster mode? I want to enable it for dynamic allocation, but my jobs never start when I submit them. This happens with all my jobs. 15/10/13 08:29:45 INFO DAGScheduler: Job 0 failed: json at DataLoader.scala:86, took 16.318615 s Exception in thread "main" org.

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Yes, but I mean, this is rather curious. How does def lit(literal: Any) become a percentile function, as in lit(25)? Thanks for the clarification. Saif From: Umesh Kacha [mailto:umesh.ka...@gmail.com] Sent: Friday, October 09, 2015 4:10 PM To: Ellafi, Saif A. Cc: Michael Armbrust; user Subject: Re: How to c

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Where can we find other available functions such as lit() ? I can’t find lit in the api. Thanks From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, October 09, 2015 4:04 PM To: unk1102 Cc: user Subject: Re: How to calculate percentile of a column of DataFrame? You can use callU
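For context, the pattern being discussed looks roughly like the sketch below (the column name is hypothetical): lit simply wraps a constant into a Column so it can be passed as an argument to a Hive percentile UDAF through callUDF, which is also why a HiveContext is needed:

    import org.apache.spark.sql.functions.{callUDF, lit}

    // approximate 25th percentile of a numeric column, via Hive's percentile_approx UDAF
    val p25 = df.select(callUDF("percentile_approx", df("value"), lit(0.25)))
    p25.show()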

Best storage format for intermediate process

2015-10-09 Thread Saif.A.Ellafi
Hi all, I am in the process of learning big data. Right now, I am bringing huge databases into Spark through JDBC (a 250 million row table can take around 3 hours), and then re-saving them as JSON, which is fast, simple, distributed, fail-safe and stores data types, although without any compressi

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Repartition and default parallelism to 1, in cluster mode, is still broken. So the problem is not the parallelism, but the cluster mode itself. Something wrong with HiveContext + cluster mode. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
It turns out this does not happen in local[32] mode. It only happens when submitting to a standalone cluster. I don’t have YARN/Mesos to compare. Will keep diagnosing. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October 08, 2015 3:01 PM To: mich...@data

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi, thanks for looking into it. v1.5.1. I am really worried. I don't have Hive/Hadoop for real in the environment. Saif From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, October 08, 2015 2:57 PM To: Ellafi, Saif A. Cc: user Subject: Re: RowNumber in HiveContext returns null or ne

RE: RowNumber in HiveContext returns null, negative numbers or huge

2015-10-08 Thread Saif.A.Ellafi
Hi, I have figured this only happens in cluster mode. working properly in local[32] From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October 08, 2015 10:23 AM To: dev@spark.apache.org Subject: RowNumber in HiveContext returns null, negative numbers or huge

RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi all, would this be a bug?? val ws = Window. partitionBy("clrty_id"). orderBy("filemonth_dtt") val nm = "repeatMe" df.select(df.col("*"), rowNumber().over(ws).cast("int").as(nm)) stacked_data.filter(stacked_data("repeatMe").isNotNull).or

RowNumber in HiveContext returns null, negative numbers or huge

2015-10-08 Thread Saif.A.Ellafi
Hi all, would this be a bug?? val ws = Window. partitionBy("clrty_id"). orderBy("filemonth_dtt") val nm = "repeatMe" df.select(df.col("*"), rowNumber().over(ws).cast("int").as(nm)) stacked_data.filter(stacked_data("repeatMe").isNotNull).or

RE: Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
It can be large, yes. But that still does not resolve the question of why it works in a smaller environment, i.e. local[32], or in cluster mode when using SQLContext instead of HiveContext. The process, in general, is a rowNumber() HiveQL operation; that is why I need HiveContext. I have the feeli

Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
When running stand-alone cluster mode job, the process hangs up randomly during a DataFrame flatMap or explode operation, in HiveContext: -->> df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r) This does not happen either with SQLContext in cluster, or Hive/SQL in local mode, where it works

Help with big data operation performance

2015-10-06 Thread Saif.A.Ellafi
Hi all, In a standalone cluster operation, with more than 80 GB of RAM in each node, I am trying to: 1. load a partitioned JSON dataframe which weighs around 100GB as input 2. apply transformations such as casting some column types 3. get some percentiles, which involves sort by, rdd

HiveContext in standalone mode: shuffle hang ups

2015-10-05 Thread Saif.A.Ellafi
Hi all, I have a process where local mode takes only 40 seconds, while the same process in standalone mode, with the node used for local mode being the only available node, takes forever. RDD actions hang up. I could only "sort this out" by turning speculation on, so the same task hanging is

Please help: Processes with HiveContext slower in cluster

2015-10-05 Thread Saif.A.Ellafi
Hi, I have a HiveContext job which takes less than 1 minute to complete in local mode with 16 cores. However, when I launch it over a standalone cluster, it takes forever and probably can't even finish, even when the only node running is the same one on which I execute it locally. How could I diag

How to change verbosity level and redirect verbosity to file?

2015-10-05 Thread Saif.A.Ellafi
Hi, I would like to read the full spark-submit log once a job is completed, since the output is not stdout, how could I redirect spark output to a file in linux? Thanks, Saif

RE: how to broadcast huge lookup table?

2015-10-02 Thread Saif.A.Ellafi
Hi, thank you I would prefer to leave writing-to-disk as a last resort. Is it a last resort? Saif From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, October 02, 2015 3:54 PM To: Ellafi, Saif A. Cc: user Subject: Re: how to broadcast huge lookup table? Have you considered using external sto

how to broadcast huge lookup table?

2015-10-02 Thread Saif.A.Ellafi
I tried broadcasting a key-value rdd, but then I cannot perform any rdd-actions inside a map/foreach function of another rdd. any tips? If going into scala collections I end up with huge memory bottlenecks. Saif
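A common pattern, sketched with lookupRdd, dataRdd and the default value all hypothetical: collect the lookup RDD to the driver as a Map, broadcast the Map, and read it inside the other RDD's closures, since RDD actions can never run inside another RDD's functions:

    val lookup = sc.broadcast(lookupRdd.collectAsMap())     // lookupRdd: RDD[(String, Double)]

    val enriched = dataRdd.map { case (k, v) =>
      (k, v, lookup.value.getOrElse(k, 0.0))                 // 0.0 as an arbitrary default
    }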

from groupBy return a DataFrame without aggregation?

2015-10-02 Thread Saif.A.Ellafi
Given ID, DATE, I need all sorted dates per ID, what is the easiest way? I got this but I don't like it: val l = zz.groupBy("id", " dt").agg($"dt".as("dummy")).sort($"dt".asc) Saif

RE: Accumulator of rows?

2015-10-02 Thread Saif.A.Ellafi
Thank you, exactly what I was looking for. I have read of it before but never associated. Saif From: Adrian Tanase [mailto:atan...@adobe.com] Sent: Friday, October 02, 2015 8:24 AM To: Ellafi, Saif A.; user@spark.apache.org Subject: Re: Accumulator of rows? Have you seen window functions? https

Accumulator of rows?

2015-10-01 Thread Saif.A.Ellafi
Hi all, I need to repeat a couple of rows from a dataframe n times each. To do so, I plan to create a new DataFrame, but I am unable to find a way to accumulate "Rows" somewhere; as this might get huge, I can't accumulate into a mutable Array, I think? Thanks, Saif
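One way to avoid accumulating rows on the driver at all (a sketch; the repeat-count column "times" is hypothetical) is to let the executors do the repetition with flatMap and rebuild a DataFrame from the result:

    val idx = df.columns.indexOf("times")
    val repeatedRows = df.flatMap(r => (1 to r.getInt(idx)).map(_ => r))   // RDD[Row]
    val repeatedDF = sqlContext.createDataFrame(repeatedRows, df.schema)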

What is the best way to submit multiple tasks?

2015-09-30 Thread Saif.A.Ellafi
Hi all, I have a process where I do some calculations on each one of the columns of a dataframe. Intrinsically, I run across each column with a for loop. On the other hand, each process itself is not entirely distributable. To speed up the process, I would like to submit a spark program for eac
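One option, sketched under the assumption that a single SparkContext is reused rather than one program per column: Spark can run several jobs concurrently, so wrapping each per-column computation in a Future turns the sequential for loop into parallel submissions (columnStats is a made-up placeholder):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    val futures = df.columns.toSeq.map { c =>
      Future { (c, columnStats(df.select(c))) }   // each Future triggers its own Spark job
    }
    val results = Await.result(Future.sequence(futures), Duration.Inf)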

Cant perform full outer join

2015-09-29 Thread Saif.A.Ellafi
Hi all, So I have two dataframes, each with two columns: DATE and VALUE. Performing this: data = data.join(cur_data, data("DATE") === cur_data("DATE"), "outer") returns Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'DATE' is ambiguous, could be: DATE#0, DATE#3.; But i
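One way around the ambiguity, sketched in the style of the snippet above (assuming $ comes from sqlContext.implicits): alias the two DataFrames so each DATE reference is qualified:

    import sqlContext.implicits._

    val a = data.as("a")
    val b = cur_data.as("b")

    val joined = a.join(b, $"a.DATE" === $"b.DATE", "outer")
      .select($"a.DATE", $"a.VALUE".as("VALUE_LEFT"), $"b.VALUE".as("VALUE_RIGHT"))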

RowMatrix tallSkinnyQR - ERROR: Second call to constructor of static parser

2015-09-22 Thread Saif.A.Ellafi
Hi all, wondering if anyone could make the new 1.5.0 tallSkinnyQR work. My output follows, which is a big loop of the same errors until the shell dies. I am curious since I'm failing to load any implementations from BLAS, LAPACK, etc. scala> mat.tallSkinnyQR(false) 15/09/22 10:18:11 WARN LAPACK:

RE: ML: embed a transformer

2015-09-14 Thread Saif.A.Ellafi
Thank you, I will do as you suggested. Ps: I read that in this random user archive I found: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3c55709f7b.2090...@gmail.com%3E Saif From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 4:08 PM To: Ell

ML: embed a transformer

2015-09-14 Thread Saif.A.Ellafi
Hi all, I'm very new to Spark and looking forward to getting deep into the topic. Right now I am trying to write my own transformer by inheritance; from what I have read so far, it is not very clear that we, as "users", are meant to do this. I am defining my transformer based on the Binarizer, but simply fail

Where can I learn how to write udf?

2015-09-14 Thread Saif.A.Ellafi
Hi all, I am failing to find a proper guide or tutorial on how to write proper UDF functions in Scala. Appreciate the effort-saving, Saif
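For what it's worth, the two usual flavors in Scala look roughly like this (column and function names are made up):

    import org.apache.spark.sql.functions.udf

    // DataFrame-side: lift a plain Scala function into a Column function
    val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
    df.select(toUpper(df("name")).as("name_upper"))

    // SQL-side: register it so it can be used inside sqlContext.sql("...") strings
    sqlContext.udf.register("to_upper", (s: String) => s.toUpperCase)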

Compress JSON dataframes

2015-09-08 Thread Saif.A.Ellafi
Hi, I am trying to figure out a way to compress df.write.json() but have not been successful, even after changing spark.io.compression. Any thoughts? Saif
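In 1.x there is no compression option on df.write.json; one workaround, sketched here with a hypothetical output path, is to go through the RDD API, which accepts a Hadoop codec:

    import org.apache.hadoop.io.compress.GzipCodec

    // df.toJSON yields an RDD[String] of JSON lines; saveAsTextFile can compress it
    df.toJSON.saveAsTextFile("/data/out/json_gz", classOf[GzipCodec])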

RE: What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Thank you, so I was confused in the opposite direction. At first I thought MLlib was the future, but based on what you say, MLlib will be the past. Interesting. This means that if I move forward using the pipelines system, I shouldn't end up obsolete. Any more insights welcome, Saif -Original Message-

What is the current status of ML ?

2015-09-01 Thread Saif.A.Ellafi
Hi all, I am a little bit confused, having been introduced to Spark recently. What is going on with ML? Is it going to be deprecated? Or are all of its features valid and being built upon? It has a set of features and ML constructors which I like to use, but I need to confirm whether the future of th

Is there a way to store RDD and load it with its original format?

2015-08-27 Thread Saif.A.Ellafi
Hi, Is there any way to store/load RDDs keeping their original object type instead of strings? I am having trouble with parquet (there is always some error at class conversion), and I don't use Hadoop. Looking for alternatives. Thanks in advance, Saif
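One alternative that does not require a Hadoop cluster (though it still goes through the Hadoop file API) is object files, which keep the element type via Java serialization; a sketch with made-up names:

    case class Record(id: Long, name: String)

    val rdd = sc.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))
    rdd.saveAsObjectFile("/tmp/records.obj")

    val restored = sc.objectFile[Record]("/tmp/records.obj")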

Best way to filter null on "any" column?

2015-08-27 Thread Saif.A.Ellafi
Hi all, What would be a good way to filter rows in a dataframe, where any value in a row has null? I wouldn't want to go through each column manually. Thanks, Saif
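DataFrameNaFunctions covers exactly this; a short sketch (the column names in the second call are hypothetical):

    // drop every row that has a null in any column
    val clean = df.na.drop()

    // or only check a subset of columns
    val clean2 = df.na.drop(Seq("col_a", "col_b"))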

RE: Help! Stuck using withColumn

2015-08-27 Thread Saif.A.Ellafi
Hello, thank you for the response. I found a blog where a guy explains that it is not possible to join columns from different data frames. I was trying to modify one column’s information, so selecting it and then trying to replace the original dataframe column. Found another way, Thanks Saif

RE: Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
I can reproduce this even simpler with the following: val gf = sc.parallelize(Array(3,6,4,7,3,4,5,5,31,4,5,2)).toDF("ASD") val ff = sc.parallelize(Array(4,6,2,3,5,1,4,6,23,6,4,7)).toDF("GFD") gf.withColumn("DSA", ff.col("GFD")) org.apache.spark.sql.AnalysisException: resolved attribute(s) GFD#42

Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
This simple command call: val final_df = data.select("month_balance").withColumn("month_date", data.col("month_date_curr")) Is throwing: org.apache.spark.sql.AnalysisException: resolved attribute(s) month_date_curr#324 missing from month_balance#234 in operator !Project [month_balance#234, mon
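The root cause is that month_date_curr is projected away by the first select before withColumn tries to resolve it; a sketch of the usual fix is to resolve both columns in a single select:

    val final_df = data.select(
      data("month_balance"),
      data("month_date_curr").as("month_date"))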

Scala: Overload method by its class type

2015-08-25 Thread Saif.A.Ellafi
Hi all, I have SomeClass[TYPE] { def some_method(args: fixed_type_args): TYPE } And on runtime, I create instances of this class with different AnyVal + String types, but the return type of some_method varies. I know I could do this with an implicit object, IF some_method received a type, but

Set custom worker id?

2015-08-21 Thread Saif.A.Ellafi
Hi, Is it possible in standalone mode to set up worker ID names, to avoid the worker-19248891237482379-ip..-port ones? Thanks, Saif

Want to install lz4 compression

2015-08-21 Thread Saif.A.Ellafi
Hi all, I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not in Hadoop's native libraries, so I am not able to use it. Can anyone suggest how to proceed? Hopefully I won't have to recompile Hadoop. I tried changing the --driver-library-path to point directly into lz4 stand a

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
It is okay. this methodology works very well for mapping objects of my Seq[Any]. It is indeed very cool :-) Saif From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, August 19, 2015 10:47 AM To: Ellafi, Saif A. Cc: ; ; Subject: Re: Scala: How to match a java object Saif: In your exam

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
Hi, thank you all for the assistance. It is odd; it works when creating a new java.math.BigDecimal object, but not if I work directly with scala> 5 match { case x: java.math.BigDecimal => 2 } :23: error: scrutinee is incompatible with pattern type; found : java.math.BigDecimal required: Int

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, Can you please elaborate? I am confused :-) Saif -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, August 18, 2015 5:15 PM To: Ellafi, Saif A. Cc: wrbri...@gmail.com; user@spark.apache.org Subject: Re: Scala: How to match a java object On Tue, A

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, thank you for the further assistance. You can reproduce this by simply running 5 match { case java.math.BigDecimal => 2 } In my personal case, I am applying a map action to a Seq[Any], so the elements inside are of type Any, to which I need to apply a proper .asInstanceOf[WhoYouShouldBe]. Saif

Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi all, I am trying to run a spark job, in which I receive java.math.BigDecimal objects, instead of the scala equivalents, and I am trying to convert them into Doubles. If I try to match-case this object class, I get: "error: object java.math.BigDecimal is not a value" How could I get around m
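For reference, the pattern that works is a typed binding rather than naming the class as a value; a small sketch aimed at the Seq[Any] use case described in the follow-ups:

    def toDouble(v: Any): Double = v match {
      case d: java.math.BigDecimal => d.doubleValue   // bind and convert
      case d: Double               => d
      case i: Int                  => i.toDouble
      case other => throw new IllegalArgumentException(s"unexpected value: $other")
    }

    val doubles = Seq[Any](new java.math.BigDecimal("1.5"), 2.0, 3).map(toDouble)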

Help with persist: Data is requested again

2015-08-14 Thread Saif.A.Ellafi
Hello all, I am writing a program which reads from a database. I run a couple of computations, but at the end I have a while loop, in which I make a modification to the persisted data. e.g.: val data = PairRDD... persist() var i = 0 while (i < 10) { val data_mod = data.map(_._1 + 1, _._2)
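persist() only marks the RDD; nothing is cached until an action runs, so a sketch of the usual remedy is to force one materializing action before the loop (names follow the snippet above and are otherwise hypothetical):

    val data = pairRDD.persist()   // pairRDD: the JDBC-backed pair RDD
    data.count()                   // forces the single read from the database into the cache

    var i = 0
    while (i < 10) {
      val data_mod = data.map { case (k, v) => (k + 1, v) }   // now reads from the cache
      data_mod.count()
      i += 1
    }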

Cannot cast to Tuple when running in cluster mode

2015-08-14 Thread Saif.A.Ellafi
Hi all, I have a working program, in which I create two big Tuple2s out of the data. This seems to work in local mode, but when I switch over to standalone cluster mode, I get this error at the very beginning: 15/08/14 10:22:25 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 10, 162.101.194.44): j

Does print/event logging affect performance?

2015-08-11 Thread Saif.A.Ellafi
Hi all, silly question. Does logging info messages, both print or to file, or event logging, cause any impact to general performance of spark? Saif

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From: Ellafi, Saif A. Sent: Tuesday, August 11, 2015 12:01 PM To: Ellafi, Saif A.; deanwamp...@gmail.com Cc: user@spark.apache.org Subject: RE: Parquet without hadoop: Possible? Sorry, I

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Sorry, I provided bad information. This example worked fine with reduced parallelism. It seems my problem has to do with something specific to the real data frame at the reading point. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Tuesday, August 11, 2015

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I am launching my spark-shell spark-1.4.1-bin-hadoop2.6/bin/spark-shell 15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF scala> data.write.parquet("/var/data/Saif/pq")

Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Hi all, I don't have any hadoop fs installed on my environment, but I would like to store dataframes in parquet files. I am failing to do so, if possible, anyone have any pointers? Thank you, Saif

RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hello, thank you, but that port is unreachable for me. Can you please share where can I find that port equivalent in my environment? Thank you Saif From: François Pelletier [mailto:newslett...@francoispelletier.org] Sent: Friday, August 07, 2015 4:38 PM To: user@spark.apache.org Subject: Re: Spa

Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hi, A silly question here. The driver web UI dies when the spark-submit program finishes. I would like some time to analyze it after the program ends, but the page does not refresh itself, and when I hit F5 I lose all the info. Thanks, Saif

Possible bug: JDBC with Speculative mode launches orphan queries

2015-08-07 Thread Saif.A.Ellafi
Hello, When enabling speculation, my first job is to launch a partitioned JDBC dataframe query, in which some partitions take longer than others to respond. This causes speculation and creates new nodes to launch the query. When one of those nodes finish the query, the speculative one remains f

RE: Too many open files

2015-07-29 Thread Saif.A.Ellafi
Thank you both, I will take a look, but 1. For high-shuffle tasks, is it right for the system to have the size and thresholds set high? I hope there are no bad consequences. 2. I will try to work around admin access and see if I can get anything with only user rights. From: Ted Yu [mailt

Too many open files

2015-07-29 Thread Saif.A.Ellafi
Hello, I've seen a couple emails on this issue but could not find anything to solve my situation. Tried to reduce the partitioning level, enable consolidateFiles and increase the sizeInFlight limit, but still no help. Spill manager is sort, which is the default, any advice? 15/07/29 10:37:01

RE: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Thank you for your response Zhen, I am using some vendor-specific JDBC driver JAR file (honestly I don't know where it came from). Its API is NOT like JdbcRDD; instead, it is more like jdbc from DataFrameReader https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameRea

Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Hi all, I am experimenting with and learning about performance on big tasks locally, with a 32-core node and more than 64GB of RAM; data is loaded from a database through a JDBC driver, and heavy computations are launched against it. I am presented with two questions: 1. My RDD is poorly distributed. I
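The follow-up mentions jdbc from DataFrameReader; its partitioned form, sketched below with an invented URL, table and bounds, is what spreads the read across cores by splitting a numeric column into ranges:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "...")          // credentials elided
    props.setProperty("password", "...")

    val df = sqlContext.read.jdbc(
      "jdbc:vendor://host:port/db",           // hypothetical URL
      "big_table",
      "id",                                   // partitionColumn: a roughly uniform numeric column
      0L,                                     // lowerBound
      10000000L,                              // upperBound
      32,                                     // numPartitions, e.g. one per core
      props)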

CPU Parallelization not being used (local mode)

2015-07-27 Thread Saif.A.Ellafi
Hi all, I would like some insight. I am currently computing over huge databases, and playing with monitoring and tuning. When monitoring the multiple cores I have, I see that even when RDDs are parallelized, computation on the RDD jumps from core to core sporadically (I guess, depending on where the

suggest coding platform

2015-07-24 Thread Saif.A.Ellafi
Hi all, I tried Notebook Incubator Zeppelin, but I am not completely happy with it. What do you people use for coding? Anything with auto-complete, proper warning logs and perhaps some colored syntax. My platform is on linux, so anything with some notebook studio, or perhaps a windows IDE with

RE: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Saif.A.Ellafi
Thank you very much, working fine so far Saif From: Robin East [mailto:robin.e...@xense.co.uk] Sent: Thursday, July 23, 2015 12:26 PM To: Rishi Yadav Cc: Ellafi, Saif A.; user@spark.apache.org; Liu, Weicheng Subject: Re: [MLLIB] Anyone tried correlation with RDD[Vector] ? The OP’s problem is he

[MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Saif.A.Ellafi
I tried with an RDD[DenseVector], but RDDs are not covariant (no +T), so an RDD[DenseVector] is not accepted where an RDD[Vector] is expected, and I can't get to use the RDD input method of correlation. Thanks, Saif
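A sketch of the usual fix: because RDD is invariant, the elements have to be typed as the Vector supertype when the RDD is built, after which Statistics.corr accepts it directly:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    val rows: RDD[Vector] = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))

    val corrMatrix = Statistics.corr(rows)   // Pearson correlation matrix by default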

Spark on scala 2.11

2015-07-17 Thread Saif.A.Ellafi
Hello everyone, I need help adding or replacing an interpreter with Spark on scala 2.11 instead of 2.10.4. Has anyone tried? Tried two things, changing pom files pointing to 2.11 and creating new interpreter but I am failing to make it actually appear. Any guidance appreciated. Saif

RE: Select all columns except some

2015-07-17 Thread Saif.A.Ellafi
Hello, thank you for your time. Seq[String] works perfectly fine. I also tried running a for loop through all elements to see if any access to a value was broken, but no, they are alright. For now, I solved it properly calling this. Sadly, it takes a lot of time, but works: var data_sas = sql

Select all columns except some

2015-07-16 Thread Saif.A.Ellafi
Hi, In a hundred-column dataframe, I wish to either select all of the columns except some, or drop the ones I don't want. I am failing at such a simple task; I tried two ways val clean_cols = df.columns.filterNot(col_name => col_name.startWith("STATE_").mkString(", ") df.select(clean_cols) But this thro
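For comparison, a working version of the same idea (the misplaced mkString is the culprit above; select wants Columns or names, not one comma-joined string):

    import org.apache.spark.sql.functions.col

    val keep = df.columns.filterNot(_.startsWith("STATE_"))
    val slim = df.select(keep.map(col): _*)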

MLLIB RDD segmentation for logistic regression

2015-07-13 Thread Saif.A.Ellafi
Hello all, I have one big RDD, in which there is a column of groups A1, A2, B1, B2, B3, C1, D1, ..., XY. Out of it, I am using map() to transform into RDD[LabeledPoint] with dense vectors for later use into Logistic Regression, which takes RDD[LabeledPoint] I would like to run a logistic regress

RE: java.io.InvalidClassException

2015-07-13 Thread Saif.A.Ellafi
Thank you, extending Serializable solved the issue. I am left with more questions than answers though :-). Regards, Saif From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Monday, July 13, 2015 2:49 PM To: Ellafi, Saif A. Cc: user@spark.apache.org; Liu, Weicheng Subject: Re: java.io.Inva
