Why is my Spark job slow and throwing OOM, which leads to YARN killing executors?

2015-09-12 Thread unk1102
final FileSystem fs = FileSystem.get(sc.hadoopConfiguration()); HiveContext hiveContext = createHiveContext(sc); declareHiveUDFs(hiveContext); DateTimeFormatter df = DateTimeFormat.forPattern("yyyyMMdd"); String yestday = "20150912"; hiveContext.sql(" use

[Question] ORC - EMRFS Problem

2015-09-12 Thread Cazen
Good day! I think there are some problems between ORC and AWS EMRFS. When I was trying to read ORC files larger than about 150 MB from S3, an ArrayIndexOutOfBounds exception occurred. I'm sure that it's an AWS-side issue because there was no exception when trying from HDFS or S3NativeFileSystem. Parquet runs

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Jörn Franke
I fear you have to do the plumbing all yourself. This is the same for all commercial and non-commercial libraries/analytics packages. It also often depends on the functional requirements for how you distribute. On Sat, Sep 12, 2015 at 20:18, Rex X wrote: > Hi everyone, > >

RDD transformation and action running out of memory

2015-09-12 Thread Utkarsh Sengar
I am trying to run this, a basic mapToPair and then count() to trigger an action. 4 executors are launched but I don't see any relevant logs on those executors. It looks like the driver is pulling all the data and it runs out of memory; the dataset is big, so it won't fit on 1 machine. So

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
Jorn and Nick, Thanks for answering. Nick, the sparkit-learn project looks interesting. Thanks for mentioning it. Rex On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath wrote: > You might want to check out https://github.com/lensacom/sparkit-learn >

What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
Hi everyone, What is the best way to migrate existing scikit-learn code to PySpark cluster? Then we can bring together the full power of both scikit-learn and spark, to do scalable machine learning. (I know we have MLlib. But the existing code base is big, and some functions are not fully

Re: Why is my Spark job slow and throwing OOM, which leads to YARN killing executors?

2015-09-12 Thread Richard W. Eggert II
final FileSystem fs = FileSystem.get(sc.hadoopConfiguration()); HiveContext hiveContext = createHiveContext(sc); declareHiveUDFs(hiveContext); DateTimeFormatter df = DateTimeFormat.forPattern("yyyyMMdd"); String yestday = "20150912"; hiveContext

Re: RDD transformation and action running out of memory

2015-09-12 Thread Richard Eggert
Hmm... The count() method invokes this: def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = { runJob(rdd, func, 0 until rdd.partitions.length) } It appears that you're running out of memory while trying to compute (within the driver) the number of partitions that will
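For reference, a minimal Scala sketch of the point being made above: count() runs as distributed tasks and only returns a number to the driver, while the partition metadata (e.g. file splits) is computed in the driver before any task starts. The input path and partition count here are made up.

    import org.apache.spark.{SparkConf, SparkContext}

    object CountSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("count-sketch"))

        // Hypothetical input path; minPartitions is only a hint for the split
        // count and does not pull any data into the driver.
        val lines = sc.textFile("hdfs:///data/big-dataset", 400)

        // Partition metadata is computed in the driver; for a huge number of
        // small files this step alone can be expensive.
        println(s"partitions = ${lines.partitions.length}")

        // count() executes on the executors; only per-partition counts come
        // back to the driver, never the records themselves.
        println(s"records = ${lines.count()}")

        sc.stop()
      }
    }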

UDAF and UDT with SparkSQL 1.5.0

2015-09-12 Thread jussipekkap
Hi, Issue #1: I'm using the new UDAF interface (UserDefinedAggregateFunction) at Spark 1.5.0 release. Is it possible to aggregate all values in the MutableAggregationBuffer into an array in a robust manner? I'm creating an aggregation function that collects values into an array from all input
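Without the full code, here is a minimal sketch of the kind of UDAF described above, written against the Spark 1.5.0 UserDefinedAggregateFunction interface. The class and column names are made up; it collects input strings into an array by replacing the buffer's Seq on every update, which is the part that tends to feel fragile.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Hypothetical "collect to array" UDAF for Spark 1.5.0.
    class CollectToArray extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("values", ArrayType(StringType)) :: Nil)
      def dataType: DataType = ArrayType(StringType)
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Seq.empty[String]

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        // The buffer value must be replaced wholesale; in-place appends are not supported.
        buffer(0) = buffer.getSeq[String](0) :+ input.getString(0)

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)

      def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
    }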

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Nick Pentreath
You might want to check out https://github.com/lensacom/sparkit-learn Though it's true that for random forests / trees you will need to use MLlib — Sent from Mailbox On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke wrote: > I fear you have to do the plumbing all yourself.

Re: Why is my Spark job slow and throwing OOM, which leads to YARN killing executors?

2015-09-12 Thread Umesh Kacha
Name("SparkTest"); > setSparkConfProperties(sparkConf); > SparkContext sc = new SparkContext(sparkConf); > final FileSystem fs = FileSystem.get(sc.hadoopConfiguration()); > HiveContext hiveContext = createHiveContext(sc); > >

Spark Streaming..Exception

2015-09-12 Thread Priya Ch
Hello all, when I push messages into Kafka and read them into the streaming application, I see the following exception. I am running the application on YARN and am not broadcasting the message anywhere within the application. I am simply reading a message, parsing it, populating fields in a class and then

Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-12 Thread Jagat Singh
Sorry, to answer your question fully: the job starts tasks, and a few of them fail while some are successful. The failed ones have that PermGen error in the logs. But ultimately the full job is marked as failed and the session quits. On Sun, Sep 13, 2015 at 10:48 AM, Jagat Singh wrote: > Hi

Cogrouping data in dataframes - PairRDD cogroup vs. join - best practices

2015-09-12 Thread Matthew Denny
I was wondering what the best practices are for cogrouping data in DataFrames. While joins are obviously a powerful tool, it seems that there are still some cases where using cogroup (which is only supported by PairRDDs) is a better choice. Consider 2 case classes with the
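For reference, a minimal Scala sketch of the cogroup pattern being discussed; the case classes and data are made-up stand-ins for the ones in the message.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical case classes standing in for the ones described above.
    case class Order(customerId: Int, amount: Double)
    case class Payment(customerId: Int, amount: Double)

    object CogroupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cogroup-sketch"))

        val orders = sc.parallelize(Seq(Order(1, 10.0), Order(1, 5.0), Order(2, 7.5)))
          .map(o => (o.customerId, o))
        val payments = sc.parallelize(Seq(Payment(1, 15.0)))
          .map(p => (p.customerId, p))

        // cogroup yields one record per key with *all* matching values from each
        // side, avoiding the row explosion a join would produce for skewed keys.
        val grouped = orders.cogroup(payments)
        grouped.collect().foreach { case (id, (os, ps)) =>
          println(s"customer $id: orders=${os.toList}, payments=${ps.toList}")
        }

        sc.stop()
      }
    }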

Limiting number of cores per job in multi-threaded driver.

2015-09-12 Thread Philip Weaver
I'm playing around with dynamic allocation in spark-1.5.0, with the FAIR scheduler, so I can define a long-running application capable of executing multiple simultaneous Spark jobs. The kinds of jobs that I'm running do not benefit from more than 4 cores, but I want my application to be able to
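For reference, a minimal Scala sketch of how jobs are routed to FAIR scheduler pools from a multi-threaded driver. Note that pools shape sharing via weight and minShare rather than enforcing a hard per-job core cap; the pool name and allocation-file path below are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}

    object FairPoolSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("fair-pool-sketch")
          .set("spark.scheduler.mode", "FAIR")
          // Optional: pool weights/minShare come from an XML file; this path is hypothetical.
          .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
        val sc = new SparkContext(conf)

        // Each thread submitting jobs tags them with a pool (a thread-local
        // property); the FAIR scheduler then shares slots between pools.
        val worker = new Thread(new Runnable {
          def run(): Unit = {
            sc.setLocalProperty("spark.scheduler.pool", "small_jobs")
            sc.parallelize(1 to 1000000, 8).map(_ * 2).count()
          }
        })
        worker.start()
        worker.join()
        sc.stop()
      }
    }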

Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-12 Thread Jagat Singh
Hi Davies, This was the first query on the new version. The one which ran successfully was the Spark Pi example: ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-client \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \
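For reference, a hedged Scala sketch of the usual workaround for PermGen errors on Spark 1.5 with Java 7: raise MaxPermSize. The executor-side option can be set in the SparkConf, but the driver JVM is already running by the time application code executes, so its option has to be passed at submit time; the sizes here are illustrative only.

    import org.apache.spark.{SparkConf, SparkContext}

    object PermGenSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("permgen-sketch")
          // Raises PermGen for the *executors*; the value is illustrative.
          .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")
        // The driver JVM is already running here, so its PermGen must be set at
        // submit time instead, e.g.:
        //   spark-submit --driver-java-options "-XX:MaxPermSize=256m" ...
        val sc = new SparkContext(conf)
        // ... run the SQL workload that was hitting PermGen space ...
        sc.stop()
      }
    }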

What happens when cache is full?

2015-09-12 Thread Hemminger Jeff
I am trying to understand the process of caching and specifically what the behavior is when the cache is full. Please excuse me if this question is a little vague; I am trying to build my understanding of this process. I have an RDD that I perform several computations with; I persist it with
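For reference, a minimal Scala sketch of the behavior being asked about: with MEMORY_ONLY, partitions that do not fit are dropped and recomputed from lineage when needed again, while MEMORY_AND_DISK spills them to local disk instead. The input path is hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CacheFullSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-full-sketch"))

        // MEMORY_ONLY (the default for cache()): partitions that don't fit are
        // dropped in LRU order and recomputed from lineage when needed again.
        val lengths = sc.textFile("hdfs:///data/events").map(_.length) // hypothetical path
        lengths.persist(StorageLevel.MEMORY_ONLY)
        println(lengths.count())

        // A separate RDD persisted with MEMORY_AND_DISK: partitions that don't
        // fit in memory spill to local disk instead of being recomputed.
        val doubled = sc.textFile("hdfs:///data/events").map(_.length.toLong * 2)
        doubled.persist(StorageLevel.MEMORY_AND_DISK)
        println(doubled.count())

        sc.stop()
      }
    }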

Re: Multithreaded vs Spark Executor

2015-09-12 Thread Richard Eggert
Parallel processing is what Spark was made for. Let it do its job. Spawning your own threads independently of what Spark is doing seems like you'd just be asking for trouble. I think you can accomplish what you want by taking the cartesian product of the data element RDD and the feature list RDD
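For reference, a minimal Scala sketch of the cartesian-product suggestion above; the RDD contents and per-pair computation are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object CartesianSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cartesian-sketch"))

        // Hypothetical stand-ins for the "data element" and "feature" RDDs.
        val elements = sc.parallelize(Seq("a", "b", "c"))
        val features = sc.parallelize(Seq(1, 2, 3, 4))

        // Every (element, feature) pair becomes task input, so Spark handles the
        // parallelism instead of user-managed threads in the driver.
        val scored = elements.cartesian(features).map { case (e, f) =>
          (e, f, s"$e-$f") // placeholder for the real per-pair computation
        }
        scored.collect().foreach(println)

        sc.stop()
      }
    }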

change the spark version

2015-09-12 Thread Angel Angel
Respected sir, I installed two versions of Spark: 1.2.0 (Cloudera 5.3) and 1.4.0. I am running an application that needs Spark 1.4.0; the application is related to deep learning. So how can I remove version 1.2.0 and run my application on version 1.4.0? When I run the command

Spark K-means number of iterations?

2015-09-12 Thread ashensw
Hi all, I want to know whether the K-means algorithm stops if the data set converges to stable clusters before reaching the number of iterations that we defined. As an example, if I give 100 as the number of iterations and 5 as the number of clusters, but the data set converges to stable 5

Re: change the spark version

2015-09-12 Thread Sean Owen
This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5 has 1.5. The best thing is to update CDH as a whole if you can. However it's pretty simple to just run a newer Spark assembly as a YARN app. Don't remove anything in the CDH installation. Try downloading the assembly and

Re: SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-12 Thread Jörn Franke
I am not sure what you are trying to achieve here. Have you thought about using Flume? Additionally, maybe something like rsync? On Sat, Sep 12, 2015 at 0:02, Varadhan, Jawahar wrote: > Hi all, >I have coded a custom receiver which receives kafka messages.

Re: Implement "LIKE" in SparkSQL

2015-09-12 Thread liam
OK, I found another way; it looks silly and inefficient, but it works. tradeDF.registerTempTable(tradeTab); orderDF.registerTempTable(orderTab); //orderId = tid + "_x" String sql1 = "select * from " + tradeTab + " a, " + orderTab + " b where substr(b.orderId,1,15) = substr(a.tid,1) "; String
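For reference, a hedged Scala sketch of a less roundabout prefix match using the Spark 1.5 DataFrame API instead of the fixed-length substr trick; the table and column names are taken from the message, everything else is assumed. Note the condition is not an equi-join, so Spark may plan it as a cartesian product plus filter.

    // Assumes tradeDF (with column "tid") and orderDF (with column "orderId")
    // as in the message above.
    import org.apache.spark.sql.DataFrame

    def prefixJoin(tradeDF: DataFrame, orderDF: DataFrame): DataFrame = {
      // orderId = tid + "_x", so a prefix match replaces the fixed-length substr.
      // A non-equi join condition like this can be expensive on large tables.
      tradeDF.join(orderDF, orderDF("orderId").startsWith(tradeDF("tid")))
    }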

Re: Help with collect() in Spark Streaming

2015-09-12 Thread Luca
I am trying to implement an application that requires the output to be aggregated and stored as a single txt file to HDFS (instead of, for instance, having 4 different txt files coming from my 4 workers). The solution I used does the trick, but I can't tell if it's ok to regularly stress one of
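For reference, a minimal Scala sketch of the usual "single text file" trick in a streaming job: coalesce(1) before saving. The source, port, and output path are hypothetical; the single writing task per batch is exactly the "stressing one worker" trade-off raised above.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SingleFileSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("single-file-sketch"), Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

        lines.foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty()) {
            // coalesce(1) forces a single partition, hence a single part- file;
            // one task writes the whole batch.
            rdd.coalesce(1).saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }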

How to create broadcast variable from Java String array?

2015-09-12 Thread unk1102
Hi, I have a Java String array which contains 45 strings and is basically a schema: String[] fieldNames = {"col1","col2",...}; Currently I am storing the above array of Strings in a driver static field. My job is running slow, so while trying to refactor the code: I am using the String array in creating a DataFrame
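For reference, a minimal sketch (in Scala rather than the Java of the original message) of broadcasting such a schema array once so each executor holds a single read-only copy instead of the array being shipped inside every task closure. The field names and sample rows are abbreviated placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSchemaSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-schema-sketch"))

        // The 45-element schema array from the message, abbreviated here.
        val fieldNames = Array("col1", "col2", "col3")

        // Broadcast it once; tasks read it via .value on the executors.
        val fieldNamesBC = sc.broadcast(fieldNames)

        val rows = sc.parallelize(Seq(Seq("a", "b", "c"), Seq("d", "e", "f")))
        val labelled = rows.map(values => fieldNamesBC.value.zip(values).toMap)
        labelled.collect().foreach(println)

        sc.stop()
      }
    }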

Re: Spark K-means number of iterations?

2015-09-12 Thread ashensw
Okay. Thanks for the help. On Sat, Sep 12, 2015 at 1:16 PM, Robineast [via Apache Spark User List] < ml-node+s1001560n24665...@n3.nabble.com> wrote: > Yes it does stop if the algorithm converges in less than the specified > tolerance. You have a parameter to the Kmeans constructor called epsilon
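For reference, a minimal Scala sketch of the epsilon parameter mentioned in the reply above, using MLlib's KMeans setters; the toy data and values are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansEpsilonSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-epsilon-sketch"))

        // Toy data; real input would come from a file.
        val points = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
        )).cache()

        val model = new KMeans()
          .setK(2)
          .setMaxIterations(100) // upper bound only
          .setEpsilon(1e-4)      // convergence tolerance: iteration stops early
                                 // once no centre moves more than this distance
          .run(points)

        model.clusterCenters.foreach(println)
        sc.stop()
      }
    }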
