Re: Spark can't pickle class: error cannot lookup attribute

2015-02-19 Thread Guillaume Guy
Thanks Davies and Eric. I followed Davies' instructions and it works wonderfully. I would add that you can also use these scripts in the pyspark shell: pyspark --py-files support.py, where support.py is your script containing your class as Davies described. Best, Guillaume Guy * +1 919 -

Re: SchemaRDD.select

2015-02-19 Thread Michael Armbrust
The trick here is getting the scala compiler to do the implicit conversion from Symbol -> Column. In your second example, the compiler doesn't know that you are going to try and use the Seq[Symbol] as a Seq[Column] and so doesn't do the conversion. The following are other ways to provide enough

Streaming Linear Regression

2015-02-19 Thread barisak
Hi, I tried to run Streaming Linear Regression on my local machine. val trainingData = ssc.textFileStream("/home/barisakgu/Desktop/Spark/train").map(LabeledPoint.parse) textFileStream is not seeing the new files. I searched on the Internet and saw that somebody had the same issue, but no solution is found

Re: Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Pavel Velikhov
Yeah, I do manually delete the files, but it still fails with this error. On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: When writing to hdfs Spark will not overwrite existing files or directories. You must either manually delete these or use Java's Hadoop

Spark streaming program with Tranquility/Druid

2015-02-19 Thread Jaal
Hello all, I am trying to integrate Spark with Tranquility (https://github.com/metamx/tranquility) and when I start the Spark program, I get the error below: java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass at

Re: Unknown sample in Naive Baye's

2015-02-19 Thread Xiangrui Meng
If you know there is data that doesn't belong to any existing category, put it into the training set and make a new category for it. It won't help much if instances from this unknown category are all outliers. In that case, lower the thresholds and tune the parameters to get a lower error rate.

Re: stack map functions in a loop (pyspark)

2015-02-19 Thread Davies Liu
On Thu, Feb 19, 2015 at 7:57 AM, jamborta jambo...@gmail.com wrote: Hi all, I think I have run into an issue with the lazy evaluation of variables in pyspark. I have the following: functions = [func1, func2, func3] for counter in range(len(functions)): data = data.map(lambda value:

Re: SchemaRDD.select

2015-02-19 Thread Cesar Flores
Well, I think that I solved my issue in the following way: val variable_fieldsStr = List("field1", "field2") val variable_argument_list = variable_fieldsStr.map(f => Alias(Symbol(f), f)()) val schm2 = myschemaRDD.select(variable_argument_list:_*) schm2 seems to have the required fields, but would like

spark-sql problem with textfile separator

2015-02-19 Thread sparkino
Hello everybody, I'm quite new to Spark and Scala as well, and I was trying to analyze some csv data via spark-sql. My csv file contains data like this. Following the example at the link below: https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

Re: Unzipping large files and 2GB partition size.

2015-02-19 Thread Joe Wass
Thanks for your reply Sean. Looks like it's happening in a map: 15/02/18 20:35:52 INFO scheduler.DAGScheduler: Submitting 100 missing tasks from Stage 1 (MappedRDD[17] at mapToPair at NativeMethodAccessorImpl.java:-2) That's my initial 'parse' stage, done before repartitioning. It reduces the

In a Spark Streaming application, what might be the potential causes for util.AkkaUtils: Error sending message in 1 attempts and java.util.concurrent.TimeoutException: Futures timed out and

2015-02-19 Thread Emre Sevinc
Hello, We have a Spark Streaming application that watches an input directory, and as files are copied there the application reads them and sends the contents to a RESTful web service, receives a response and writes some contents to an output directory. When testing the application by copying a

Unzipping large files and 2GB partition size.

2015-02-19 Thread Joe Wass
On the advice of some recent discussions on this list, I thought I would try and consume gz files directly. I'm reading them, doing a preliminary map, then repartitioning, then doing normal spark things. As I understand it, zip files aren't readable in partitions because of the format, so I

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Sean Owen
This should result in 4 executors, not 25. They should be able to execute 4*4 = 16 tasks simultaneously. You have them grab 4*32 = 128GB of RAM, not 1TB. It still feels like this shouldn't be running out of memory, not by a long shot though. But just pointing out potential differences between

Re: Tableau beta connector

2015-02-19 Thread Ashutosh Trivedi (MT2013030)
Hi, I would like you to read my stack overflow answer to this question. If you need more clarification feel free to drop a msg. http://stackoverflow.com/questions/28403664/connect-to-existing-hive-in-intellij-using-sbt-as-build Regards, Ashutosh From:

Re: In a Spark Streaming application, what might be the potential causes for util.AkkaUtils: Error sending message in 1 attempts and java.util.concurrent.TimeoutException: Futures timed out and

2015-02-19 Thread Tathagata Das
What version of Spark are you using? TD On Thu, Feb 19, 2015 at 2:45 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, We have a Spark Streaming application that watches an input directory, and as files are copied there the application reads them and sends the contents to a RESTful web

storing MatrixFactorizationModel (pyspark)

2015-02-19 Thread Antony Mayi
Hi, when getting the model out of ALS.train it would be beneficial to store it (to disk) so the model can be reused later for any following predictions. I am using pyspark and I had no luck pickling it using either the standard pickle module or even dill. Does anyone have a solution for this (note

Regarding minimum number of partitions while reading data from Hadoop

2015-02-19 Thread twinkle sachdeva
Hi, In our job, we need to process the data in small chunks, so as to avoid GC and other stuff. For this, we are using the old Hadoop API, as that lets us specify a parameter like minPartitions. Does anyone know if there is a way to do the same via the new Hadoop API as well? How will that way be different

Re: Unzipping large files and 2GB partition size.

2015-02-19 Thread Sean Owen
gzip and zip are not splittable compression formats; bzip and lzo are. Ideally, use a splittable compression format. Repartitioning is not a great solution since it means a shuffle, typically. This is not necessarily related to how big your partitions are. The question is, when does this happen?

loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
Hi, I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running spark 1.2.0 in yarn-client mode with the following layout: spark.executor.cores=4 spark.executor.memory=28G spark.yarn.executor.memoryOverhead=4096 I am submitting a bigger ALS trainImplicit task (rank=100, iters=15) on a

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
Based on the Spark UI I am running 25 executors for sure. Why would you expect four? I submit the task with --num-executors 25 and I get 6-7 executors running per host (using more, smaller executors gives me better cluster utilization when running parallel Spark sessions (which is not the case

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Sean Owen
Oh OK you are saying you are requesting 25 executors and getting them, got it. You can consider making fewer, bigger executors to pool rather than split up your memory, but at some point it becomes counter-productive. 32GB is a fine executor size. So you have ~8GB available per task which seems

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
It is from within the ALS.trainImplicit() call. By the way, the exception varies between this GC overhead limit exceeded and Java heap space (which I guess is just a different outcome of the same problem). I just tried another run and here are the logs (filtered) - note I tried this run with

Re: In a Spark Streaming application, what might be the potential causes for util.AkkaUtils: Error sending message in 1 attempts and java.util.concurrent.TimeoutException: Futures timed out and

2015-02-19 Thread Emre Sevinc
On Thu, Feb 19, 2015 at 12:27 PM, Tathagata Das t...@databricks.com wrote: What version of Spark are you using? TD Spark version is 1.2.0 (running on Cloudera CDH 5.3.0) -- Emre Sevinç

Re: Regarding minimum number of partitions while reading data from Hadoop

2015-02-19 Thread Sean Owen
I think that the newer Hadoop API does not expose this suggested min-partitions parameter like the old one did. I believe you can instead try setting mapreduce.input.fileinputformat.split.{min,max}size on the Hadoop Configuration to suggest a max/min split size, and therefore bound the number of
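A minimal sketch of that suggestion, assuming an existing SparkContext sc and a placeholder path; the split-size values are illustrative only:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Copy the cluster configuration and bound the split size so the
    // new-API read yields more, smaller partitions.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)
    conf.set("mapreduce.input.fileinputformat.split.minsize", (16 * 1024 * 1024).toString)

    val lines = sc.newAPIHadoopFile("hdfs:///data/input", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf).map(_._2.toString)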

Re: spark-sql problem with textfile separator

2015-02-19 Thread Yanbo Liang
This is because each line will be split into 4 columns instead of 3. If you use a comma to separate columns, the column values themselves cannot contain commas. 2015-02-19 18:12 GMT+08:00 sparkino francescoboname...@gmail.com: Hello everybody, I'm quite new to

Spark on Mesos: Multiple Users with iPython Notebooks

2015-02-19 Thread John Omernik
I am running Spark on Mesos and it works quite well. I have three users, all who setup iPython notebooks to instantiate a spark instance to work with on the notebooks. I love it so far. Since I am auto instantiating (I don't want a user to have to think about instantiating and submitting a spark

Re: ML Transformer

2015-02-19 Thread Peter Rudenko
Hi Cesar, these methods will be private until the new ML API stabilizes (approximately in Spark 1.4). My solution for the same issue was to create an org.apache.spark.ml package in my project and extend/implement everything there. Thanks, Peter Rudenko On 2015-02-18 22:17, Cesar Flores wrote: I

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
now with spark.shuffle.io.preferDirectBufs reverted (to true), getting GC overhead limit exceeded again: === spark stdout === 15/02/19 12:08:08 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 18.0 (TID 5329, 192.168.1.93): java.lang.OutOfMemoryError: GC overhead limit exceeded at

Re: spark-sql problem with textfile separator

2015-02-19 Thread Francesco Bonamente
Hi Yanbo, unfortunately all csv files contain commas inside some columns and I can't change the structure. How can I work with this kind of textfile and spark-sql? Thank you again 2015-02-19 14:38 GMT+01:00 Yanbo Liang hackingda...@gmail.com: This is because of each line will be separated

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Cody Koeninger
At the beginning of the code, do a query to find the current maximum ID Don't just put in an arbitrarily large value, or all of your rows will end up in 1 spark partition at the beginning of the range. The question of keys is up to you... all that you need to be able to do is write a sql
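A rough sketch of that approach with JdbcRDD, assuming an existing SparkContext sc; the URL, credentials, table, and column names are placeholders:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val url = "jdbc:postgresql://dbhost/mydb"
    def getConnection() = DriverManager.getConnection(url, "user", "pass")

    // Query the current maximum id first so the partition ranges cover the data.
    val maxId: Long = {
      val conn = getConnection()
      try {
        val rs = conn.createStatement().executeQuery("SELECT MAX(id) FROM events")
        rs.next(); rs.getLong(1)
      } finally conn.close()
    }

    // The SQL must contain two '?' placeholders for the partition bounds.
    val rows = new JdbcRDD(sc, getConnection _,
      "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
      1L, maxId, 20, (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))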

Re: SparkSQL + Tableau Connector

2015-02-19 Thread Silvio Fiorito
Great, glad it worked out! From: Todd Nist Date: Thursday, February 19, 2015 at 9:19 AM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: SparkSQL + Tableau Connector Hi Silvio, I got this working today using your suggestion with the Initial SQL and a Custom

Re: storing MatrixFactorizationModel (pyspark)

2015-02-19 Thread Ilya Ganelin
Yep. The matrix model has two RDDs of feature vectors representing the decomposed matrix. You can save these to disk and reuse them. On Thu, Feb 19, 2015 at 2:19 AM Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, when getting the model out of ALS.train it would be beneficial to store it (to disk) so
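A minimal Scala sketch of that idea (the thread asked about pyspark, but the pattern is the same): save the two factor RDDs, reload them later, and score a pair by taking the dot product of the factors. Paths, ids, and the ratings RDD are placeholders:

    import org.apache.spark.mllib.recommendation.ALS

    val model = ALS.trainImplicit(ratings, 10, 10)   // rank = 10, iterations = 10
    model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
    model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

    // Later: reload the factors and compute a prediction by hand.
    val userF = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
    val prodF = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
    val u = userF.lookup(42).head
    val p = prodF.lookup(17).head
    val score = u.zip(p).map { case (a, b) => a * b }.sum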

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Ilya Ganelin
Hi Antony - you are seeing a problem that I ran into. The underlying issue is your default parallelism setting. What's happening is that within ALS, certain RDD operations end up changing the number of partitions of your data. For example, if you start with an RDD of 300 partitions, unless

Why is RDD lookup slow?

2015-02-19 Thread shahab
Hi, I am doing lookup on cached RDDs [(Int,String)], and I noticed that the lookup is relatively slow (30-100 ms)! I even tried this on one machine with a single partition, but no difference. The RDDs are not large at all, 3-30 MB. Is this expected behaviour? Should I use other data structures,

Re: Why is RDD lookup slow?

2015-02-19 Thread Sean Owen
RDDs are not Maps. lookup() does a linear scan -- parallel by partition, but still linear. Yes, it is not supposed to be an O(1) lookup data structure. It'd be much nicer to broadcast the relatively small data set as a Map and look it up fast, locally. On Thu, Feb 19, 2015 at 3:29 PM, shahab
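A small sketch of the broadcast approach, with toy data standing in for the real (Int, String) RDD:

    // Collect the small pair RDD to the driver as a Map and broadcast it,
    // then do fast local lookups inside transformations.
    val small = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
    val lookupMap = sc.broadcast(small.collectAsMap())

    val keys = sc.parallelize(Seq(1, 2, 3, 4))
    val resolved = keys.map(k => (k, lookupMap.value.get(k)))   // Option[String]
    resolved.collect().foreach(println)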

Re: Why is RDD lookup slow?

2015-02-19 Thread Ilya Ganelin
Hi Shahab - if your data structures are small enough a broadcasted Map is going to provide faster lookup. Lookup within an RDD is an O(m) operation where m is the size of the partition. For RDDs with multiple partitions, executors can operate on it in parallel so you get some improvement for

Re: spark-sql problem with textfile separator

2015-02-19 Thread Yanbo Liang
For your case, I think you can use a trick: split on the sequence "," instead of on a bare comma. You can refer to the following code snippet val people = sc.textFile("examples/src/main/resources/data.csv").map(x => x.substring(1, x.length - 1).split("\",\"")).map(p => List(p(0), p(1), p(2))) On Feb 19, 2015, at 10:02
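The same trick written out as a runnable sketch, assuming every field in the file is wrapped in double quotes (e.g. "a, with comma","b","c"); otherwise a real CSV parser is the safer route:

    // Strip the outer quotes, then split on the quote-comma-quote sequence.
    val people = sc.textFile("examples/src/main/resources/data.csv")
      .map(x => x.substring(1, x.length - 1).split("\",\""))
      .map(p => List(p(0), p(1), p(2)))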

Re: bulk writing to HDFS in Spark Streaming?

2015-02-19 Thread Akhil Das
There was already a thread around it if i understood your question correctly, you can go through this https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/%3ccannjawtrp0nd3odz-5-_ya351rin81q-9+f2u-qn+vruqy+...@mail.gmail.com%3E Thanks Best Regards On Thu, Feb 19, 2015 at 8:16 PM,

Re: No suitable driver found error, Create table in hive from spark sql

2015-02-19 Thread Todd Nist
Hi Dhimant, I believe it will work if you change your spark-shell invocation to pass --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar instead of putting it in --jars. -Todd On Wed, Feb 18, 2015 at 10:41 PM, Dhimant dhimant84.jays...@gmail.com wrote: Found solution from one of the post found on

Re: Tableau beta connector

2015-02-19 Thread Todd Nist
I am able to connect by doing the following using the Tableau Initial SQL and a custom query: 1. First ingest csv file or json and save out to file system: import org.apache.spark.sql.SQLContext import com.databricks.spark.csv._ val sqlContext = new SQLContext(sc) val demo =
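A rough sketch of step 1 using only core Spark 1.2 APIs (jsonFile and saveAsParquetFile) rather than the spark-csv helper shown above; the paths are placeholders. The saved Parquet file is then what Tableau's Initial SQL registers as a table:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val demo = sqlContext.jsonFile("hdfs:///data/demo.json")
    demo.printSchema()
    demo.saveAsParquetFile("hdfs:///data/demo.parquet")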

Re: Filtering keys after map+combine

2015-02-19 Thread Sean Owen
You have the keys before and after reduceByKey. You want to do something based on the key within reduceByKey? it just calls combineByKey, so you can use that method for lower-level control over the merging. Whether it's possible depends I suppose on what you mean to filter on. If it's just a

Re: issue Running Spark Job on Yarn Cluster

2015-02-19 Thread Harshvardhan Chauhan
Is this the full stack trace ? On Wed, Feb 18, 2015 at 2:39 AM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I want to run my spark Job in Hadoop yarn Cluster mode, I am using below command - spark-submit --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
Hi Sean, This is what I intend to do: are you saying that you know a key should be filtered based on its value partway through the merge? I should use combineByKey... Thanks. Deb On Thu, Feb 19, 2015 at 6:31 AM, Sean Owen so...@cloudera.com wrote: You have the keys before and after

Spark 1.2.1: ClassNotFoundException when running hello world example in scala 2.11

2015-02-19 Thread Luis Solano
I'm having an issue with spark 1.2.1 and scala 2.11. I detailed the symptoms in this stackoverflow question. http://stackoverflow.com/questions/28612837/spark-classnotfoundexception-when-running-hello-world-example-in-scala-2-11 Has anyone experienced anything similar? Thank you!

Filter data from one RDD based on data from another RDD

2015-02-19 Thread Himanish Kushary
Hi, I have two RDD's with csv data as below : RDD-1 101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643 101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645 101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647

Incorrect number of records after left outer join (I think)

2015-02-19 Thread Darin McBeath
Consider the following left outer join potentialDailyModificationsRDD = reducedDailyPairRDD.leftOuterJoin(baselinePairRDD).partitionBy(new HashPartitioner(1024)).persist(StorageLevel.MEMORY_AND_DISK_SER()); Below are the record counts for the RDDs involved Number of records for

Re: Spark Streaming and message ordering

2015-02-19 Thread Cody Koeninger
Kafka ordering is guaranteed on a per-partition basis. The high-level consumer api as used by the spark kafka streams prior to 1.3 will consume from multiple kafka partitions, thus not giving any ordering guarantees. The experimental direct stream in 1.3 uses the simple consumer api, and there

Re: Re: Spark streaming doesn't print output when working with standalone master

2015-02-19 Thread bit1...@163.com
Thanks Akhil, you are right. I checked and found that I have only 1 core allocated to the program. I am running on a virtual machine and only allocated one processor to it (1 core per processor), so even though I have specified --total-executor-cores 3 in the submit script, the application will still

Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-19 Thread Vasu C
Hi Sasi, I am not sure about Vaadin, but by simple googling you can find many articles on how to pass JSON parameters over HTTP. http://stackoverflow.com/questions/21404252/post-request-send-json-data-java-httpurlconnection You can also try Finagle which is fully fault tolerant framework by

RE: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-19 Thread Mohammed Guller
Hi Kelvin, Yes. I am creating an uber jar with the Postgres driver included, but nevertheless tried both the --jars and --driver-class-path flags. It didn't help. Interestingly, I can't use BoneCP even in the driver program when I run my application with spark-submit. I am getting the same exception

Re: Failure on a Pipe operation

2015-02-19 Thread athing goingon
It appears that the file paths are different when running spark in local and cluster mode. When running spark without --master the paths to the pipe command are relative to the local machine. When running spark with --master the paths to the pipe command are ./ This is what finally worked. I

Re: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-19 Thread Kelvin Chu
Hi Mohammed, Did you use --jars to specify your jdbc driver when you submitted your job? Take a look of this link: http://spark.apache.org/docs/1.2.0/submitting-applications.html Hope this help! Kelvin On Thu, Feb 19, 2015 at 7:24 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – I

Spark streaming doesn't print output when working with standalone master

2015-02-19 Thread bit1...@163.com
Hi, I am trying the spark streaming log analysis reference application provided by Databricks at https://github.com/databricks/reference-apps/tree/master/logs_analyzer When I deploy the code to the standalone cluster, there is no output at all with the following shell script. Which means, the

Re: Tableau beta connector

2015-02-19 Thread Ashutosh Trivedi (MT2013030)
Thanks Todd. great stuff :) Regards, Ashu From: Todd Nist tsind...@gmail.com Sent: Thursday, February 19, 2015 7:46 PM To: Ashutosh Trivedi (MT2013030) Cc: user@spark.apache.org Subject: Re: Tableau beta connector I am able to connect by doing the following

Caching RDD

2015-02-19 Thread Kartheek.R
Hi, I have an HDFS file of size 598MB. I create an RDD over this file and cache it in RAM in a 7-node cluster with 2G RAM each. I find that each partition gets replicated three or even four times in the cluster, even without me specifying it in code. There are 5 total partitions for the RDD created but cached

Spark Performance on Yarn

2015-02-19 Thread lbierman
I'm a bit new to Spark, but had a question on performance. I suspect a lot of my issue is due to tuning and parameters. I have a Hive external table on this data, and queries against it run in minutes. The Job: + 40gb of avro events on HDFS (100 million+ avro events) + Read in the files

Re: Spark Streaming and message ordering

2015-02-19 Thread Neelesh
I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose. We talked a bit about this after his presentation: the short answer is that Spark Streaming does not guarantee any sort of ordering (within batches or across batches). One would have to use updateStateByKey to collect

Re: Learning GraphX Questions

2015-02-19 Thread Takeshi Yamamuro
Hi, Vertices are simply hash-partitioned by spark.HashPartitioner, so you can easily calculate partition ids yourself. Also, you can type the following lines to check the ids: import org.apache.spark.graphx._ graph.vertices.mapPartitionsWithIndex { (pid, iter) => val vids = Array.newBuilder[VertexId] for
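A completed version of that idea, as a sketch against an already-built Graph[VD, ED] named graph:

    import org.apache.spark.graphx._

    // For each partition, collect the vertex ids it holds.
    val placement = graph.vertices.mapPartitionsWithIndex { (pid, iter) =>
      Iterator((pid, iter.map { case (vid, _) => vid }.toArray))
    }
    placement.collect().foreach { case (pid, vids) =>
      println(s"partition $pid: ${vids.mkString(", ")}")
    }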

Re: Spark Streaming and message ordering

2015-02-19 Thread Neelesh
Even with the new direct streams in 1.3, isn't it the case that the job *scheduling* follows the partition order, rather than job *execution*? Or is it the case that the stream listens to job completion event (using a streamlistener) before scheduling the next batch? To compare with storm from a

Re: Spark 1.2.1: ClassNotFoundException when running hello world example in scala 2.11

2015-02-19 Thread Akhil Das
Can you downgrade your scala dependency to 2.10 and give it a try? Thanks Best Regards On Fri, Feb 20, 2015 at 12:40 AM, Luis Solano l...@pixable.com wrote: I'm having an issue with spark 1.2.1 and scala 2.11. I detailed the symptoms in this stackoverflow question.

Re: How to diagnose could not compute split errors and failed jobs?

2015-02-19 Thread Akhil Das
Not quite sure, but this can be the case: one of your executors is stuck in a GC pause while another one asks it for data, and hence the request times out, ending in that exception. You can try increasing the Akka frame size and ack wait timeout as follows:
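A sketch of setting both properties on the SparkConf; the values here are illustrative guesses to tune, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.akka.frameSize", "100")                     // MB
      .set("spark.core.connection.ack.wait.timeout", "600")   // seconds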

Re: Spark streaming doesn't print output when working with standalone master

2015-02-19 Thread Akhil Das
While running the program, go to your cluster's web UI (it runs on 8080, probably at hadoop.master:8080) and see how many cores are allocated to the program; it should be >= 2 for the stream to get processed. Thanks Best Regards On Fri, Feb 20, 2015 at 9:29 AM,

Re: percentil UDAF in spark 1.2.0

2015-02-19 Thread Mark Hamstra
Already fixed: https://github.com/apache/spark/pull/2802 On Thu, Feb 19, 2015 at 3:17 PM, Mohnish Kodnani mohnish.kodn...@gmail.com wrote: Hi, I am trying to use percentile and getting the following error. I am using spark 1.2.0. Does UDAF percentile exist in that code line and do i have to

Re: percentil UDAF in spark 1.2.0

2015-02-19 Thread Mark Hamstra
The percentile UDAF came in across a couple of PRs. Commit f33d55046427b8594fd19bda5fd2214eeeab1a95 reflects the most recent work, I believe. It will be part of the 1.3.0 release very soon:

stack map functions in a loop (pyspark)

2015-02-19 Thread jamborta
Hi all, I think I have run into an issue with the lazy evaluation of variables in pyspark. I have the following: functions = [func1, func2, func3] for counter in range(len(functions)): data = data.map(lambda value: [functions[counter](value)]) It looks like the counter is evaluated when

Re: RDD Partition number

2015-02-19 Thread Ted Yu
What file system are you using? If you use hdfs, the documentation you cited is pretty clear on how partitions are determined. bq. file X replicated on 4 machines I don't think replication factor plays a role w.r.t. partitions. On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli

Implicit ALS with multiple features

2015-02-19 Thread poiuytrez
Hello, I would like to use the spark MLlib recommendation filtering library. My goal will be to predict what a user would like to buy based on what he bought before. I read in the spark documentation that Spark supports implicit feedback. However, there is no example for this application.

Re: Filtering keys after map+combine

2015-02-19 Thread Daniel Siegmann
I'm not sure what your use case is, but perhaps you could use mapPartitions to reduce across the individual partitions and apply your filtering. Then you can finish with a reduceByKey. On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Before I send out the keys
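A sketch of that pattern, assuming data is an RDD[(String, Long)] and threshold is a Long. Note that filtering on per-partition partial sums can also drop keys whose global total would have passed the threshold, which is exactly the subtlety this thread is circling:

    import scala.collection.mutable

    // Combine within each partition, filter on the partial sums,
    // then finish the aggregation with reduceByKey.
    val partiallyReduced = data.mapPartitions { iter =>
      val sums = mutable.HashMap.empty[String, Long]
      iter.foreach { case (k, v) => sums(k) = sums.getOrElse(k, 0L) + v }
      sums.iterator.filter { case (_, partialSum) => partialSum >= threshold }
    }
    val totals = partiallyReduced.reduceByKey(_ + _)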

Re: Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Pavel Velikhov
On Feb 19, 2015, at 7:29 PM, Pavel Velikhov pavel.velik...@icloud.com wrote: I have a simple Spark job that goes out to Cassandra, runs a pipe and stores results: val sc = new SparkContext(conf) val rdd = sc.cassandraTable("keyspace", "table") .map(r => r.getInt("column") + "\t" +

Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Pavel Velikhov
I have a simple Spark job that goes out to Cassandra, runs a pipe and stores results: val sc = new SparkContext(conf) val rdd = sc.cassandraTable("keyspace", "table") .map(r => r.getInt("column") + "\t" + write(get_lemmas(r.getString("tags"))) .pipe("python3

Re: Unzipping large files and 2GB partition size.

2015-02-19 Thread Joe Wass
Thanks for your detailed reply Imran. I'm writing this in Clojure (using Flambo which uses the Java API) but I don't think that's relevant. So here's the pseudocode (sorry I've not written Scala for a long time): val rawData = sc.hadoopFile("/dir/to/gzfiles") // NB multiple files. val parsedFiles =

RDD Partition number

2015-02-19 Thread Alessandro Lulli
Hi All, Could you please help me understanding how Spark defines the number of partitions of the RDDs if not specified? I found the following in the documentation for file loaded from HDFS: *The textFile method also takes an optional second argument for controlling the number of partitions of

Re: Implicit ALS with multiple features

2015-02-19 Thread Sean Owen
It's shown at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html It's really not different to use. It's suitable when you have count-like data rather than rating-like data. That's what you have here. I am not sure what you mean when you say you want to add frequency too, but no, the

Re: RDD Partition number

2015-02-19 Thread Ilya Ganelin
By default you will have (file size in MB / 64) partitions. You can also set the number of partitions when you read in a file with sc.textFile as an optional second parameter. On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli lu...@di.unipi.it wrote: Hi All, Could you please help me
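For example (the path is a placeholder):

    // The second argument is a minimum; Spark may still create more
    // partitions than requested, but not fewer.
    val lines = sc.textFile("hdfs:///data/big.txt", 100)
    println(lines.partitions.length)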

RE: Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Ganelin, Ilya
When writing to hdfs Spark will not overwrite existing files or directories. You must either manually delete these or use Java's Hadoop FileSystem class to remove them. Sent with Good (www.good.com) -Original Message- From: Pavel Velikhov
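A minimal sketch of that cleanup step with the Hadoop FileSystem API, assuming an existing SparkContext sc, an RDD named rdd, and a placeholder output path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputPath = new Path("hdfs:///user/pavel/output")
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true)   // recursive delete
    }
    rdd.saveAsTextFile(outputPath.toString)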

Re: RDD Partition number

2015-02-19 Thread Ted Yu
bq. *blocks being 64MB by default in HDFS* *In hadoop 2.1+, default block size has been increased.* See https://issues.apache.org/jira/browse/HDFS-4053 Cheers On Thu, Feb 19, 2015 at 8:32 AM, Ted Yu yuzhih...@gmail.com wrote: What file system are you using ? If you use hdfs, the

RE: RDD Partition number

2015-02-19 Thread Ganelin, Ilya
As Ted Yu points out, default block size is 128MB as of Hadoop 2.1. Sent with Good (www.good.com) -Original Message- From: Ilya Ganelin [ilgan...@gmail.com] Sent: Thursday, February 19, 2015 12:13 PM Eastern Standard Time To: Alessandro Lulli;

Filtering keys after map+combine

2015-02-19 Thread Debasish Das
Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based on some threshold... Is there a way to get the key/value pairs after the map+combine stages so that I can run a filter on the keys? Thanks. Deb

Re: Unzipping large files and 2GB partition size.

2015-02-19 Thread Imran Rashid
Hi Joe, The issue is not that you have input partitions that are bigger than 2GB -- it's just that they are getting cached. You can see in the stack trace, the problem is when you try to read data out of the DiskStore: org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) Also, just

Re: SparkSQL + Tableau Connector

2015-02-19 Thread Todd Nist
Hi Silvio, I got this working today using your suggestion with the Initial SQL and a Custom Query. See here for details: http://stackoverflow.com/questions/28403664/connect-to-existing-hive-in-intellij-using-sbt-as-build/28608608#28608608 It is not ideal as I need to write a custom query, but

bulk writing to HDFS in Spark Streaming?

2015-02-19 Thread Chico Qi
Hi all, In Spark Streaming I want to use DStream.saveAsTextFiles with bulk writing, because the normal saveAsTextFiles cannot finish within the configured batch interval. Maybe a common pool for writing, or another assigned worker for bulk writing? Thanks! B/R Jichao

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
Yup, I did see that. Good point though, Cody. The mismatch was happening for me when I was trying to get the 'new JdbcRDD' approach going. Once I switched to the 'create' method things are working just fine. Was just able to refactor the 'get connection' logic into a 'DbConnection implements

Re: Unzipping large files and 2GB partition size.

2015-02-19 Thread Imran Rashid
oh, I think you are just choosing a number that is too small for your number of partitions. All of the data in /dir/to/gzfiles is going to be sucked into one RDD, with the data divided into partitions. So if you're parsing 200 files, each about 2 GB, and then repartitioning down to 100

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
I thought the combiner comes from reduceByKey and not mapPartitions, right? Let me dig deeper into the APIs. On Thu, Feb 19, 2015 at 8:29 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: I'm not sure what your use case is, but perhaps you could use mapPartitions to reduce across the individual

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
That's a good point, thanks. Is there a way to instrument continuous realtime streaming of data out of a database? If the data keeps changing, one way to implement extraction would be to keep track of something like the last-modified timestamp and instrument the query to be 'where lastmodified > ?'

Re: Some tasks taking too much time to complete in a stage

2015-02-19 Thread Imran Rashid
almost all your data is going to one task. You can see that the shuffle read for task 0 is 153.3 KB, and for most other tasks its just 26B (which is probably just some header saying there are no actual records). You need to ensure your data is more evenly distributed before this step. On Thu,

Re: Some tasks taking too much time to complete in a stage

2015-02-19 Thread Jatinpreet Singh
Hi Imran, Thanks for pointing that out. My data comes from the HBase connector of Spark. I do not govern the distribution of data myself. HBase decides to put the data on any of the region servers. Is there a way to distribute data evenly? And I am especially interested in running even small

Re: Why is RDD lookup slow?

2015-02-19 Thread Burak Yavuz
If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out. Burak On Feb 19, 2015 7:37 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Shahab - if your data structures are small enough a broadcasted Map is going to provide faster

SchemaRDD.select

2015-02-19 Thread Cesar Flores
I am trying to pass a variable number of arguments to the select function of a SchemaRDD I created, as I want to select the fields at run time: val variable_argument_list = List('field1, 'field2) val schm1 = myschemaRDD.select('field1,'field2) // works val schm2 =

Re: percentil UDAF in spark 1.2.0

2015-02-19 Thread Mohnish Kodnani
Isn't that PR about being able to pass an array to the percentile function? If I understand this error correctly, it's not able to find the percentile function itself. Also, if I am incorrect and that PR fixes it, is it available in a release? On Thu, Feb 19, 2015 at 3:27 PM, Mark Hamstra

unsubscribe

2015-02-19 Thread chaitu reddy
-- cheers, chaitu

Re: issue Running Spark Job on Yarn Cluster

2015-02-19 Thread Sachin Singh
Yes. On 19 Feb 2015 23:40, Harshvardhan Chauhan ha...@gumgum.com wrote: Is this the full stack trace ? On Wed, Feb 18, 2015 at 2:39 AM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I want to run my spark Job in Hadoop yarn Cluster mode, I am using below command - spark-submit --master

using a database connection pool to write data into an RDBMS from a Spark application

2015-02-19 Thread Mohammed Guller
Hi – I am trying to use BoneCP (a database connection pooling library) to write data from my Spark application to an RDBMS. The database inserts are inside a foreachPartition code block. I am getting this exception when the code tries to insert data using BoneCP: java.sql.SQLException: No
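For reference, a sketch of the per-partition write pattern using plain JDBC rather than BoneCP, just to show where the connection must be created (on the executor, inside foreachPartition). The URL, credentials, SQL, and the RDD[(Long, String)] named rdd are placeholders:

    import java.sql.DriverManager

    rdd.foreachPartition { rows =>
      // One connection per partition, opened on the executor.
      val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO events(id, payload) VALUES (?, ?)")
      try {
        rows.foreach { case (id, payload) =>
          stmt.setLong(1, id)
          stmt.setString(2, payload)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }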

Re: Incorrect number of records after left outer join (I think)

2015-02-19 Thread Imran Rashid
if you have duplicate values for a key, join creates all pairs. E.g. if you have 2 values for key X in RDD A and 2 values for key X in RDD B, then A.join(B) will have 4 records for key X. On Thu, Feb 19, 2015 at 3:39 PM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: Consider the following left outer
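A tiny illustration of that:

    val a = sc.parallelize(Seq(("x", 1), ("x", 2)))
    val b = sc.parallelize(Seq(("x", "p"), ("x", "q")))
    a.join(b).collect().foreach(println)
    // output (order may vary): (x,(1,p)), (x,(1,q)), (x,(2,p)), (x,(2,q))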

percentil UDAF in spark 1.2.0

2015-02-19 Thread Mohnish Kodnani
Hi, I am trying to use percentile and getting the following error. I am using spark 1.2.0. Does UDAF percentile exist in that code line and do i have to do something to get this to work. java.util.NoSuchElementException: key not found: percentile at

How to diagnose could not compute split errors and failed jobs?

2015-02-19 Thread Tim Smith
My streaming app runs fine for a few hours and then starts spewing Could not compute split, block input-xx-xxx not found errors. After this, jobs start to fail and batches start to pile up. My question isn't so much about why this error but rather, how do I trace what leads to this error? I

Re: Filter data from one RDD based on data from another RDD

2015-02-19 Thread Imran Rashid
the more scalable alternative is to do a join (or a variant like cogroup, leftOuterJoin, subtractByKey etc. found in PairRDDFunctions) the downside is this requires a shuffle of both your RDDs On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have two RDD's
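A sketch of the subtractByKey variant, assuming rddA and rddB are RDD[String] of the csv lines shown above and the first field is the key; this keeps only the rows of rddA whose key does not appear in rddB:

    // Key each line by its first CSV field, then subtract by key (shuffles both sides).
    val keyedA = rddA.map(line => (line.split(",")(0), line))
    val keyedB = rddB.map(line => (line.split(",")(0), line))
    val onlyInA = keyedA.subtractByKey(keyedB).values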

Failure on a Pipe operation

2015-02-19 Thread athing goingon
Hi, I'm trying to figure out why the following job is failing on a pipe http://pastebin.com/raw.php?i=U5E8YiNN With this exception: http://pastebin.com/raw.php?i=07NTGyPP Any help is welcome. Thank you.

Re: Failure on a Pipe operation

2015-02-19 Thread Imran Rashid
The error message is telling you the exact problem: it can't find ProgramSIM, the thing you are trying to run. Lost task 3520.3 in stage 0.0 (TID 11, compute3.research.dev): java.io.IOException: Cannot run program ProgramSIM: error=2, No such file or directory On Thu, Feb 19, 2015 at 5:52 PM,

Re: Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Ilya Ganelin
The stupid question is whether you're deleting the file from hdfs on the right node? On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov pavel.velik...@gmail.com wrote: Yeah, I do manually delete the files, but it still fails with this error. On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya
