Percentile Calculation

2015-01-28 Thread kundan kumar
Is there any built-in function for calculating percentiles over a dataset? I want to calculate the percentiles for each column in my data. Regards, Kundan

Re: Percentile Calculation

2015-01-28 Thread Kohler, Curt E (ELS-STL)
When I looked at this last fall, the only way that seemed to be available was to transform my data into SchemaRDDs, register them as tables, and then use the Hive support to calculate them with its built-in percentile UDFs that were added in 1.2. Curt From:
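A minimal sketch of that approach (Spark 1.2-era API; the table, column names, and path below are made up for illustration, and percentile_approx is Hive's approximate percentile UDAF):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch: load numeric data, register it as a table, and let Hive's built-in
// percentile UDAF do the work. Paths and names are illustrative.
case class Measurement(value: Double)

val sc = new SparkContext(new SparkConf().setAppName("percentiles"))
val hiveContext = new HiveContext(sc)
import hiveContext.createSchemaRDD // implicit RDD[Product] -> SchemaRDD conversion

val data = sc.textFile("hdfs:///data/values.txt").map(line => Measurement(line.toDouble))
data.registerTempTable("measurements")

// percentile_approx handles double columns; plain percentile expects integral types.
val quartiles = hiveContext.sql(
  "SELECT percentile_approx(value, array(0.25, 0.5, 0.75)) FROM measurements")
quartiles.collect().foreach(println)
```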

Re: HW imbalance

2015-01-28 Thread simon elliston ball
You shouldn’t have any issues with differing nodes on the latest Ambari and Hortonworks. It works fine for mixed hardware and Spark on YARN. Simon On Jan 26, 2015, at 4:34 PM, Michael Segel msegel_had...@hotmail.com wrote: If you’re running YARN, then you should be able to mix and match

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Emre Sevinc
This is what I get: ./bigcontent-1.0-SNAPSHOT.jar:org/apache/http/impl/conn/SchemeRegistryFactory.class (probably because I'm using a self-contained JAR). In other words, I'm still stuck. -- Emre On Wed, Jan 28, 2015 at 2:47 PM, Charles Feduke charles.fed...@gmail.com wrote: I deal with

Re: Running a task over a single input

2015-01-28 Thread Sean Owen
Processing one object isn't a distributed operation, and doesn't really involve Spark. Just invoke your function on your object in the driver; there's no magic at all to that. You can make an RDD of one object and invoke a distributed Spark operation on it, but assuming you mean you have it on
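A tiny sketch of the two options described here (the function and input value are placeholders, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder function standing in for the real computation.
def process(s: String): Int = s.length

val sc = new SparkContext(new SparkConf().setAppName("single-input"))

// Option 1: the object is already in the driver, so just call the function.
val localResult = process("some input")

// Option 2: wrap the single object in an RDD and run a (trivially) distributed job.
val distributedResult = sc.parallelize(Seq("some input")).map(process).first()
```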

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
I deal with problems like this so often across Java applications with large dependency trees. Add the shell function at the following link to your shell on the machine where your Spark Streaming is installed: https://gist.github.com/cfeduke/fe63b12ab07f87e76b38 Then run in the directory where

Re: Running a task over a single input

2015-01-28 Thread Matan Safriel
Thanks! So I assume I can safely run a function *F* of mine within the Spark driver program, without dispatching it to the cluster (?), thereby sticking to one piece of code for *both* a real cluster run over big data and for small on-demand runs over a single input (now and then), both scenarios

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking at the ShuffledRDD code and it looks like there is a method setKeyOrdering() - is this guaranteed to order everything in the partition? I'm on Spark 1.2.0. On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet cjno...@gmail.com wrote: In all of the solutions I've found thus far, sorting has been

RE: spark 1.2 ec2 launch script hang

2015-01-28 Thread ey-chih chow
We found the problem and already fixed it. Basically, spark-ec2 requires EC2 instances to have external IP addresses. You need to specify this in the AWS console. From: nicholas.cham...@gmail.com Date: Tue, 27 Jan 2015 17:19:21 + Subject: Re: spark 1.2 ec2 launch script hang To:

How to unregister/re-register a TempTable in Spark?

2015-01-28 Thread shahab
Hi, I just wonder if there is any way to unregister/re-register a TempTable in Spark? best, /Shahab

ETL process design

2015-01-28 Thread Danny Yates
Hi, My apologies for what has ended up as quite a long email with a lot of open-ended questions, but, as you can see, I'm really struggling to get started and would appreciate some guidance from people with more experience. I'm new to Spark and big data in general, and I'm struggling with what I

Re: spark-submit conflicts with dependencies

2015-01-28 Thread Sean Owen
Normally, if this were all in one app, Maven would have solved the problem for you by choosing 1.8 over 1.6. You do not need to exclude anything; Maven does it for you. Here the problem is that 1.8 is in the app but the server (Spark) uses 1.6. This is what the userClassPathFirst setting is for,
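For reference, a sketch of how that setting can be applied (the property name is the Spark 1.2-era experimental one; treat it as an assumption and check the docs for your version):

```scala
import org.apache.spark.SparkConf

// Ask executors to prefer classes from the user jar over Spark's own copies,
// so the 1.8 version shipped with the app wins over the 1.6 used by Spark.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.files.userClassPathFirst", "true")
```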

Conflict between elasticsearch-spark and elasticsearch-hadoop jars

2015-01-28 Thread aarthi
Hi, we have a Maven project which supports running both Spark jobs and Pig jobs, but I can use only one of the elasticsearch-hadoop or elasticsearch-spark jars at a time. If I use both jars together, I get a conflict in org.elasticsearch.hadoop.cfg.SettingsManager, which is present as a class in

Re: performance of saveAsTextFile moving files from _temporary

2015-01-28 Thread Aaron Davidson
Upon completion of the 2 hour part of the run, the files did not exist in the output directory? One thing that is done serially is deleting any remaining files from _temporary, so perhaps there was a lot of data remaining in _temporary but the committed data had already been moved. I am,

Issue with SparkContext in cluster

2015-01-28 Thread Marco
I've created a Spark app, which runs fine if I copy the corresponding jar to the Hadoop server (where YARN is running) and submit it there. If I try to submit it from my local machine, I get the error which I've attached below. Submit cmd: spark-submit.cmd --class

Re: performance of saveAsTextFile moving files from _temporary

2015-01-28 Thread Thomas Demoor
TL;DR: Extend FileOutputCommitter to eliminate the _temporary storage. There are some implementations to be found online, typically called DirectOutputCommitter, for instance this Spark pull request: https://github.com/themodernlife/spark/commit/4359664b1d557d55b0579023df809542386d5b8c. Tell Spark to use your
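A rough sketch of the shape such a committer takes with the old mapred API (this is an assumption of the general pattern, not the code from the linked pull request; the config line shows how an old-API committer is usually selected):

```scala
import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

// Every task writes directly to the final output location, so there is nothing
// to move from _temporary at commit time.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}

// Telling the old Hadoop API (used by saveAsTextFile) to use it:
// sc.hadoopConfiguration.set("mapred.output.committer.class",
//   classOf[DirectOutputCommitter].getName)
```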

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Rok Roskar
hi, thanks for the quick answer -- I suppose this is possible, though I don't understand how it could come about. The largest individual RDD elements are ~ 1 Mb in size (most are smaller) and the RDD is composed of 800k of them. The file is saved in 134 parts, but is being read in using some 1916+

Kryo buffer overflows

2015-01-28 Thread Tristan Blakers
A search shows several historical threads for similar Kryo issues, but none seem to have a definitive solution. Currently using Spark 1.2.0. While collecting/broadcasting/grouping moderately sized data sets (~500MB - 1GB), I regularly see exceptions such as the one below. I’ve tried increasing
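For what it's worth, a sketch of the buffer-related settings usually tuned for this (property names are the Spark 1.2-era ones; the values are illustrative, not a recommendation):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-tuning")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")       // initial per-task buffer size
  .set("spark.kryoserializer.buffer.max.mb", "512")  // ceiling before a buffer overflow
```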

Running a task over a single input

2015-01-28 Thread Matan Safriel
Hi, How would I run a given function in Spark, over a single input object? Would I first add the input to the file system, then somehow invoke the Spark function on just that input? or should I rather twist the Spark streaming api for it? Assume I'd like to run a piece of computation that

Re: Conflict between elasticsearch-spark and elasticsearch-hadoop jars

2015-01-28 Thread Costin Leau
That indicates that you are using two different versions of es-hadoop (2.0.x) and es-spark (2.1.x) Have you considered aligning the two versions? On 1/28/15 11:08 AM, aarthi wrote: Hi We have a maven project which supports running of spark jobs and pig jobs. But I could use only either one of

Re: Issues with constants in Spark HiveQL queries

2015-01-28 Thread Pala M Muthaia
By typo I meant that the column name had a spelling error: conversion_aciton_id. It should have been conversion_action_id. No, we tried it a few times, and we didn't have + signs or anything like that - we tried it with columns of different types too - string, double, etc. - and saw the same error.

Re: Got java.lang.SecurityException: class javax.servlet.FilterRegistration's when running job from intellij Idea

2015-01-28 Thread Marco
I've switched to Maven and all issues are gone now. 2015-01-23 12:07 GMT+01:00 Sean Owen so...@cloudera.com: Use mvn dependency:tree or sbt dependency-tree to print all of the dependencies. You are probably bringing in more servlet API libs from some other source? On Fri, Jan 23, 2015 at

Re: ETL process design

2015-01-28 Thread Stadin, Benjamin
Hi Danny, What you describe sounds like you might also consider using Spring XD instead, at least for the file-centric stuff. Regards, Ben. Sent from my iPad. On 28.01.2015 at 10:42, Danny Yates da...@codeaholics.org wrote: Hi, My apologies for what has ended up as quite a long

Re: Issue with SparkContext in cluster

2015-01-28 Thread Shixiong Zhu
It's because you submitted the job from Windows to a Hadoop cluster running on Linux. Spark does not support this yet. See https://issues.apache.org/jira/browse/SPARK-1825 Best Regards, Shixiong Zhu 2015-01-28 17:35 GMT+08:00 Marco marco@gmail.com: I've created a spark app, which runs fine if

Re: Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-01-28 Thread siddardha
Then your Spark is not built for YARN. Try building with sbt/sbt -Dhadoop.version=2.3.0 -Pyarn assembly -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-java-lang-IllegalArgumentException-Invalid-rule-tp21382p21404.html Sent from the Apache

Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/sc

2015-01-28 Thread Emre Sevinc
Hello, I'm using *Spark 1.1.0* and *Solr 4.10.3*. I'm getting an exception when using *HttpSolrServer* from within Spark Streaming: 15/01/28 13:42:52 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NoSuchMethodError:

RDD caching, memory network input

2015-01-28 Thread Andrianasolo Fanilo
Hello Spark fellows :), I think I need some help understanding how .cache and task input work within a job. I have a 7 GB input matrix in HDFS that I load using .textFile(). I also have a config file which contains an array of 12 logistic regression model parameters, loaded as an

Re: Spark and S3 server side encryption

2015-01-28 Thread Kohler, Curt E (ELS-STL)
So, following up on your suggestion, I'm still having some problems getting the configuration changes recognized when my job runs. I've added jets3t.properties to the root of my application jar file that I submit to Spark (via spark-submit). I've verified that my jets3t.properties is at the

Re: Running a task over a single input

2015-01-28 Thread Sean Owen
On Wed, Jan 28, 2015 at 1:44 PM, Matan Safriel dev.ma...@gmail.com wrote: So I assume I can safely run a function F of mine within the spark driver program, without dispatching it to the cluster (?), thereby sticking to one piece of code for both a real cluster run over big data, and for small

Snappy Crash

2015-01-28 Thread Sven Krasser
I'm running into a new issue with Snappy causing a crash (using Spark 1.2.0). Did anyone see this before? -Sven 2015-01-28 16:09:35,448 WARN [shuffle-server-1] storage.MemoryStore (Logging.scala:logWarning(71)) - Failed to reserve initial memory threshold of 1024.0 KB for computing block

Re: spark-shell working in scala-2.11

2015-01-28 Thread Krishna Sankar
Stephen, Scala 2.11 worked fine for me. I made the dev change and then compiled. I'm not using it in production, but I go back and forth between 2.10 and 2.11. Cheers, k/ On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Hey, I recently compiled Spark master against

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Hmm, I can’t see why using ~ would be problematic, especially if you confirm that echo ~/path/to/pem expands to the correct path to your identity file. If you have a simple reproduction of the problem, please send it over. I’d love to look into this. When I pass paths with ~ to spark-ec2 on my

RE: Spark on Windows 2008 R2 server does not work

2015-01-28 Thread Wang, Ningjun (LNG-NPV)
Has anybody successfully installed and run spark-1.2.0 on Windows 2008 R2 or Windows 7? How did you get that to work? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent:

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
Yeah, I agree ~ should work. And it could have been [read: probably was] the fact that one of the EC2 hosts was in my known_hosts (don't know, never saw an error message, but the behavior is no error message for that state), which I had fixed later with Pete's patch. But the second execution when

spark-shell working in scala-2.11

2015-01-28 Thread Stephen Haberman
Hey, I recently compiled Spark master against scala-2.11 (by running the dev/change-versions script), but when I run spark-shell, it looks like the sc variable is missing. Is this a known/unknown issue? Are others successfully using Spark with scala-2.11, and specifically spark-shell? It is

Parquet divide by zero

2015-01-28 Thread Jim Carroll
Hello all, I've been hitting a divide-by-zero error in Parquet through Spark, detailed (and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102 Is anyone else hitting this error? I hit it frequently. It looks like the Parquet team is preparing to release 1.6.0 and, since they

Re: Parquet divide by zero

2015-01-28 Thread Sean Owen
It looks like it's just a problem with the log message? Is it actually causing a problem in Parquet / Spark? But yeah, it seems like an easy fix. On Wed, Jan 28, 2015 at 9:28 PM, Jim Carroll jimfcarr...@gmail.com wrote: Hello all, I've been hitting a divide-by-zero error in Parquet through Spark

Re: Parquet divide by zero

2015-01-28 Thread Sean Owen
Answered my own questions seconds later: these aren't doubles, so you don't get NaN, you get an Exception. Right. On Wed, Jan 28, 2015 at 9:35 PM, Sean Owen so...@cloudera.com wrote: It looks like it's just a problem with the log message? is it actually causing a problem in Parquet / Spark? but

Re: Data Locality

2015-01-28 Thread hnahak
I have written a custom InputSplit and I want to pin it to the specific node where my data is stored, but currently the split can start on any node and pick data from a different node in the cluster. Any suggestion how to set the host in Spark? -- View this message in context:

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Tobias Pfeiffer
Hi, On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote: My thinking is to maintain state in an RDD and update it and persist it with each 2-second pass, but this also seems like it could get messy. Any thoughts or examples that might help me? I have just implemented some

is there a master for spark cluster in ec2

2015-01-28 Thread Mohit Singh
Hi, probably a naive question, but I am creating a Spark cluster on EC2 using the ec2 scripts in there. Is there a master param I need to set: ./bin/pyspark --master [ ]? I don't yet fully understand the EC2 concepts so just wanted to confirm this. Thanks -- Mohit When you want

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Charles Feduke
It was only hanging when I specified the path with ~; I never tried a relative path. It hangs on waiting for SSH to be ready on all hosts. I let it sit for about 10 minutes, then I found the StackOverflow answer that suggested specifying an absolute path, cancelled, and re-ran with --resume and the

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Imran Rashid
I'm not an expert on streaming, but I think you can't do anything like this right now. It seems like a very sensible use case, though, so I've created a jira for it: https://issues.apache.org/jira/browse/SPARK-5467 On Wed, Jan 28, 2015 at 8:54 AM, YaoPau jonrgr...@gmail.com wrote: The

Hive on Spark vs. SparkSQL using Hive ?

2015-01-28 Thread ogoh
Hello, this question was probably already asked but I'd still like to confirm with Spark users. The following blog shows 'Hive on Spark': http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/. How is it different from using Hive as the data storage of SparkSQL

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-28 Thread Cheng Lian
Hey Yana, An update about this Parquet filter push-down issue. It turned out to be a bit complicated, but (hopefully) all clear now. 1. Yesterday I found a bug in Parquet, which essentially disables row group filtering for almost all AND predicates. * JIRA ticket: PARQUET-173
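For anyone following along, a sketch of how the push-down flag is toggled (assuming the Spark 1.2-era setting name; it is off by default, partly because of the bug above):

```scala
import org.apache.spark.sql.SQLContext

// Turn Parquet predicate push-down on for this SQLContext.
def enableParquetPushdown(sqlContext: SQLContext): Unit = {
  sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
}
```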

Dependency unresolved hadoop-yarn-common 1.0.4 when running quickstart example

2015-01-28 Thread sarwar.bhuiyan
Hello all, I'm trying to build the sample application on the spark 1.2.0 quickstart page (https://spark.apache.org/docs/latest/quick-start.html) using the following build.sbt file: name := "Simple Project" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %%

Set is not parseable as row field in SparkSql

2015-01-28 Thread Jorge Lopez-Malla
Hello, we are trying to insert a case class into Parquet using SparkSQL. When I create the SchemaRDD, which includes a Set, I get the following exception: sqc.createSchemaRDD(r) scala.MatchError: Set[(scala.Int, scala.Int)] (of class scala.reflect.internal.Types$TypeRef$$anon$1) at

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
It looks like you're shading in the Apache HTTP commons library and it's a different version than what is expected. (Maybe 4.6.x based on the Javadoc.) I see you are attempting to exclude commons-httpclient by using: <exclusion> <groupId>commons-httpclient</groupId>

unsubscribe

2015-01-28 Thread Abhi Basu
-- Abhi Basu

Re: Data Locality

2015-01-28 Thread Harihar Nahak
Hi guys, I have a similar question and doubt. How does Spark create an executor on the same node where the data block is stored? Does it first take information from the HDFS NameNode, get the block information, and then place an executor on the same node if the spark-worker daemon is installed there? -

Appending to an hdfs file

2015-01-28 Thread Matan Safriel
Hi, Is it possible to append to an existing (hdfs) file, through some Spark action? Should there be any reason not to use a hadoop append api within a Spark job? Thanks, Matan

Re: Appending to an hdfs file

2015-01-28 Thread Sean Owen
You can call any API you like in a Spark job, as long as the libraries are available, and Hadoop HDFS APIs will be available from the cluster. You could write a foreachPartition() that appends partitions of data to files, yes. Spark itself does not use appending. I think the biggest reason is
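A sketch along those lines, assuming HDFS with append support enabled and illustrative paths; it uses mapPartitionsWithIndex so each partition appends to its own file, since HDFS allows only one writer per file at a time:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

def appendPartitions(rdd: RDD[String], dir: String): Unit = {
  rdd.mapPartitionsWithIndex { (idx, lines) =>
    val fs = FileSystem.get(URI.create(dir), new Configuration())
    val path = new Path(dir, f"part-$idx%05d")
    // Append if the file already exists, otherwise create it.
    val out = if (fs.exists(path)) fs.append(path) else fs.create(path)
    try {
      lines.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
    } finally {
      out.close()
    }
    Iterator.single(())
  }.count() // force the side-effecting partitions to run
}
```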

Re: unsubscribe

2015-01-28 Thread Ted Yu
send an email to user-unsubscr...@spark.apache.org Cheers On Wed, Jan 28, 2015 at 2:16 PM, Abhi Basu 9000r...@gmail.com wrote: -- Abhi Basu

Re: Spark on Windows 2008 R2 server does not work

2015-01-28 Thread Marcelo Vanzin
https://issues.apache.org/jira/browse/SPARK-2356 Take a look through the comments, there are some workarounds listed there. On Wed, Jan 28, 2015 at 1:40 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Has anybody successfully install and run spark-1.2.0 on windows 2008 R2 or

Re: Parquet divide by zero

2015-01-28 Thread Lukas Nalezenec
Hi Jim, I am sorry, I know about your patch and I will commit it ASAP. Lukas Nalezenec On 28.1.2015 22:28, Jim Carroll wrote: Hello all, I've been hitting a divide-by-zero error in Parquet through Spark, detailed (and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102 Is

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I think this repartitionAndSortWithinPartitions() method may be what I'm looking for in [1]. At least it sounds like it is. Will this method allow me to deal with sorted partitions even when the partition doesn't fit into memory? [1]
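A small usage sketch of that method (Spark 1.2; the data is illustrative). The ordering comes from the implicit key Ordering, and the sort happens as part of the shuffle rather than by materializing the whole partition yourself:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings in the pair-RDD / ordered-RDD functions

val sc = new SparkContext(new SparkConf().setAppName("sorted-partitions"))
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

// Repartition into 2 partitions; within each partition, records are sorted by key.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
sorted.mapPartitions(it => Iterator(it.toList)).collect().foreach(println)
```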

Data are partial to a specific partition after sort

2015-01-28 Thread 瀬川 卓也
For example, consider a word count over long text data (on the order of 100 GB). The word distribution is clearly biased and expected to follow a long tail; the most frequent word probably accounts for more than 1/10 of all occurrences. Word count code: ``` val allWordLineSplited: RDD[String] = // create

StackOverflowError with SchemaRDD

2015-01-28 Thread ankits
Hi, I am getting a stack overflow error when querying a schemardd comprised of parquet files. This is (part of) the stack trace: Caused by: java.lang.StackOverflowError at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Tathagata Das
Ohhh nice! It would be great if you can share some code with us soon. It is indeed a very complicated problem and there is probably no single solution that fits all usecases. So having one way of doing things would be a great reference. Looking forward to that! On Wed, Jan 28, 2015 at 4:52 PM, Tobias

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Thanks for sending this over, Peter. What if you try this? (i.e. Remove the = after --identity-file.) ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file ~/.pzkeys/spark-streaming-kp.pem --region=us-east-1 login pz-spark-cluster If that works, then I think the problem in this case is

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
If that was indeed the problem, I suggest updating your answer on SO http://stackoverflow.com/a/28005151/877069 to help others who may run into this same problem. ​ On Wed Jan 28 2015 at 9:40:39 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for sending this over, Peter. What

Re: data locality in logs

2015-01-28 Thread hnahak
Hi, how do I set a preferred location for an InputSplit in Spark standalone? I have data on a specific machine and I want to read it using splits created for that node only, by assigning some property which helps Spark create the split on that node only. -- View this message in
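For a custom Hadoop InputSplit, Spark takes its placement preferences from the split's getLocations(). For plain collections there is also SparkContext.makeRDD, which accepts a preferred host list per element; a sketch (hostnames are illustrative, and the preference is advisory, not a guarantee):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("preferred-locations"))

// Each element carries the list of hosts where its tasks should preferably run.
val rdd = sc.makeRDD(Seq(
  ("record-on-node-1", Seq("worker-node-1")),
  ("record-on-node-2", Seq("worker-node-2"))
))
rdd.map(_.toUpperCase).collect()
```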

Re: Error reporting/collecting for users

2015-01-28 Thread Tathagata Das
You could use foreachRDD to do the operations and then, inside the foreach, create an accumulator to gather all the errors together: dstream.foreachRDD { rdd => val accumulator = new Accumulator[] rdd.map { . }.count // whatever operation that is error prone // gather all errors
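A fleshed-out sketch of that pattern (the record type, parse function, and reporting step are placeholders, not from the thread):

```scala
import org.apache.spark.streaming.dstream.DStream

def runWithErrorCollection(dstream: DStream[String], parse: String => Int): Unit = {
  dstream.foreachRDD { rdd =>
    // One accumulator per batch interval, so errors are reported per batch.
    val errors = rdd.sparkContext.accumulator(0)
    rdd.map { record =>
      try {
        parse(record) // whatever operation is error-prone
      } catch {
        case e: Exception =>
          errors += 1 // count the failure instead of failing the whole batch
          -1
      }
    }.count() // force the transformation to actually run
    println(s"errors in this batch: ${errors.value}")
  }
}
```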

Re: spark sqlContext udaf

2015-01-28 Thread sunwei
Thanks very much. It seems that I have to use HiveContext at present. On Jan 28, 2015, at 11:34 AM, Kuldeep Bora kuldeep.b...@gmail.com wrote: UDAF is a WIP, at least from an API user's perspective, as there is no public API to my knowledge. https://issues.apache.org/jira/browse/SPARK-3947 Thanks On

RE: unsubscribe

2015-01-28 Thread Bob Tiernay
Cheers Date: Wed, 28 Jan 2015 14:18:49 -0800 Subject: Re: unsubscribe From: yuzhih...@gmail.com To: 9000r...@gmail.com CC: user@spark.apache.org send an email to user-unsubscr...@spark.apache.org Cheers On Wed, Jan 28, 2015 at 2:16 PM, Abhi Basu 9000r...@gmail.com wrote: -- Abhi Basu

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Peter Zybrick
Below is trace from trying to access with ~/path. I also did the echo as per Nick (see the last line), looks ok to me. This is my development box with Spark 1.2.0 running CentOS 6.5, Python 2.6.6 [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ ec2/spark-ec2 --key-pair=spark-streaming-kp

RE: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Shao, Saisai
That's definitely a good supplement to the current Spark Streaming, I've heard many guys want to process the data using log time. Looking forward to the code. Thanks Jerry -Original Message- From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Thursday, January 29, 2015 10:33

Re: Hive on Spark vs. SparkSQL using Hive ?

2015-01-28 Thread Arush Kharbanda
Spark SQL on Hive: 1. The purpose of Spark SQL is to allow Spark users to selectively use SQL expressions (with not a huge number of functions currently supported) when writing Spark jobs. 2. Already available. Hive on Spark: 1. Spark users will automatically get the whole set of Hive’s rich

Re: RDD caching, memory network input

2015-01-28 Thread Sandy Ryza
Hi Fanilo, How many cores are you using per executor? Are you aware that you can combat the "container is running beyond physical memory limits" error by bumping the spark.yarn.executor.memoryOverhead property? Also, are you caching the parsed version or the text? -Sandy On Wed, Jan 28, 2015 at
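For reference, a sketch of bumping that property (the value is in MB and purely illustrative; it must be set before the SparkContext is created):

```scala
import org.apache.spark.SparkConf

// Extra container memory reserved beyond the executor heap for off-heap usage.
val conf = new SparkConf()
  .setAppName("yarn-memory-overhead")
  .set("spark.yarn.executor.memoryOverhead", "1024")
```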

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Emre Sevinc
When I examine the dependencies again, I see that the SolrJ library is using v. 4.3.1 of org.apache.httpcomponents:httpclient: [INFO] +- org.apache.solr:solr-solrj:jar:4.10.3:compile [INFO] | +- org.apache.httpcomponents:httpclient:jar:4.3.1:compile == [INFO] | +-

RE: RDD caching, memory network input

2015-01-28 Thread Andrianasolo Fanilo
Each machine has 24 cores, but I assume each executor on a machine is assigned at most one core because I set the --executor-cores property to 1. I'm going to try a higher memoryOverhead later; I'll post the results. I'm caching the parsed version, something like val matrix =

Re: Spark and S3 server side encryption

2015-01-28 Thread Charles Feduke
I have been trying to work around a similar problem with my Typesafe config *.conf files seemingly not appearing on the executors. (Though now that I think about it, it's not because the files are absent in the JAR, but because the -Dconf.resource system property I pass to the master obviously

Re: Exception when using HttpSolrServer (httpclient) from within Spark Streaming: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/con

2015-01-28 Thread Charles Feduke
Yeah, it sounds like your original exclusion of commons-httpclient from hadoop-* was correct, but it's still coming in from somewhere. Can you try something like this? <dependency> <artifactId>commons-http</artifactId> <groupId>httpclient</groupId> <scope>provided</scope> </dependency> ref:

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-28 Thread Guru Medasani
Hi Antony, Did you get past this error by repartitioning your job with smaller tasks, as Sven Krasser pointed out? From: Antony Mayi antonym...@yahoo.com Reply-To: Antony Mayi antonym...@yahoo.com Date: Tuesday, January 27, 2015 at 5:24 PM To: Guru Medasani gdm...@outlook.com, Sven Krasser

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Akhil Das
I'm not quite sure if I understood it correctly, but can you not create a key from the timestamps and do the reduceByKeyAndWindow over it? Thanks Best Regards On Wed, Jan 28, 2015 at 10:24 PM, YaoPau jonrgr...@gmail.com wrote: The TwitterPopularTags example works great: the Twitter firehose
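A sketch of that idea, bucketing each record by the timestamp parsed from the log line itself (extractTimestamp and extractKey are placeholder parsing functions, not from the thread):

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations
import org.apache.spark.streaming.dstream.DStream

def countsByLogTime(lines: DStream[String],
                    extractTimestamp: String => Long,
                    extractKey: String => String): DStream[((Long, String), Int)] = {
  lines
    .map { line =>
      val minuteBucket = extractTimestamp(line) / 60000 // bucket by log-time minute
      ((minuteBucket, extractKey(line)), 1)
    }
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(2))
}
```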

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Davies Liu
HadoopRDD will try to split the file into 64 MB partitions, so you got 1916+ partitions (assuming 100 KB per row, they are 80 GB in size). I think there is a very small chance that one object or one batch will be bigger than 2 GB. Maybe there is a bug when it splits the pickled file; could you create a

Re: Set is not parseable as row field in SparkSql

2015-01-28 Thread Cheng Lian
Hey Jorge, this is expected, because there isn't an obvious mapping from Set[T] to any SQL type. Currently we have complex types like array, map, and struct, which are inherited from Hive. In your case, I'd transform the Set[T] into a Seq[T] first; then Spark SQL can map it to an
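A sketch of that workaround (the case classes and path are illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

case class WithSet(id: Int, pairs: Set[(Int, Int)])
case class WithSeq(id: Int, pairs: Seq[(Int, Int)]) // Seq maps to a SQL array type

def saveAsParquet(sqlContext: SQLContext, rdd: RDD[WithSet]): Unit = {
  import sqlContext.createSchemaRDD // implicit RDD[Product] -> SchemaRDD conversion
  // Convert the Set field to a Seq so schema inference succeeds.
  val converted = rdd.map(r => WithSeq(r.id, r.pairs.toSeq))
  converted.saveAsParquetFile("/tmp/with_seq.parquet")
}
```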

reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread YaoPau
The TwitterPopularTags example works great: the Twitter firehose keeps messages pretty well in order by timestamp, and so to get the most popular hashtags over the last 60 seconds, reduceByKeyAndWindow works well. My stream pulls Apache weblogs from Kafka, and so it's not as simple: messages can

Re: MappedRDD signature

2015-01-28 Thread Sanjay Subramanian
Thanks Sean, that works, and I started the join of this mappedRDD to another one I have. I have to internalize the use of map versus flatMap. Thinking in MapReduce Java Hadoop code often blinds me :-) From: Sean Owen so...@cloudera.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: