Does spark *always* fork its workers?

2015-02-18 Thread Kevin Burton
I want to map over a Cassandra table in Spark but my code that executes needs a shutdown() call to return any threads, release file handles, etc. Will Spark always execute my mappers as a forked process? And if so, how do I handle threads preventing the JVM from terminating? It would be nice if

Re: Spark Streaming output cannot be used as input?

2015-02-18 Thread Emre Sevinc
Hello Jose, We hit the same issue a couple of months ago. It is possible to write directly to files instead of creating directories, but it is not straightforward, and I haven't seen any clear demonstration in books, tutorials, etc. We do something like: SparkConf sparkConf = new
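One way to get plain files instead of the usual timestamped part-* directories is to write each batch yourself from foreachRDD through the Hadoop FileSystem API. A minimal Scala sketch of that idea (the output path, file naming and the collect() are illustrative assumptions, not the poster's actual code; stream is an assumed DStream[String]):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    stream.foreachRDD { (rdd, time) =>
      val lines = rdd.collect()                       // assumes small batches
      if (lines.nonEmpty) {
        val fs = FileSystem.get(new Configuration())
        val out = fs.create(new Path(s"/output/batch-${time.milliseconds}.txt"))
        try lines.foreach(l => out.writeBytes(l + "\n"))
        finally out.close()
      }
    }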

Cannot access Spark web UI

2015-02-18 Thread Mukesh Jha
Hello Experts, I am running a spark-streaming app inside YARN. I have Spark History server running as well (Do we need it running to access UI?). The app is running fine as expected but the Spark's web UI is not accessible. When I try to access the ApplicationMaster of the Yarn application I

Re: Cannot access Spark web UI

2015-02-18 Thread Akhil Das
The error says Cannot assign requested address. This means that you need to use the correct address for one of your network interfaces or 0.0.0.0 to accept connections from all interfaces. Can you paste your spark-env.sh file and /etc/hosts file? Thanks Best Regards On Wed, Feb 18, 2015 at 2:06

RE: spark-core in a servlet

2015-02-18 Thread Anton Brazhnyk
Check for the dependencies. Looks like you have a conflict around servlet-api jars. Maven's dependency-tree, some exclusions and some luck :) could help. From: Ralph Bergmann | the4thFloor.eu [ra...@the4thfloor.eu] Sent: Tuesday, February 17, 2015 4:14 PM

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-18 Thread Emre Sevinc
Hello Imran, (a) I know that all 20 files are processed when I use foreachRDD, because I can see the processed files in the output directory. (My application logic writes them to an output directory after they are processed, *but* that writing operation does not happen in foreachRDD, below you

Re: Re: Problem with 1 master + 2 slaves cluster

2015-02-18 Thread Emre Sevinc
On Wed, Feb 18, 2015 at 10:23 AM, bit1...@163.com bit1...@163.com wrote: Sure, thanks Akhil. A further question : Is local file system(file:///) not supported in standalone cluster? FYI: I'm able to write to local file system (via HDFS API and using file:/// notation) when using Spark. --

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-18 Thread Emre Sevinc
Thanks to everyone for suggestions and explanations. Currently I've started to experiment with the following scenario, that seems to work for me: - Put the properties file on a web server so that it is centrally available - Pass it to the Spark driver program via --conf 'propertiesFile=http:
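A rough sketch of the driver-side half of that scenario: read the URL passed on the command line and load the application settings from it. The conf key name ("spark.propertiesFile") and the property name are assumptions for illustration:

    import java.net.URL
    import java.util.Properties

    // the URL was passed via --conf on spark-submit; the key name is assumed
    val url = sc.getConf.get("spark.propertiesFile")
    val props = new Properties()
    val in = new URL(url).openStream()
    try props.load(in) finally in.close()
    val myFlag = props.getProperty("my.app.flag")     // hypothetical property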

cannot connect to Spark Application Master in YARN

2015-02-18 Thread rok
I'm trying to access the Spark UI for an application running through YARN. Clicking on the Application Master under Tracking UI I get an HTTP ERROR 500: HTTP ERROR 500 Problem accessing /proxy/application_1423151769242_0088/. Reason: Connection refused Caused by:

Re: cannot connect to Spark Application Master in YARN

2015-02-18 Thread Sean Owen
Can you track your comments on the existing issue? https://issues.apache.org/jira/browse/SPARK-5837 I personally can't reproduce this but more info would help narrow it down. On Wed, Feb 18, 2015 at 10:58 AM, rok rokros...@gmail.com wrote: I'm trying to access the Spark UI for an application

Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi, I'm writing a Spark program where I want to divide an RDD into different groups, but the groups are too big to use groupByKey. To cope with that, since I know in advance the list of keys for each group, I build a map from the keys to the RDDs that result from filtering the input RDD to get the

spark 1.2 slower than 1.0 in unit tests

2015-02-18 Thread Marcin Cylke
Hi We're using Spark in our app's unit tests. The tests start spark context with local[*] and test time now is 178 seconds on spark 1.2 instead of 41 seconds on 1.0. We are using spark version from cloudera CDH (1.2.0-cdh5.3.1). Could you give some hints what could cause that? and where to

[POWERED BY] Can you add Big Industries to the Powered by Spark page?

2015-02-18 Thread Emre Sevinc
Hello, Could you please add Big Industries to the Powered by Spark page at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? Company Name: Big Industries URL: http://www.bigindustries.be/ Spark Components: Spark Streaming Use Case: Big Content Platform Summary:

Re: Problem with 1 master + 2 slaves cluster

2015-02-18 Thread bit1...@163.com
But I am able to run the SparkPi example: ./run-example SparkPi 1000 --master spark://192.168.26.131:7077 Result:Pi is roughly 3.14173708 bit1...@163.com From: bit1...@163.com Date: 2015-02-18 16:29 To: user Subject: Problem with 1 master + 2 slaves cluster Hi sparkers, I setup a

Re: Re: Problem with 1 master + 2 slaves cluster

2015-02-18 Thread Akhil Das
When you give file://, while reading, it requires that all slaves have that path/file available locally on their system. It's ok to give file:// when you run your application in local mode (like master=local[*]) Thanks Best Regards On Wed, Feb 18, 2015 at 2:58 PM, Emre Sevinc

Re: Cannot access Spark web UI

2015-02-18 Thread Arush Kharbanda
It seems like it's not able to get a port it needs; are you sure that the required port is available? In what logs did you find this error? On Wed, Feb 18, 2015 at 2:21 PM, Akhil Das ak...@sigmoidanalytics.com wrote: The error says Cannot assign requested address. This means that you need

Re: Spark Streaming output cannot be used as input?

2015-02-18 Thread Sean Owen
To clarify, sometimes in the world of Hadoop people freely refer to an output 'file' when it's really a directory containing 'part-*' files which are pieces of the file. It's imprecise but that's the meaning. I think the scaladoc may be referring to 'the path to the file, which includes this

Re: Why groupBy is slow?

2015-02-18 Thread shahab
Thanks Francois for the comment and useful link. I understand the problem better now. best, /Shahab On Wed, Feb 18, 2015 at 10:36 AM, francois.garil...@typesafe.com wrote: In a nutshell : because it’s moving all of your data, compared to other operations (e.g. reduce) that summarize it in

Re: Is spark streaming +MlLib for online learning?

2015-02-18 Thread mucaho
Hi What is the general consensus/roadmap for implementing additional online / streamed trainable models? Apache Spark 1.2.1 currently supports streaming linear regression and clustering, although other streaming linear methods are planned according to the issue tracker. However, I cannot find any

Re: Problem with 1 master + 2 slaves cluster

2015-02-18 Thread Akhil Das
Since the cluster is standalone, you are better off reading/writing to hdfs instead of local filesystem. Thanks Best Regards On Wed, Feb 18, 2015 at 2:32 PM, bit1...@163.com bit1...@163.com wrote: But I am able to run the SparkPi example: ./run-example SparkPi 1000 --master

Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-18 Thread Vasu C
Hi Sasi, Forgot to mention job server uses Typesafe Config library. The input is JSON, you can find syntax in below link https://github.com/typesafehub/config Regards, Vasu C -- View this message in context:
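For reference, a tiny sketch of reading such JSON input with the Typesafe Config API inside a job; the key name "input.param1" is an assumption for illustration:

    import com.typesafe.config.{Config, ConfigFactory}

    // parse the JSON input that the job server hands to the job
    val config: Config = ConfigFactory.parseString("""{ "input": { "param1": "value1" } }""")
    val param1 = config.getString("input.param1")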

Re: Does spark *always* fork its workers?

2015-02-18 Thread Sean Owen
Forked, meaning, different from the driver? Spark will in general not even execute your tasks on the same machine as your driver. The driver can choose to execute a task locally in some cases. You are creating non-daemon threads in your function? your function can and should clean up after
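A common pattern for the cleanup Sean describes is to open and shut the resource around each partition, so worker-side threads and file handles do not outlive the task. A hedged Scala sketch (MyClient, process and shutdown are hypothetical stand-ins for the poster's library):

    rdd.mapPartitions { iter =>
      val client = new MyClient()            // may start non-daemon threads internally
      try {
        // materialize the partition inside the try so cleanup runs after all work is done
        iter.map(row => client.process(row)).toList.iterator
      } finally {
        client.shutdown()                    // stop threads, release file handles
      }
    }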

Re: Why groupBy is slow?

2015-02-18 Thread francois . garillot
In a nutshell : because it’s moving all of your data, compared to other operations (e.g. reduce) that summarize it in one form or another before moving it. For the longer answer:

Re: Re: Problem with 1 master + 2 slaves cluster

2015-02-18 Thread bit1...@163.com
Sure, thanks Akhil. A further question : Is local file system(file:///) not supported in standalone cluster? bit1...@163.com From: Akhil Das Date: 2015-02-18 17:35 To: bit1...@163.com CC: user Subject: Re: Problem with 1 master + 2 slaves cluster Since the cluster is standalone, you are

Why groupBy is slow?

2015-02-18 Thread shahab
Hi, Based on what I could see in the Spark UI, I noticed that groupBy transformation is quite slow (taking a lot of time) compared to other operations. Is there any reason that groupBy is slow? shahab

issue Running Spark Job on Yarn Cluster

2015-02-18 Thread sachin Singh
Hi, I want to run my spark Job in Hadoop yarn Cluster mode, I am using below command - spark-submit --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 --class com.dc.analysis.jobs.AggregationJob sparkanalitic.jar param1 param2 param3 I am getting error as under,

Creating RDDs from within foreachPartition() [Spark-Streaming]

2015-02-18 Thread t1ny
Hi all, I am trying to create RDDs from within /rdd.foreachPartition()/ so I can save these RDDs to ElasticSearch on the fly: stream.foreachRDD(rdd => { rdd.foreachPartition { iterator => { val sc = rdd.context iterator.foreach { case (cid,

Job Fails on sortByKey

2015-02-18 Thread athing goingon
hi, I have a job that fails on a shuffle during a sortByKey, on a relatively small dataset. http://pastebin.com/raw.php?i=1LxiG4rY

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Thanks, Cody. Yes, I originally started off by looking at that but I get a compile error if I try and use that approach: constructor JdbcRDD in class JdbcRDD<T> cannot be applied to given types. Not to mention that JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last argument).

Re: Job Fails on sortByKey

2015-02-18 Thread Saisai Shao
Would you mind explaining your problem a little more specifically, like exceptions you met or others, so someone who has experiences on it could give advice. Thanks Jerry 2015-02-19 1:08 GMT+08:00 athing goingon athinggoin...@gmail.com: hi, I have a job that fails on a shuffle during a

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Cody Koeninger
Take a look at https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java On Wed, Feb 18, 2015 at 11:14 AM, dgoldenberg dgoldenberg...@gmail.com wrote: I'm reading data from a database using JdbcRDD, in Java, and I have an implementation of

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Cody Koeninger
Is sc there a SparkContext or a JavaSparkContext? The compilation error seems to indicate the former, but JdbcRDD.create expects the latter On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I have tried that as well, I get a compile error -- [ERROR]

build spark for cdh5

2015-02-18 Thread Koert Kuipers
does anyone have the right maven invocation for cdh5 with yarn? i tried: $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests clean package $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn test it builds and passes tests just fine, but when i deploy on cluster and i try to

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Marcelo Vanzin
Hello, On Tue, Feb 17, 2015 at 8:53 PM, dgoldenberg dgoldenberg...@gmail.com wrote: I've tried setting spark.files.userClassPathFirst to true in SparkConf in my program, also setting it to true in $SPARK-HOME/conf/spark-defaults.conf as Is the code in question running on the driver or in some

Re: build spark for cdh5

2015-02-18 Thread Koert Kuipers
thanks! my bad On Wed, Feb 18, 2015 at 2:00 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Koert, You should be using -Phadoop-2.3 instead of -Phadoop2.3. -Sandy On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers ko...@tresata.com wrote: does anyone have the right maven invocation for

Re: Tableau beta connector

2015-02-18 Thread ganterm
Ashutosh, Were you able to figure this out? I am having the exact same question. I think the answer is to use Spark SQL to create/load a table in Hive (e.g. execute the HiveQL CREATE TABLE statement) but I am not sure. Hoping for something more simple than that. Anybody? Thanks! -- View

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi Sean, Thanks a lot for your answer. That explains it, as I was creating thousands of RDDs, so I guess the communication overhead was the reason why the Spark job was freezing. After changing the code to use RDDs of pairs and aggregateByKey it works just fine, and quite fast. Again, thanks a

Spark data incorrect when more than 200 tasks

2015-02-18 Thread lbierman
I'm fairly new to Spark. We have data in avro files on hdfs. We are trying to load up all the avro files (28 gigs worth right now) and do an aggregation. When we have less than 200 tasks the data all runs and produces the proper results. If there are more than 200 tasks (as stated in the logs by

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Cody Koeninger
That test I linked https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java#L90 is calling a static method JdbcRDD.create, not new JdbcRDD. Is that what you tried doing? On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg dgoldenberg...@gmail.com

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
I have tried that as well, I get a compile error -- [ERROR] ...SparkProto.java:[105,39] error: no suitable method found for create(SparkContext,anonymous ConnectionFactory,String,int,int,int,anonymous Function<ResultSet,Integer>) The code is a copy and paste: JavaRDD<Integer> jdbcRDD =

Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Jean-Pascal Billaud
Hey, It seems pretty clear that one of the strengths of Spark is to be able to share your code between your batch and streaming layer. Though, given that Spark Streaming uses DStream being a set of RDDs and Spark uses a single RDD, there might be some complexity associated with it. Of course since

Re: Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Arush Kharbanda
I find monoids pretty useful in this respect, basically separating out the logic in a monoid and then applying the logic to either a stream or a batch. A list of such practices could be really useful. On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud j...@tellapart.com wrote: Hey, It

Hamburg Apache Spark Meetup

2015-02-18 Thread Johan Beisser
If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it. http://www.meetup.com/Hamburg-Apache-Spark-Meetup/ On Tue, Feb 17, 2015 at 5:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Thanks! I've added you. Matei On Feb 17, 2015, at 4:06 PM, Ralph Bergmann |

Re: build spark for cdh5

2015-02-18 Thread Sandy Ryza
Hi Koert, You should be using -Phadoop-2.3 instead of -Phadoop2.3. -Sandy On Wed, Feb 18, 2015 at 10:51 AM, Koert Kuipers ko...@tresata.com wrote: does anyone have the right maven invocation for cdh5 with yarn? i tried: $ mvn -Phadoop2.3 -Dhadoop.version=2.5.0-cdh5.2.3 -Pyarn -DskipTests

ML Transformer

2015-02-18 Thread Cesar Flores
I am working right now with the ML pipeline, which I really like. However, in order to make real use of it, I would like to create my own transformers that implement org.apache.spark.ml.Transformer. In order to do that, a method from the PipelineStage needs to be implemented. But this method is

NotSerializableException: org.apache.http.impl.client.DefaultHttpClient when trying to send documents to Solr

2015-02-18 Thread dgoldenberg
I'm using Solrj in a Spark program. When I try to send the docs to Solr, I get the NotSerializableException on the DefaultHttpClient. Is there a possible fix or workaround? I'm using Spark 1.2.1 with Hadoop 2.4, SolrJ is version 4.0.0. final HttpSolrServer solrServer = new

Re: JsonRDD to parquet -- data loss

2015-02-18 Thread Michael Armbrust
Concurrent inserts into the same table are not supported. I can try to make this clearer in the documentation. On Tue, Feb 17, 2015 at 8:01 PM, Vasu C vasuc.bigd...@gmail.com wrote: Hi, I am running spark batch processing job using spark-submit command. And below is my code snippet.

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I'm not sure what "on the driver" means but I've tried setting spark.files.userClassPathFirst to true, in $SPARK-HOME/conf/spark-defaults.conf and also in the SparkConf programmatically; it appears to be ignored. The solution was to follow Emre's recommendation and downgrade the selected Solrj

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Cody, you were right, I had a copy and paste snag where I ended up with a vanilla SparkContext rather than a Java one. I also had to *not* use my function subclasses, rather just use anonymous inner classes for the Function stuff and that got things working. I'm fully following the JdbcRDD.create

Spark can't pickle class: error cannot lookup attribute

2015-02-18 Thread Guillaume Guy
Hi, This is a duplicate of the stack-overflow question here http://stackoverflow.com/questions/28569374/spark-returning-pickle-error-cannot-lookup-attribute. I hope to generate more interest on this mailing list. *The problem:* I am running into some attribute lookup problems when trying to

Thriftserver Beeline

2015-02-18 Thread gtinside
Hi , I created some hive tables and trying to list them through Beeline , but not getting any results. I can list the tables through spark-sql. When I connect beeline, it starts up with following message : Connecting to jdbc:hive2://tst001:10001 Enter username for jdbc:hive2://tst001:10001:

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Cody Koeninger
Can't you implement the org.apache.spark.api.java.function.Function interface and pass an instance of that to JdbcRDD.create? On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Cody, you were right, I had a copy and paste snag where I ended up with a vanilla

spark slave cannot execute without admin permission on windows

2015-02-18 Thread Judy Nash
Hi, Is it possible to configure Spark to run without admin permission on Windows? My current setup runs master/slave successfully with admin permission. However, if I downgrade the permission level from admin to user, SparkPi fails with the following exception on the slave node: Exception in thread

Re: Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Arush Kharbanda
Monoids are useful in aggregations, and try avoiding anonymous functions; pulling functions out of the Spark code allows them to be reused (possibly between Spark and Spark Streaming). On Thu, Feb 19, 2015 at 6:56 AM, Jean-Pascal Billaud j...@tellapart.com wrote: Thanks Arush. I will
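A small illustration of keeping the transformation as a plain RDD-to-RDD function so the same code runs in batch and via DStream.transform. The countWords function and paths are illustrative, not from the thread; dstream is an assumed DStream[String]:

    import org.apache.spark.rdd.RDD

    def countWords(lines: RDD[String]): RDD[(String, Long)] =
      lines.flatMap(_.split("\\s+")).map(w => (w, 1L)).reduceByKey(_ + _)

    // batch:
    val batchCounts = countWords(sc.textFile("hdfs:///input"))

    // streaming: apply the same function to every micro-batch
    val streamCounts = dstream.transform(countWords _)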

Re: spark slave cannot execute without admin permission on windows

2015-02-18 Thread Akhil Das
You don't require admin permission; just make sure all those jars have execute permission (read/write access) Thanks Best Regards On Thu, Feb 19, 2015 at 11:30 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Hi, Is it possible to configure spark to run without admin

Re: OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-18 Thread Tom Walwyn
Thanks Imran, I'll try your suggestions. I eventually got this to run by 'checkpointing' the joined RDD (according to Akhil's suggestion) before performing the reduceBy, and then checkpointing it again afterward, i.e. val rdd2 = rdd.join(rdd, numPartitions=1000) .map(fp => ((fp._2._1, fp._2._2),
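A sketch of how that workaround fits together, with the truncated snippet completed under assumptions (the checkpoint directory, field meanings, and the reduce function are guesses):

    sc.setCheckpointDir("/tmp/checkpoints")           // assumed location
    val joined = rdd.join(rdd, numPartitions = 1000)
    joined.checkpoint()                               // truncate lineage before the reduce
    val reduced = joined
      .map(fp => ((fp._2._1, fp._2._2), 1))
      .reduceByKey(_ + _)
    reduced.checkpoint()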

Re: Is spark streaming +MlLib for online learning?

2015-02-18 Thread Reza Zadeh
This feature request is already being tracked: https://issues.apache.org/jira/browse/SPARK-4981 Aiming for 1.4 Best, Reza On Wed, Feb 18, 2015 at 2:40 AM, mucaho muc...@yahoo.com wrote: Hi What is the general consensus/roadmap for implementing additional online / streamed trainable models?

Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-18 Thread Sasi
Thank you very much Vasu. Let me add some more points to my question. We are developing a Java program for connecting spark-jobserver to Vaadin (Java framework). Following is the sample code I wrote for connecting both (the code works fine): / URL url = null; HttpURLConnection connection = null;

How to connect a mobile app (Android/iOS) with a Spark backend?

2015-02-18 Thread Ralph Bergmann | the4thFloor.eu
Hi, I have dependency problems using spark-core inside of an HttpServlet (see other mail from me). Maybe I'm wrong?! What I want to do: I develop a mobile app (Android and iOS) and want to connect them with Spark on the backend side. To do this I want to use Tomcat. The app uses https to ask

Periodic Broadcast in Apache Spark Streaming

2015-02-18 Thread aanilpala
I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that need to be updated as new stream items arrive. For example, I want to change the learning rate as new predictions are made. However, I doubt that there is a way to

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Sean Owen
At some level, enough RDDs creates hundreds of thousands of tiny partitions of data each of which creates a task for each stage. The raw overhead of all the message passing can slow things down a lot. I would not design something to use an RDD per key. You would generally use key by some value you
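To make the suggestion concrete, a sketch of keying a single RDD and aggregating per key instead of building one filtered RDD per key. extractKey and extractValue are hypothetical, input is an assumed RDD of records, and StatCounter stands in for whatever per-group statistic is needed:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.StatCounter

    val keyed: RDD[(String, Double)] =
      input.map(x => (extractKey(x), extractValue(x)))

    val statsPerKey: RDD[(String, StatCounter)] =
      keyed.aggregateByKey(new StatCounter())(
        (acc, v) => acc.merge(v),   // fold one value into the partial stats
        (a, b)   => a.merge(b))     // combine partial stats across partitions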

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Are you proposing I downgrade Solrj's httpclient dependency to be on par with that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest? Solrj has to stay with its selected version. I could try and rebuild Spark with the latest httpclient but I've no idea what effects that may

Re: How to integrate hive on spark

2015-02-18 Thread Arush Kharbanda
Hi Did you try these steps. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started Thanks Arush On Wed, Feb 18, 2015 at 7:20 PM, sandeepvura sandeepv...@gmail.com wrote: Hi , I am new to sparks.I had installed spark on 3 node cluster.I would like to integrate

Re: Creating RDDs from within foreachPartition() [Spark-Streaming]

2015-02-18 Thread Sean Owen
You can't use RDDs inside RDDs. RDDs are managed from the driver, and functions like foreachRDD execute things on the remote executors. You can write code to simply directly save whatever you want to ES. There is not necessarily a need to use RDDs for that. On Wed, Feb 18, 2015 at 11:36 AM, t1ny
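What the suggestion amounts to in code: skip the nested RDD and index documents straight from the executors, one client per partition. EsClient and index() are hypothetical placeholders for whichever Elasticsearch client is in use:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val es = new EsClient("http://es-host:9200")   // assumed endpoint
        try records.foreach(r => es.index("myindex", "mytype", r))
        finally es.close()
      }
    }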

Re: How to connect a mobile app (Android/iOS) with a Spark backend?

2015-02-18 Thread Arush Kharbanda
I am running Spark jobs behind Tomcat. We didn't face any issues, but for us the user base is very small. The possible blockers could be: 1. If there are many users of the system, then jobs might have to wait; you might want to think about the kind of scheduling you want to do. 2. Again if the no of

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi Paweł, Thanks a lot for your answer. I finally got the program to work by using aggregateByKey, but I was wondering why creating thousands of RDDs doesn't work. I think that could be interesting for using methods that work on RDDs like for example JavaDoubleRDD.stats() (

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Emre Sevinc
Hello Dmitry, I had almost the same problem and solved it by using version 4.0.0 of SolrJ: <dependency> <groupId>org.apache.solr</groupId> <artifactId>solr-solrj</artifactId> <version>4.0.0</version> </dependency> In my case, I was lucky that version 4.0.0 of SolrJ had all the functionality

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would that not collide with Spark/Hadoop's default dependency on HttpClient set to 4.2.6? If that's the case that might just solve the problem. Would Solrj 4.0.0 work with the latest Solr, 4.10.3? On Wed, Feb 18, 2015 at 10:50

Re: Why cached RDD is recomputed again?

2015-02-18 Thread shahab
Thanks Sean, but I don't think that fitting into memory is the case, because: 1- I can see in the UI that 100% of RDD is cached, (moreover the RDD is quite small, 100 MB, while worker has 1.5 GB) 2- I also tried MEMORY_AND_DISK, but absolutely no difference ! Probably I have messed up somewhere

[no subject]

2015-02-18 Thread Luca Puggini

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Emre Sevinc
On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would that not collide with Spark/Hadoop's default dependency on HttpClient set to 4.2.6? If that's the case that might just solve the problem.

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thanks, Emre! Will definitely try this. On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc emre.sev...@gmail.com wrote: On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would that not collide

Re: How to connect a mobile app (Android/iOS) with a Spark backend?

2015-02-18 Thread Sean Owen
This does not sound like a Spark problem -- doesn't even necessarily sound like a distributed problem. Are you of a scale where building simple logic in a web tier that queries a NoSQL / SQL database doesn't work? If you are at such a scale, then it sounds like you're describing a very high

Re: Learning GraphX Questions

2015-02-18 Thread Matthew Bucci
Thanks for all the responses so far! I have started to understand the system more, but I just had another question while I was going along. Is there a way to check the individual partitions of an RDD? For example, if I had a graph with vertices a,b,c,d and it was split into 2 partitions could I

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I think I'm going to have to rebuild Spark with commons.httpclient.version set to 4.3.1 which looks to be the version chosen by Solrj, rather than the 4.2.6 that Spark's pom mentions. Might work. On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi Did you

Re:Is Databricks log analysis reference app only based on Java API

2015-02-18 Thread Todd
sorry for the noise. I have found it.. At 2015-02-18 23:34:40, Todd bit1...@163.com wrote: Looks like the log analysis reference app provided by Databricks at https://github.com/databricks/reference-apps only has a Java API? I'd like to see the Scala version one.

Is Databricks log analysis reference app only based on Java API

2015-02-18 Thread Todd
Looks like the log analysis reference app provided by Databricks at https://github.com/databricks/reference-apps only has a Java API? I'd like to see the Scala version one.

How to integrate hive on spark

2015-02-18 Thread sandeepvura
Hi, I am new to Spark. I had installed Spark on a 3 node cluster. I would like to integrate Hive on Spark. Can anyone please help me on this, Regards, Sandeep.v -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-integrate-hive-on-spark-tp21702.html

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Paweł Szulc
Maybe you can omit using grouping all together with groupByKey? What is your next step after grouping elements by key? Are you trying to reduce values? If so then I would recommend using some reducing functions like for example reduceByKey or aggregateByKey. Those will first reduce value for each
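A word-count style illustration of the point, assuming pairs is an RDD[(String, Int)]: reduceByKey combines values map-side before the shuffle, while groupByKey ships every element to one place first:

    // preferred: partial sums are computed before shuffling
    val counts = pairs.reduceByKey(_ + _)

    // equivalent result, but all values for a key cross the network
    val countsViaGroup = pairs.groupByKey().mapValues(_.sum)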

Re: How to connect a mobile app (Android/iOS) with a Spark backend?

2015-02-18 Thread Sean Owen
Although you can do lots of things, I don't think Spark is something you should think of as a synchronous, real-time query API. So, somehow trying to use it directly from a REST API is probably not the best architecture. That said, it depends a lot on what you are trying to do. What are you

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-18 Thread Imran Rashid
so if you only change this line: https://gist.github.com/emres/0fb6de128baea099e741#file-mymoduledriver-java-L137 to json.print() it processes 16 files instead? I am totally perplexed. My only suggestions to help debug are (1) see what happens when you get rid of MyModuleWorker completely --

Re: Why cached RDD is recomputed again?

2015-02-18 Thread Sean Owen
The most likely explanation is that you wanted to put all the partitions in memory and they don't all fit. Unless you asked to persist to memory or disk, some partitions will simply not be cached. Consider using MEMORY_AND_DISK persistence. This can also happen if blocks were lost due to node
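For reference, spilling partitions that don't fit in memory to disk rather than dropping them looks like this:

    import org.apache.spark.storage.StorageLevel

    rdd.persist(StorageLevel.MEMORY_AND_DISK)   // keep what fits in memory, spill the rest to disk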

Re: How to connect a mobile app (Android/iOS) with a Spark backend?

2015-02-18 Thread Ralph Bergmann | the4thFloor.eu
Hi, On 18.02.15 at 15:58, Sean Owen wrote: That said, it depends a lot on what you are trying to do. What are you trying to do? You just say you're connecting to spark. There are 2 tasks I want to solve with Spark. 1) The user opens the mobile app. The app sends a ping to the backend. When

Re: WARN from Similarity Calculation

2015-02-18 Thread Debasish Das
I am still debugging it but I believe if m% of users have unusually large columns and the RDD partitioner on RowMatrix is hashPartitioner then due to the basic algorithm without sampling, some partitions can cause unusually large number of keys... If my debug shows that I will add a custom

Re: OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-18 Thread Imran Rashid
Hi Tom, there are a couple of things you can do here to make this more efficient. First, I think you can replace your self-join with a groupByKey. On your example data set, this would give you (1, Iterable(2,3)) (4, Iterable(3)). This reduces the amount of data that needs to be shuffled, and
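The replacement in code, on the example pairs (1,2), (1,3), (4,3):

    // instead of rdd.join(rdd, ...), group each key's values once
    val grouped = rdd.groupByKey()   // (1, Iterable(2, 3)), (4, Iterable(3))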

Re: Spark can't pickle class: error cannot lookup attribute

2015-02-18 Thread Davies Liu
Currently, PySpark cannot pickle a class object defined in the current script ('__main__'); the workaround is to put the implementation of the class into a separate module, then use bin/spark-submit --py-files xxx.py to deploy it. in xxx.py: class test(object): def __init__(self, a, b):

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
That's exactly what I was doing. However, I ran into runtime issues with doing that. For instance, I had a public class DbConnection extends AbstractFunction0<Connection> implements Serializable. I got a runtime error from Spark complaining that DbConnection wasn't an instance of scala.Function0.

Re: ML Transformer

2015-02-18 Thread Joseph Bradley
Hi Cesar, Thanks for trying out Pipelines and bringing up this issue! It's been an experimental API, but feedback like this will help us prepare it for becoming non-Experimental. I've made a JIRA, and will vote for this being protected (instead of private[ml]) for Spark 1.3:

No suitable driver found error, Create table in hive from spark sql

2015-02-18 Thread Dhimant
No suitable driver found error, Create table in hive from spark sql. I am trying to execute following example. SPARKGIT: spark/examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala My setup :- hadoop 1.6,spark 1.2, hive 1.0, mysql server (installed via yum install

Re: Processing graphs

2015-02-18 Thread Vijayasarathy Kannan
Hi, Thanks for your reply. I basically want to check if my understanding of what parallelize() on RDDs does is correct. In my case, I create a vertex RDD and edge RDD and distribute them by calling parallelize(). Now does Spark perform any operation on these RDDs in parallel? For example, if I apply

Re: Spark Streaming output cannot be used as input?

2015-02-18 Thread Tim Smith
+1 for writing the Spark output to Kafka. You can then hang off multiple compute/storage framework from kafka. I am using a similar pipeline to feed ElasticSearch and HDFS in parallel. Allows modularity, you can take down ElasticSearch or HDFS for maintenance without losing (except for some edge

Re: Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Jean-Pascal Billaud
Thanks Arush. I will check that out. On Wed, Feb 18, 2015 at 11:06 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: I find monoids pretty useful in this respect, basically separating out the logic in a monoid and then applying the logic to either a stream or a batch. A list of such

Re: No suitable driver found error, Create table in hive from spark sql

2015-02-18 Thread Dhimant
Found solution from one of the post found on internet. I updated spark/bin/compute-classpath.sh and added database connector jar into classpath. CLASSPATH=$CLASSPATH:/data/mysql-connector-java-5.1.14-bin.jar -- View this message in context:

RE: Spark Streaming output cannot be used as input?

2015-02-18 Thread Jose Fernandez
Thanks for the advice folks, it is much appreciated. This seems like a pretty unfortunate design flaw. My team was surprised by it. I’m going to drop the two-step process and do it all in a single step until we get Kafka online. From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday,

Re: Spark Streaming and message ordering

2015-02-18 Thread jay vyas
This is a *fantastic* question. The idea of how we identify individual things in multiple DStreams is worth looking at. The reason being, that you can then fine tune your streaming job, based on the RDD identifiers (i.e. are the timestamps from the producer correlating closely to the order in

RE: NotSerializableException: org.apache.http.impl.client.DefaultHttpClient when trying to send documents to Solr

2015-02-18 Thread Jose Fernandez
You need to instantiate the server in the forEachPartition block or Spark will attempt to serialize it to the task. See the design patterns section in the Spark Streaming guide. Jose Fernandez | Principal Software Developer jfernan...@sdl.com | The information transmitted, including
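The pattern referred to, sketched with SolrJ: create the non-serializable client on the executor inside foreachPartition rather than in the driver. The URL and field names are illustrative:

    import org.apache.solr.client.solrj.impl.HttpSolrServer
    import org.apache.solr.common.SolrInputDocument

    rdd.foreachPartition { docs =>
      val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1")  // assumed URL
      docs.foreach { d =>
        val doc = new SolrInputDocument()
        doc.addField("id", d.id)        // hypothetical fields
        doc.addField("text", d.text)
        solr.add(doc)
      }
      solr.commit()
      solr.shutdown()                   // release the underlying HttpClient
    }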

Spark Streaming and message ordering

2015-02-18 Thread Neelesh
There does not seem to be a definitive answer on this. Every time I google for message ordering, the only relevant thing that comes up is this - http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html . With a kafka receiver that pulls data from a single kafka partition