Re: install sbt

2014-10-28 Thread Akhil Das
1. Download 2. Extract 3. export PATH=$PATH:/path/to/sbt/bin/sbt Now you can do all the sbt commands (sbt package etc.) Thanks Best Regards On Tue, Oct 28, 2014 at 9:49 PM, Soumya Simanta wrote: > sbt is just a jar file. So

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Prashant Sharma
Yes, we shade Akka to change its protobuf version (if I am not wrong). Yes, binary compatibility with other Akka modules is compromised. One thing you can try is to use Akka from org.spark-project.akka; I have not tried this and am not sure if it's going to help you, but maybe you could exclude the akka s

Re: Submitting Spark application through code

2014-10-28 Thread Akhil Das
And the Scala way of doing it would be: val sc = new SparkContext(conf) sc.addJar("/full/path/to/my/application/jar/myapp.jar") On Wed, Oct 29, 2014 at 1:44 AM, Shailesh Birari wrote: > Yes, this is doable. > I am submitting the Spark job using > JavaSparkContext spark = new JavaSparkCo
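A minimal, self-contained sketch of the pattern above (the master URL is a placeholder, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: create a context and ship an extra application jar to the executors.
    val conf = new SparkConf().setAppName("myapp").setMaster("spark://master:7077") // placeholder master
    val sc = new SparkContext(conf)
    sc.addJar("/full/path/to/my/application/jar/myapp.jar") // path from the thread; adjust as needed
    // ... run jobs that use classes from myapp.jar ...
    sc.stop()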

Fwd: sampling in spark

2014-10-28 Thread Chengi Liu
-- Forwarded message -- From: Chengi Liu Date: Tue, Oct 28, 2014 at 11:23 PM Subject: Re: sampling in spark To: Davies Liu Any suggestions.. Thanks On Tue, Oct 28, 2014 at 12:53 AM, Chengi Liu wrote: > Is there an equivalent way of doing the following: > > a = [1,2,3,4] > > r

Re: FileNotFoundException in appcache shuffle files

2014-10-28 Thread Shaocun Tian
Hi, Ryan We have met similar errors and increasing executor memory solved it. Though I am not sure about the detailed reason, it might be worth a try. On Wed, Oct 29, 2014 at 1:34 PM, Ryan Williams [via Apache Spark User List] wrote: > My job is failing with the following error: > > 14/10/29 02

How to retrieve the Spark context when HiveContext is used in Spark Streaming

2014-10-28 Thread critikaled
Hi, I'm trying to get hold of the Spark context from a HiveContext or StreamingContext. I have two pieces of code, one in core Spark and one in Spark Streaming. The plain Spark code with Hive gives me the context; the Spark Streaming code with Hive prints null. Please help me figure out how to make this code wor

Spark Streaming from Kafka

2014-10-28 Thread Harold Nguyen
Hi, Just wondering if you've seen the following error when reading from Kafka: ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest at kafka.utils.Log4jController$.(Log4jController.scala:29) at kafka.uti

RE: FileNotFoundException in appcache shuffle files

2014-10-28 Thread Shao, Saisai
Hi Ryan, This is an issue from sort-based shuffle, not consolidated hash-based shuffle. I guess this issue mostly occurs when the Spark cluster is in an abnormal situation, maybe a long GC pause or something else; you can check the system status or whether there are any other exceptions besides this one.

FileNotFoundException in appcache shuffle files

2014-10-28 Thread Ryan Williams
My job is failing with the following error: 14/10/29 02:59:14 WARN scheduler.TaskSetManager: Lost task 1543.0 in stage 3.0 (TID 6266, demeter-csmau08-19.demeter.hpc.mssm.edu): java.io.FileNotFoundException: /data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-lo

Re: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Shixiong Zhu
Or "def getAs[T](i: Int): T" Best Regards, Shixiong Zhu 2014-10-29 13:16 GMT+08:00 Zhan Zhang : > Can you use row(i).asInstanceOf[] > > Thanks. > > Zhan Zhang > > > > On Oct 28, 2014, at 5:03 PM, Mohammed Guller > wrote: > > Hi – > > The Spark SQL Row class has methods such as getInt, getLong,
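A small sketch combining the two suggestions in this thread; the Row and column index are assumptions for illustration:

    import java.sql.Timestamp
    import org.apache.spark.sql.Row

    // Two ways to pull a timestamp column out of a Spark SQL Row:
    def timestampAt(row: Row, i: Int): Timestamp = row(i).asInstanceOf[Timestamp] // cast the untyped value
    def timestampAt2(row: Row, i: Int): Timestamp = row.getAs[Timestamp](i)       // generic accessor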

Re: cannot run spark shell in yarn-client mode

2014-10-28 Thread TJ Klein
Hi Marco, I have the same issue. Did you fix it by chance? How? Best, Tassilo -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cannot-run-spark-shell-in-yarn-client-mode-tp4013p17603.html Sent from the Apache Spark User List mailing list archive at Nabble.

Re: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Zhan Zhang
Can you use row(i).asInstanceOf[] Thanks. Zhan Zhang On Oct 28, 2014, at 5:03 PM, Mohammed Guller wrote: > Hi – > > The Spark SQL Row class has methods such as getInt, getLong, getBoolean, > getFloat, getDouble, etc. However, I don’t see a getDate method. So how can > one retrieve a date/

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Xiangrui Meng
Hi Ilya, Let's move the discussion to the JIRA page. I saw a couple of users reporting this issue but I have never seen it myself. Best, Xiangrui On Tue, Oct 28, 2014 at 8:50 AM, Ilya Ganelin wrote: > Hi all - I've simplified the code so now I'm literally feeding in 200 > million ratings directly to

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Xiangrui Meng
FYI, there is a PR to make mllib.rdd.RDDFunctions public: https://github.com/apache/spark/pull/2907 -Xiangrui On Tue, Oct 28, 2014 at 5:18 AM, Yanbo Liang wrote: > Yes, it can import org.apache.spark.mllib.rdd.RDDFunctions but you can not > use any method in this class or even new an object of th

Re: Spark-submit job "Killed"

2014-10-28 Thread akhandeshi
I did have it as rdd.saveAsText("RDD"); and now I have it as: Log.info("RDD Counts"+rdd.persist(StorageLevel.MEMORY_AND_DISK_SER()).count()); -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submt-job-Killed-tp17560p17598.html Sent from the Apache Spa

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
I'm using Spark built from HEAD, I think it uses modified Akka 2.3.4, right? Jianshi On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller wrote: > Try a version built with Akka 2.2.x > > > > Mohammed > > > > *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] > *Sent:* Tuesday, October 28, 2014

Re: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Steve Lewis
I wrote a custom class loader to find all classes that were loaded that implement Serializable. I ran it locally to load all classes and registered ALL of these - I still get these issues. On Tue, Oct 28, 2014 at 8:02 PM, Ganelin, Ilya wrote: > Have you checked for any global variables in your s

RE: Use RDD like a Iterator

2014-10-28 Thread Ganelin, Ilya
Would Rdd.map() do what you need? It will apply a function to every element of the rdd and return a resulting RDD. -Original Message- From: Zhan Zhang [zzh...@hortonworks.com] Sent: Tuesday, October 28, 2014 11:23 PM Eastern Standard Time To: Dai, Kevin Cc:

Re: Use RDD like a Iterator

2014-10-28 Thread Zhan Zhang
I think it is already lazily computed, or do you mean something else? Following is the signature of compute in RDD def compute(split: Partition, context: TaskContext): Iterator[T] Thanks. Zhan Zhang On Oct 28, 2014, at 8:15 PM, Dai, Kevin wrote: > Hi, ALL > > I have a RDD[T], can I use it
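A sketch illustrating the laziness discussed above; toLocalIterator is an aside not mentioned in the thread, but it lets the driver consume an RDD one partition at a time:

    import org.apache.spark.{SparkConf, SparkContext}

    // The map below does no work until elements are actually pulled.
    val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, 2).map { x => println(s"computing $x"); x * 2 }
    val it: Iterator[Int] = rdd.toLocalIterator // streams one partition at a time to the driver
    it.take(3).foreach(println)                 // only now is anything computed
    sc.stop()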

Re: run multiple spark applications in parallel

2014-10-28 Thread Zhan Zhang
You can set your executor number with --num-executors. Also, changing to yarn-client saves you one container for the driver. Then check your YARN resource manager to make sure there are more containers available to serve your extra apps. Thanks. Zhan Zhang On Oct 28, 2014, at 5:31 PM, Soumya Simanta

Use RDD like a Iterator

2014-10-28 Thread Dai, Kevin
Hi, ALL I have an RDD[T]; can I use it like an iterator? That means I can compute every element of this RDD lazily. Best Regards, Kevin.

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
scala> locations.queryExecution warning: there were 1 feature warning(s); re-run with -feature for details res28: _4.sqlContext.QueryExecution forSome { val _4: org.apache.spark.sql.SchemaRDD } = == Parsed Logical Plan == SparkLogicalPlan (ExistingRdd [locationName#80,locationNumber#81], Mapped

RE: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Ganelin, Ilya
Have you checked for any global variables in your scope? Remember that even if variables are not passed to the function they will be included as part of the context passed to the nodes. If you can't zen out what is breaking then try to simplify what you're doing. Set up a simple test call (like

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
Can you println the .queryExecution of the SchemaRDD? On Tue, Oct 28, 2014 at 7:43 PM, Corey Nolet wrote: > So this appears to work just fine: > > hctx.sql("SELECT p.name, p.age FROM people p LATERAL VIEW > explode(locations) l AS location JOIN location5 lo ON l.number = > lo.streetNumber WHERE

com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Steve Lewis
A cluster I am running on keeps getting KryoException. Unlike the Java serializer, the Kryo exception gives no clue as to what class is giving the error. The application runs properly locally but not on the cluster, and I have my own custom KryoRegistrator and register several dozen classes - essentially

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
On Tue, Oct 28, 2014 at 6:56 PM, Corey Nolet wrote: > Am I able to do a join on an exploded field? > > Like if I have another object: > > { "streetNumber":"2300", "locationName":"The Big Building"} and I want to > join with the previous json by the locations[].number field- is that > possible? >

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
Am I able to do a join on an exploded field? Like if I have another object: { "streetNumber":"2300", "locationName":"The Big Building"} and I want to join with the previous json by the locations[].number field- is that possible? On Tue, Oct 28, 2014 at 9:31 PM, Corey Nolet wrote: > Michael, >

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
Michael, Awesome, this is what I was looking for. So it's possible to use hive dialect in a regular sql context? This is what was confusing to me- the docs kind of allude to it but don't directly point it out. On Tue, Oct 28, 2014 at 9:30 PM, Michael Armbrust wrote: > You can do this: > > $ sbt

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
You can do this: $ sbt/sbt hive/console scala> jsonRDD(sparkContext.parallelize("""{ "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]}""" :: Nil)).registerTempTable("people") scala> sql("SELECT name FROM people LATERAL VIEW explode(locations) l AS location WHERE loc
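A hedged reconstruction of the full console session; the WHERE clause is completed from the follow-up question in this thread (match location number 2300):

    // Run inside sbt/sbt hive/console, where jsonRDD, sparkContext and sql are already in scope.
    jsonRDD(sparkContext.parallelize(
      """{ "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]}""" :: Nil))
      .registerTempTable("people")

    sql("""SELECT name FROM people
           LATERAL VIEW explode(locations) l AS location
           WHERE location.number = 2300""").collect().foreach(println)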

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
So it wouldn't be possible to have a json string like this: { "name":"John", "age":53, "locations": [{ "street":"Rodeo Dr", "number":2300 }]} And query all people who have a location with number = 2300? On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust wrote: > On Tue, Oct 28, 2014 at 2:19

Re: Why RDD is not cached?

2014-10-28 Thread Mayur Rustagi
What is the partition count of the RDD? It's possible that you don't have enough memory to store the whole RDD on a single machine. Can you try forcibly repartitioning the RDD and then caching? Regards Mayur On Tue Oct 28 2014 at 1:19:09 AM shahab wrote: > I used Cache followed by a "count" on RDD

Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
Maybe changing --master yarn-cluster to --master yarn-client will help. On Tue, Oct 28, 2014 at 7:25 PM, Josh J wrote: > Sorry, I should've included some stats with my email > > I execute each job in the following manner > > ./bin/spark-submit --class CLASSNAME --master yarn-cluster --driver-memory

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-28 Thread Stephen Boesch
I have checked out from master, cleaned/rebuilt on the command line in Maven, then cleaned/rebuilt in IntelliJ many times. This error persists through it all. Anyone have a solution? 2014-10-23 1:43 GMT-07:00 Stephen Boesch : > After having checked out from master/head the following error occu

how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-28 Thread Mohammed Guller
Hi - The Spark SQL Row class has methods such as getInt, getLong, getBoolean, getFloat, getDouble, etc. However, I don't see a getDate method. So how can one retrieve a date/timestamp type column from a result set? Thanks, Mohammed

RE: Is it possible to call a transform + action inside an action?

2014-10-28 Thread Ganelin, Ilya
You don't have to do toArray - think in M/R terms. HashTable a; rdd.map(s => a[s.key]=s.value). Broadcast variables are treated like regular variables (as far as syntax is concerned). The difference is that normally Spark has to serialize and send to all nodes any variables in the scope of your f
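A rough sketch of the broadcast lookup pattern described above; all names and values are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

    // Collect the small data set on the driver, broadcast it once per executor,
    // and read it (read-only) inside the map over the large RDD.
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-demo").setMaster("local[2]"))
    val lookup = sc.broadcast(sc.parallelize(Seq("a" -> 1, "b" -> 2)).collectAsMap())
    val enriched = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(key => key -> lookup.value.getOrElse(key, 0))
    enriched.collect().foreach(println)
    sc.stop()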

Re: Batch of updates

2014-10-28 Thread Flavio Pompermaier
Sorry, but I wasn't able to code my stuff using accumulators as you suggested :( In my use case I have to add elements to an array/list and then, every 100 elements, commit the batch to a Solr index and then clear it. In the cleanup code I have to commit the uncommitted (remainder) elements. In the

Re: run multiple spark applications in parallel

2014-10-28 Thread Josh J
Sorry, I should've included some stats with my email I execute each job in the following manner ./bin/spark-submit --class CLASSNAME --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 UBER.JAR ${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1 The box has 24 CPUs, Inte

Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
Try reducing the resources (cores and memory) of each application. > On Oct 28, 2014, at 7:05 PM, Josh J wrote: > > Hi, > > How do I run multiple spark applications in parallel? I tried to run on yarn > cluster, though the second application submitted does not run. > > Thanks, > Josh

run multiple spark applications in parallel

2014-10-28 Thread Josh J
Hi, How do I run multiple spark applications in parallel? I tried to run on yarn cluster, though the second application submitted does not run. Thanks, Josh

RE: Is it possible to call a transform + action inside an action?

2014-10-28 Thread kpeng1
Ok cool. So in that case the only way I could think of doing this would be calling the toArray method on those RDDs, which would return Array[String], and storing them as broadcast variables. I read about the broadcast variables, but it's still fuzzy. I assume that since broadcast variables are ava

How to properly debug spark streaming?

2014-10-28 Thread kpeng1
I am still fairly new to spark and spark streaming. I have been struggling with how to properly debug spark streaming and I was wondering what is the best approach. I have been basically putting println statements everywhere, but sometimes they show up when I run the job and sometimes they don't.

RE: Is it possible to call a transform + action inside an action?

2014-10-28 Thread Ganelin, Ilya
You cannot have nested RDD transformations in Scala Spark. The issue is that when the outer operation is distributed to the cluster and kicks off a new job (the inner query) the inner job no longer has the context for the outer job. The way around this is to either do a join on two RDDs or to st

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
To clarify, this error was thrown from the thrift server when beeline was started to establish the connection, as follows: $ beeline -u jdbc:hive2://`hostname`:4080 -n username From: Du Li mailto:l...@yahoo-inc.com.INVALID>> Date: Tuesday, October 28, 2014 at 11:35 AM To: Cheng Lian mailto:lian.c

Is it possible to call a transform + action inside an action?

2014-10-28 Thread kpeng1
I am currently writing an application that uses Spark Streaming. What I am trying to do is basically read in a few files (I do this by using the SparkContext textFile) and then process those files inside an action that I apply to a streaming RDD. Here is the main code below: def main(args: Array[S

unsubscribe

2014-10-28 Thread Ricky Thomas

RE: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Mohammed Guller
Try a version built with Akka 2.2.x Mohammed From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Tuesday, October 28, 2014 3:03 AM To: user Subject: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext Hi, I got the following exceptions w

Re: Building Spark against Cloudera 5.2.0 - Failure

2014-10-28 Thread Sean Owen
Typo: -Phadoop.version=2.5.0-cdh5.2.0 should be -Dhadoop.version=2.5.0-cdh5.2.0 which is correct in the first command. -P selects profiles. On Tue, Oct 28, 2014 at 10:35 PM, Ganelin, Ilya wrote: > Hello all, I am attempting to manually build the master branch of Spark > against Cloudera’s 5.

Re: Spark-submit job "Killed"

2014-10-28 Thread Ganelin, Ilya
Hi Ami - I suspect that your code is completing because you have nothing to actually force resolution of your job. Spark executes lazily, so for example, if you have a bunch of maps in sequence but nothing else, Spark will not actually execute anything. Try adding an RDD.count() on the last RDD th
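A sketch of the point above, assuming the spark-shell's sc; the input path is a placeholder:

    // The map alone schedules no work; count() is the action that actually runs the job.
    val mapped = sc.textFile("hdfs:///path/to/input").map(_.toUpperCase) // lazy
    val n = mapped.count()                                               // forces execution
    println(s"processed $n lines")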

Building Spark against Cloudera 5.2.0 - Failure

2014-10-28 Thread Ganelin, Ilya
Hello all, I am attempting to manually build the master branch of Spark against Cloudera’s 5.2.0 deployment. To do this I am running: mvn -Pyarn -Dhadoop.version=2.5.0-cdh5.2.0 -DskipTests clean package The build completes successfully and then I run: mvn -Pyarn -Phadoop.version=2.5.0-cdh5.2.0 t

Spark-submit job "Killed"

2014-10-28 Thread akhandeshi
I recently started seeing this new problem where spark-submit is terminated by a "Killed" message but no error message indicating what happened. I have enabled logging in the Spark configuration. Has anyone seen this or know how to troubleshoot? -- View this message in context: http://apache-spar

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet wrote: > Is it possible to select if, say, there was an addresses field that had a > json array? > You can get the Nth item by "address".getItem(0). If you want to walk through the whole array look at LATERAL VIEW EXPLODE in HiveQL

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread Michael Armbrust
This feature is not in 1.1 and is not going to promise one file per unique value of the data. The only way to do that would be to write your own partitioner. On Tue, Oct

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
So, this means that I can create a table and insert data into it with dynamic partitioning, and those partitions would be inherited by RDDs. Is it in Spark 1.1.0? If not, is there a way to partition the data in a file based on some attributes of the rows in the data (without hardcoding the number

problem with start-slaves.sh

2014-10-28 Thread Pagliari, Roberto
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I built with the -Phive option to be able to interface with Hive). I'm getting this ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop it first. Am I doing something wrong? In my specific case, shark+hive is ru

Re: Yarn-Client Python

2014-10-28 Thread TJ Klein
Hi Andrew, thanks for trying to help. However, I am a bit confused now. I'm not setting any 'spark.driver.host'; in particular, spark-defaults.conf is empty/non-existing. I thought this is only required when running Spark in standalone mode. Isn't it the case that when using YARN all the configuration nee

Re: Submitting Spark application through code

2014-10-28 Thread Shailesh Birari
Yes, this is doable. I am submitting the Spark job using JavaSparkContext spark = new JavaSparkContext(sparkMaster, "app name", System.getenv("SPARK_HOME"), new String[] {"application JAR"}); To run this first you have to create the application jar and in above API specify

Re: Yarn-Client Python

2014-10-28 Thread Andrew Or
Hey TJ, It appears that your ApplicationMaster thinks it's on the same node as your driver. Are you setting "spark.driver.host" by any chance? Can you post the value of this config here? (You can access it through the SparkUI) 2014-10-28 12:50 GMT-07:00 TJ Klein : > Hi there, > > I am trying to

Re: GraphX StackOverflowError

2014-10-28 Thread Ankur Dave
At 2014-10-28 16:27:20 +0300, Zuhair Khayyat wrote: > I am using the connected components function of GraphX (on Spark 1.0.2) on some > graph. However, for some reason it fails with StackOverflowError. The graph > is not too big; it contains 1 vertices and 50 edges. > > [...] > 14/10/28 16:08:

Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-10-28 Thread Shailesh Birari
Can anyone please help me here ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-tp16989p17552.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Yarn-Client Python

2014-10-28 Thread TJ Klein
Hi there, I am trying to run Spark on YARN managed cluster using Python (which requires yarn-client mode). However, I cannot get it running (same with example apps). Using spark-submit to launch the script I get the following warning: WARN cluster.YarnClientClusterScheduler: Initial job has not

Re: Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Helena Edelson
Ah excellent! I will be sure to check if we need to update our documentation based on your feedback :) Cheers, Helena On Oct 28, 2014, at 3:03 PM, Harold Nguyen wrote: > Hi Helena, > > It's great to e-meet you! I've actually been following along your blogs and > talks trying to get this t

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Kevin Markey
Don't be too concerned about the Scala hoop. Before making the commitment to Scala, I had coded up a modest analytic prototype in Hadoop MapReduce. Once making the commitment, it took 10 days to (1) learn enough Scala, and (2) re-write the prototype in Spark in Scala.

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Ankur Dave
At 2014-10-25 08:56:34 +0530, Arpit Kumar wrote: > GraphLoader1.scala:49: error: class EdgePartitionBuilder in package impl > cannot be accessed in package org.apache.spark.graphx.impl > [INFO] val builder = new EdgePartitionBuilder[Int, Int] Here's a workaround: 1. Copy and modify the Gra

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Mark Hamstra
I believe that you are overstating your case. If you want to work with Spark, then the Java API is entirely adequate with a very few exceptions -- unfortunately, though, one of those exceptions is with something that you are interested in, JdbcRDD. If you want to work on Spark -- customizing

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
Try: "address.city".attr On Tue, Oct 28, 2014 at 8:30 AM, Brett Antonides wrote: > Hello, > > Given the following example customers.json file: > { > "name": "Sherlock Holmes", > "customerNumber": 12345, > "address": { > "street": "221b Baker Street", > "city": "London", > "zipcode": "NW1 6XE", >

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread Michael Armbrust
DISTRIBUTE BY only promises that data will be collocated, but does not create a partition for each value. You are probably looking for Dynamic Partitions, which was recently merged into HiveContext. On Tue, Oct 28, 2014 at 11:49

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
The overridable methods of RDD are marked as @DeveloperApi, which means that these are internal APIs used by people that might want to extend Spark, but are not guaranteed to remain stable across Spark versions (unlike Spark's public APIs). BTW, if you want a way to do this that does not involv

RE: Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I interpret this to mean you have to learn Scala in order to work with Spark in Scala (goes without saying) and also to work with Spark in Java (since you have to jump through some hoops for basic functionality). The best path here is to take this as a learning opportunity and sit down and learn

Re: Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Harold Nguyen
Hi Helena, It's great to e-meet you! I've actually been following along your blogs and talks trying to get this to work. I just solved it, and you were absolutely correct. I've been using 1.1.0-alpha3 as my dependency, but my assembly is the 1.2.0-SNAPSHOT. Thanks for looking through all my othe

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has Java-fri
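A rough sketch of the wrapper approach described above; the JDBC URL, query, bounds, and table name are placeholders:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.rdd.JdbcRDD

    // Keep the Scala-only pieces (the () => Connection and ResultSet => T functions)
    // in Scala, and hand Java code a JavaRDD it can use directly.
    object JdbcRddBridge {
      def loadIds(sc: SparkContext, url: String): JavaRDD[Int] = {
        val rdd = new JdbcRDD(
          sc,
          () => DriverManager.getConnection(url),
          "SELECT id FROM my_table WHERE id >= ? AND id <= ?", // the two ?s bind the partition bounds
          1L, 1000000L, 10,                                    // lower bound, upper bound, partitions
          (rs: ResultSet) => rs.getInt(1))
        JavaRDD.fromRDD(rdd)
      }
    }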

Re: Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Helena Edelson
Hi Harold, It seems like, based on your previous post, you are using one version of the connector as a dependency yet building the assembly jar from master? You were using 1.1.0-alpha3 (you can upgrade to alpha4, beta coming this week) yet your assembly is spark-cassandra-connector-assembly-1.2.

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions guys?? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-JavaSchemaRDD-inherit-the-Hive-partitioning-of-data-tp17410p17539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Is Spark in Java a bad idea?

2014-10-28 Thread critikaled
Hi Ron, whatever API you have in Scala you can possibly use from Java. Scala is interoperable with Java and vice versa. Scala, being both object-oriented and functional, will make your job easier on the JVM, and it is more concise than Java. Take it as an opportunity and start learning Scala ;). -

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
If I put all the jar files from my local hive in the front of the spark class path, a different error was reported, as follows: 14/10/28 18:29:40 ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: PLAIN auth failed: null at org.apache.hadoop.security.S

Re: RDD to Multiple Tables SparkSQL

2014-10-28 Thread critikaled
Hi Oliver, thanks for the answer. I don't have the information on all keys beforehand. The reason I want to have multiple tables is that, based on my information on a known key, I will apply different queries and get the results for that particular key. I don't want to touch the unknown ones; I'll save that f

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
I am using Hadoop 2.5.0.3 and spark 1.1. My local hive version is 0.12.3 the hcatalog.jar of which is included in the path. The stack trace is as follows: 14/10/28 18:24:24 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlExceptio

Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I haven't learned Scala yet so as you might imagine I'm having challenges working with Spark from the Java API. For one thing, it seems very limited in comparison to Scala. I ran into a problem really quick. I need to hydrate an RDD from JDBC/Oracle and so I wanted to use the JdbcRDD. But that i

Re: Ending a job early

2014-10-28 Thread Patrick Wendell
Hey Jim, There are some experimental (unstable) API's that support running jobs which might short-circuit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126 This can be used for doing online aggregations like you are describing. And in one

Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Harold Nguyen
Hi all, The following works fine when submitting dependency jars through Spark-Shell: ./bin/spark-shell --master spark://ip-172-31-38-112:7077 --jars /home/ubuntu/spark-cassandra-connector/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
Here's the answer I got from Akka's user ML. """ This looks like a binary incompatibility issue. As far as I know Spark is using a custom built Akka and Scala for various reasons. You should ask this on the Spark mailing list, Akka is binary compatible between major versions (2.3.6 is compatible

Re: Spark Streaming and Storm

2014-10-28 Thread critikaled
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-Storm-tp9118p17530.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: JdbcRDD in Java

2014-10-28 Thread Sean Owen
That declaration looks OK for Java 8, at least when I tried it just now vs master. The only thing I see wrong here is that getInt throws an exception, which means the lambda has to be more complicated than this. This is Java code here calling the constructor, so yes, it can work fine from Java (8). On Tue

Re: real-time streaming

2014-10-28 Thread ll
thanks jay. do you think spark is a good fit for handling streaming & analyzing videos in real time? in this case, we're streaming 30 frames per second, and each frame is an image (size: roughly 500K - 1MB). we need to analyze every frame and return the analysis result back instantly in real ti

Re: real-time streaming

2014-10-28 Thread jay vyas
A REAL TIME stream, by definition, delivers data every X seconds. You can easily do this with Spark. Roughly, here is the way to create a stream gobbler and attach a Spark app to read its data every X seconds - Write a Runnable thread which reads data from a source. Test that it works indepen

real-time streaming

2014-10-28 Thread ll
the spark tutorial shows that we can create a stream that reads "new files" from a directory. that seems to have some lag time, as we have to write the data to file first and then wait until spark stream picks it up. what is the best way to implement REAL 'REAL-TIME' streaming for analysis in r

Re: Submitting Spark application through code

2014-10-28 Thread Matt Narrell
Can this be done? Can I just spin up a SparkContext programmatically, point it at my yarn-cluster, and have it work like spark-submit? Doesn't (at least) the application JAR need to be distributed to the workers via HDFS or the like for the jobs to run? mn > On Oct 28, 2014, at 2:29 AM, Akhi

Re: Spark to eliminate full-table scan latency

2014-10-28 Thread Matt Narrell
I’ve been puzzled by this lately. I too would like to use the thrift server to provide JDBC style access to datasets via SparkSQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use SparkSQL to query those tables while the contex

Re: Keep state inside map function

2014-10-28 Thread Koert Kuipers
Doing cleanup in an iterator like that assumes the iterator always gets fully read, which is not necessarily the case (for example, RDD.take does not). Instead I would use mapPartitionsWithContext, in which case you can write a function of the form f: (TaskContext, Iterator[T]) => Iterator[U]. Now
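A hedged sketch of the shape described above, written against the Spark 1.x API (addOnCompleteCallback; newer releases use task-completion listeners instead); rdd is assumed to be an existing RDD[String] and the writer is a stand-in resource:

    import org.apache.spark.TaskContext

    // The registered callback runs when the task finishes, even if the iterator
    // is never fully drained (e.g. by take()).
    val out = rdd.mapPartitionsWithContext { (ctx: TaskContext, iter: Iterator[String]) =>
      val writer = new java.io.StringWriter()         // stand-in for a real resource
      ctx.addOnCompleteCallback(() => writer.close())
      iter.map { line => writer.write(line); line.toUpperCase }
    }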

Re: Scala Spark IDE help

2014-10-28 Thread andy petrella
Also, I'm following two master's students at the University of Liège (one computing conditional probability densities on massive data and the other implementing a Markov chain method on georasters). I proposed that they use the Spark-Notebook to learn the framework, and they're quite happy with it (so far at lea

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread Ilya Ganelin
In Spark, certain functions have an optional parameter to determine the number of partitions (distinct, textFile, etc.). You can also use the coalesce() or repartition() functions to change the number of partitions for your RDD. Thanks. On Oct 28, 2014 9:58 AM, "shahab" wrote: > Thanks for the u
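A sketch of the options above, assuming the spark-shell's sc; the path is a placeholder:

    val fine   = sc.textFile("hdfs:///path/to/data", 64) // request ~64 partitions at load time
    val merged = fine.coalesce(8)                        // shrink the partition count without a shuffle
    val spread = fine.repartition(128)                   // grow/rebalance with a shuffle
    println((fine.partitions.length, merged.partitions.length, spread.partitions.length))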

Re: Scala Spark IDE help

2014-10-28 Thread Matt Narrell
So, I'm using IntelliJ 13.x and Scala Spark jobs. Make sure you have singletons (objects, not classes), then simply debug the main function. You'll need to set your master to some derivation of "local", but that's it. Spark Streaming is kinda wonky when debugging, but data-at-rest behaves like

Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba, It sounds like the API you are looking for is sc.wholeTextFiles :) Cheers, Holden :) On Tuesday, October 28, 2014, Csaba Ragany wrote: > Dear Spark Community, > > Is it possible to convert text files (.log or .txt files) into > sequencefiles in Python? > > Using PySpark I can create

Re: Saving to Cassandra from Spark Streaming

2014-10-28 Thread Gerard Maas
Looks like you're having some classpath issues. Are you providing your spark-cassandra-driver classes to your job? sparkConf.setJars(Seq(jars...)) ? On Tue, Oct 28, 2014 at 5:34 PM, Harold Nguyen wrote: > Hi all, > > I'm having trouble troubleshooting this particular block of code for Spark > S
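A minimal sketch of the suggestion above; the assembly path is a placeholder:

    import org.apache.spark.SparkConf

    // Ship the connector classes with the job so the executors can load them.
    val conf = new SparkConf()
      .setAppName("streaming-to-cassandra")
      .setJars(Seq("/path/to/spark-cassandra-connector-assembly.jar"))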

JdbcRDD in Java

2014-10-28 Thread Ron Ayoub
The following line of code is indicating the constructor is not defined. The only examples I can find of usage of JdbcRDD is Scala examples. Does this work in Java? Is there any examples? Thanks. JdbcRDD rdd = new JdbcRDD(sp, () -> ods.getConnection(), sql, 1, 1783059, 1

Saving to Cassandra from Spark Streaming

2014-10-28 Thread Harold Nguyen
Hi all, I'm having trouble troubleshooting this particular block of code for Spark Streaming and saving to Cassandra: val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x,

Re: install sbt

2014-10-28 Thread Soumya Simanta
sbt is just a jar file. So you really don't need to install anything. Once you run the jar file (sbt-launch.jar) it can download the required dependencies. I use an executable script called sbt that has the following contents. SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled -

Re: install sbt

2014-10-28 Thread Nicholas Chammas
If you're just calling sbt from within the spark/sbt folder, it should download and install automatically. Nick On Tuesday, October 28, 2014, Ted Yu wrote: > Have you read this ? > http://lancegatlin.org/tech/centos-6-install-sbt > > On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto < > rpagli...@app

Re: install sbt

2014-10-28 Thread Ted Yu
Have you read this ? http://lancegatlin.org/tech/centos-6-install-sbt On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto wrote: > Is there a repo or some kind of instruction about how to install sbt for > centos? > > > > Thanks, > > >

Re: Batch of updates

2014-10-28 Thread Sean Owen
You should use foreachPartition, and take care to open and close your connection following the pattern described in: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E Within a partition, you iterate over elemen
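A sketch of the foreachPartition pattern described above, assuming the spark-shell's sc; println stands in for a real client (open the connection at the top of the partition, commit each batch, close in a finally block):

    sc.parallelize(1 to 1000).foreachPartition { iter =>
      // val conn = openConnection()                  // hypothetical client setup goes here
      iter.grouped(100).foreach { batch =>            // grouped() also yields the final remainder
        println(s"committing batch of ${batch.size}") // stand-in for send + commit
      }
      // conn.close()                                 // with a real client, close in a finally block
    }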

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Ilya Ganelin
Hi all - I've simplified the code so now I'm literally feeding in 200 million ratings directly to ALS.train. Nothing else is happening in the program. I've also tried with both the regular serializer and the KryoSerializer. With Kryo, I get the same ArrayIndex exceptions. With the regular serializ
