Re: Serialize input path

2014-09-04 Thread Sean Owen
How about: val range = Range.getRange.toString val notWorking = "path/output_{" + range +"}/*/*" On Fri, Sep 5, 2014 at 3:45 AM, jerryye wrote: > Hi, > I have a quick serialization issue. I'm trying to read a date range of input > files and I'm getting a serialization issue when using an input p
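
The trick, as a minimal sketch (the path layout and date values here are illustrative): render the date range to a plain String on the driver, so the closure only captures the string and never the non-serializable Joda formatter.

    import org.joda.time.LocalDate
    import org.joda.time.format.DateTimeFormat

    // Built entirely on the driver; the RDD only ever sees the resulting String.
    val fmt = DateTimeFormat.forPattern("yyyy-MM-dd")
    val days = (0 until 7).map(d => fmt.print(new LocalDate(2014, 9, 1).plusDays(d)))
    val path = "path/output_{" + days.mkString(",") + "}/*/*"
    val input = sc.textFile(path)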

Re: EC2 - JNI crashes JVM with multi core instances

2014-09-04 Thread Iriasthor
Hi, thanks for the reply. Actually, I think you are correct about the native library not being thread safe. However, from what I understand, different cores should start different processes, these being independent of one another. If I am not mistaken, the thread safety issue would start being crit

NotSerializableException: org.apache.spark.sql.hive.api.java.JavaHiveContext

2014-09-04 Thread Bijoy Deb
Hello All, I am trying to query a Hive table using Spark SQL from my java code,but getting the following error: *Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.hive.api.java.JavaHiveCo

Re: API to add/remove containers inside an application

2014-09-04 Thread Praveen Seluka
Mailed our list - will send it to Spark Dev On Fri, Sep 5, 2014 at 11:28 AM, Rajat Gupta wrote: > +1 on this. First step to more automated autoscaling of spark application > master... > > > On Fri, Sep 5, 2014 at 12:56 AM, Praveen Seluka > wrote: > >> +user >> >> >> >> On Thu, Sep 4, 2014 at 1

Re: Programmatically running of the Spark Jobs.

2014-09-04 Thread Vicky Kak
I get this error when i run it from IDE *** Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAG

Re: Programmatically running of the Spark Jobs.

2014-09-04 Thread Vicky Kak
I don't want to use YARN or Mesos, just trying the standalone spark cluster. We need a way to do seamless submission with the API, which I don't see. To my surprise I was hit by this issue when I tried running the submit from another machine; it is crazy that I have to submit the job from the worked

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-04 Thread Jiusheng Chen
Thanks DB. Did you mean this? spark.rdd.compress true On Thu, Sep 4, 2014 at 2:48 PM, DB Tsai wrote: > For saving the memory, I recommend you compress the cached RDD, and it > will be couple times smaller than original data sets. > > > Sincerely, > > DB Tsai >
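
For reference, a minimal way to wire that flag up (master, app name and input path are placeholders); note that spark.rdd.compress only affects partitions cached in serialized form:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setMaster("local[*]")                       // or whatever master you submit to
      .setAppName("compressed-cache-demo")
      .set("spark.rdd.compress", "true")           // compress serialized cached partitions
    val sc = new SparkContext(conf)

    // MEMORY_ONLY_SER stores partitions as serialized bytes, which is what gets compressed.
    val cached = sc.textFile("hdfs:///path/to/training/data").persist(StorageLevel.MEMORY_ONLY_SER)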

Re: Programmatically running of the Spark Jobs.

2014-09-04 Thread Guru Medasani
I am able to run Spark jobs and Spark Streaming jobs successfully via YARN on a CDH cluster. When you say YARN isn’t quite there yet, do you mean submitting the jobs programmatically, or just in general? On Sep 4, 2014, at 1:45 AM, Matt Chu wrote: > https://github.com/spark-jobserver/spark-jo

Serialize input path

2014-09-04 Thread jerryye
Hi, I have a quick serialization issue. I'm trying to read a date range of input files and I'm getting a serialization error when using an input path where an object generates the date range. Specifically, my code uses DateTimeFormat in the Joda time package, which is not serializable. How do I get

Re: Programmatically running of the Spark Jobs.

2014-09-04 Thread ericacm
Ahh - that probably explains an issue I am seeing. I am a brand new user and I tried running the SimpleApp class that is on the Quick Start page (http://spark.apache.org/docs/latest/quick-start.html). When I use conf.setMaster("local") then I can run the class directly from my IDE. But when I tr

How spark parallelize maps Slices to tasks/executors/workers

2014-09-04 Thread Mozumder, Monir
I have this 2-node cluster setup, where each node has 4 cores. MASTER (Worker-on-master) (Worker-on-node1) (slaves(master,node1)) SPARK_WORKER_INSTANCES=1 I am trying to understand Spark's parallelize behavior. The sparkPi example has this code: val slices = 8
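
As a quick check of how slices map to tasks (run from spark-shell against the cluster): the second argument to parallelize becomes the partition count, and the first stage runs one task per partition, so 8 slices keep all 8 cores (2 nodes x 4 cores) busy at once.

    val slices = 8
    val rdd = sc.parallelize(1 to 100000, slices)
    println(rdd.partitions.length)   // 8 partitions => 8 tasks in the first stage
    // The application UI on port 4040 shows those 8 tasks spread over the workers.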

RE: SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread Cheng, Hao
Hive can launch another job with a strategy to merge the small files; probably we can also do that in a future release. From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, September 05, 2014 8:59 AM To: DanteSama Cc: u...@spark.incubator.apache.org Subject: Re: SchemaRDD - Parqu

Re: advice on spark input development - python or scala?

2014-09-04 Thread Tobias Pfeiffer
Hi, On Thu, Sep 4, 2014 at 11:49 PM, Johnny Kelsey wrote: > As a concrete example, we have a python class (part of a fairly large > class library) which, as part of its constructor, also creates a record of > itself in the cassandra key space. So we get an initialised class & a row > in a table

RE: TimeStamp selection with SparkSQL

2014-09-04 Thread Cheng, Hao
There are 2 SQL dialects: one is a very basic SQL support and the other is Hive QL. In most cases I think people prefer using HQL, which also means you have to use HiveContext instead of SQLContext. In the particular query you showed, it seems datetime is of type Date; unfortunately, ne
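
A minimal sketch of the HiveContext route described above (table and column names are taken from the question, not verified; hql was the HiveQL entry point in the Spark 1.0/1.1 API):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // HiveQL supports far more syntax than the basic SQL parser, including this predicate.
    val jan1 = hiveContext.hql("SELECT * FROM Stocks WHERE datetime = '2014-01-01'")
    jan1.take(5).foreach(println)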

Re: SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread Michael Armbrust
It depends on the RDD in question exactly where the work will be done. I believe that if you do a repartition(1) instead of a coalesce it will force a shuffle so the work will be done distributed and then a single node will read that shuffled data and write it out. If you want to write to a single
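
Roughly, the distinction being drawn (assuming rdd is the SchemaRDD being written; the table name is hypothetical):

    // coalesce(1): no shuffle, so all the upstream work collapses onto a single task.
    rdd.coalesce(1).insertInto("parquet_table")

    // coalesce(1, shuffle = true) is what repartition(1) does under the hood: the upstream
    // work stays distributed, and one task then reads the shuffled data and writes the file.
    rdd.coalesce(1, shuffle = true).insertInto("parquet_table")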

Re: spark RDD join Error

2014-09-04 Thread Chris Fregly
specifically, you're picking up the following implicit: import org.apache.spark.SparkContext.rddToPairRDDFunctions (in case you're a wildcard-phobe like me) On Thu, Sep 4, 2014 at 5:15 PM, Veeranagouda Mukkanagoudar < veera...@gmail.com> wrote: > Thanks a lot, that fixed the issue :) > > > On
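
A minimal before/after sketch of the fix (this is the pre-1.3 requirement that pair-RDD operations such as join need the implicit conversion imported explicitly):

    import org.apache.spark.SparkContext._   // or just SparkContext.rddToPairRDDFunctions

    val rddA = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val rddB = sc.parallelize(Seq(("a", 10), ("b", 20)))
    // Without the import above: "value join is not a member of RDD[(String, Int)]".
    val joined = rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }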

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
BTW you can also use rdd.partitions() to get a list of Partition objects and see how many there are. On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Partitioners also work in local mode, the only question is how to see which data fell into each partition, sin
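
Both suggestions in two lines, from the shell:

    val rdd = sc.parallelize(1 to 10, 3)
    println(rdd.partitions.length)                                  // how many partitions exist
    rdd.glom().collect().foreach(a => println(a.mkString(",")))     // what landed in each one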

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
Partitioners also work in local mode, the only question is how to see which data fell into each partition, since most RDD operations hide the fact that it's partitioned. You can do rdd.glom().collect() -- the glom() operation turns an RDD of elements of type T into an RDD of List, with a separat

Re: spark RDD join Error

2014-09-04 Thread Veeranagouda Mukkanagoudar
Thanks a lot, that fixed the issue :) On Thu, Sep 4, 2014 at 4:51 PM, Zhan Zhang wrote: > Try this: > Import org.apache.spark.SparkContext._ > > Thanks. > > Zhan Zhang > > > On Sep 4, 2014, at 4:36 PM, Veeranagouda Mukkanagoudar > wrote: > > I am planning to use RDD join operation, to test out

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Steve Lewis
Assume I define a partitioner like:

    /** partition on the first letter */
    public class PartitionByStart extends Partitioner {
        @Override public int numPartitions() { return 26; }
        @Override public int getPartition(final Object key) { Strin
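
A Scala sketch of the same idea, in case it helps (assumes non-empty string keys; clamping non-letter keys to the 0..25 range is an illustration, not part of the original):

    import org.apache.spark.Partitioner

    class PartitionByStart extends Partitioner {
      override def numPartitions: Int = 26
      override def getPartition(key: Any): Int = {
        val c = key.toString.head.toUpper
        ((c - 'A') max 0) min 25        // bucket by first letter, clamped to 0..25
      }
    }
    // pairs.partitionBy(new PartitionByStart)   // pairs: RDD[(String, _)], hypothetical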

Re: spark RDD join Error

2014-09-04 Thread Zhan Zhang
Try this: import org.apache.spark.SparkContext._ Thanks. Zhan Zhang On Sep 4, 2014, at 4:36 PM, Veeranagouda Mukkanagoudar wrote: > I am planning to use RDD join operation, to test out i was trying to compile > some test code, but am getting following compilation error > > value join is n

spark RDD join Error

2014-09-04 Thread Veeranagouda Mukkanagoudar
I am planning to use the RDD join operation; to test it out I was trying to compile some test code, but am getting the following compilation error *value join is not a member of org.apache.spark.rdd.RDD[(String, Int)]* *[error] rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }* Code: import org.apac

TimeStamp selection with SparkSQL

2014-09-04 Thread Benjamin Zaitlen
I may have missed this but is it possible to select on datetime in a SparkSQL query? jan1 = sqlContext.sql("SELECT * FROM Stocks WHERE datetime = '2014-01-01'") Additionally, is there a guide as to what SQL is valid? The guide says, "Note that Spark SQL currently uses a very basic SQL parser" It

RE: Reduce truncates RDD in standalone, but fine when local.

2014-09-04 Thread Ruebenacker, Oliver A
Hello, I tracked it down to the field nIters being uninitialized when passed to the reduce job while running standalone, but initialized when running local. Must be some strange interaction between Spark and scala.App. If I move the reduce job into a method and make nIters a local field
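
A sketch of that workaround: avoid scala.App's delayed initialization altogether and keep the value a plain local in main, so it is already set when the closure is serialized (names below are illustrative, not the original app):

    import org.apache.spark.{SparkConf, SparkContext}

    object SqrtApp {
      def main(args: Array[String]): Unit = {
        val nIters = 20                    // local val: captured by value, never uninitialized
        val sc = new SparkContext(new SparkConf().setAppName("sqrt-demo"))   // master set via spark-submit
        sc.parallelize(1 to nIters).map(math.sqrt(_)).collect().foreach(println)
        sc.stop()
      }
    }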

Re: SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread DanteSama
Yep, that worked out. Does this solution have any performance implications past all the work being done on (probably) 1 node? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-Parquet-insertInto-makes-many-files-tp13480p13501.html Sent from the Apach

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-09-04 Thread Aris
Thanks for answering, Daniil - I have SBT version 0.13.5; is that an old version? Seems pretty up-to-date. It turns out I figured out a way around this entire problem: just use 'sbt package', and when using bin/spark-submit, pass it the "--jars" option and GIVE IT ALL THE JARS from the local ivy2 c

Re: Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
Okay, Obviously I don't care about adding more files to the system so is there a way to point to an existing parquet file (directory) and seed the individual "part-r-***.parquet" (the value of "partition + offset") while preventing I mean, I can hack it by copying files into the same parquet dir

Getting the type of an RDD in spark AND pyspark

2014-09-04 Thread esamanas
Hi, I'm new to spark and scala, so apologies if this is obvious. Every RDD appears to be typed, which I can see by seeing the output in the spark-shell when I execute 'take': scala> val t = sc.parallelize(Array(1,2,3)) t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at

Reduce truncates RDD in standalone, but fine when local.

2014-09-04 Thread Ruebenacker, Oliver A
Hello, In the app below, when I run it with local[1] or local[3], I get the expected result - a list of the square roots of the numbers from 1 to 20. When I try the same app as standalone with one or two workers on the same machine, it will only print 1.0. Adding print statements int

Re: pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
Thanks Matei. I will take a look at SchemaRDDs. On Thu, Sep 4, 2014 at 11:24 AM, Matei Zaharia wrote: > Hi Mohit, > > This looks pretty interesting, but just a note on the implementation -- it > might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The > reason is that SchemaRDD

Re: API to add/remove containers inside an application

2014-09-04 Thread Praveen Seluka
+user On Thu, Sep 4, 2014 at 10:53 PM, Praveen Seluka wrote: > Spark on Yarn has static allocation of resources. > https://issues.apache.org/jira/browse/SPARK-3174 - This JIRA by Sandy is > about adding and removing executors dynamically based on load. Even before > doing this, can we expose an

Re: pandas-like dataframe in spark

2014-09-04 Thread Matei Zaharia
Hi Mohit, This looks pretty interesting, but just a note on the implementation -- it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The reason is that SchemaRDDs already have an efficient in-memory representation (columnar storage), and can be read from a variety of data

Re: Viewing web UI after fact

2014-09-04 Thread Andrew Or
Hi Grzegorz, Sorry for the late response. Unfortunately, if the Master UI doesn't know about your applications (they are "completed" with respect to a different Master), then it can't regenerate the UIs even if the logs exist. You will have to use the history server for that. How did you start th

Re: Web UI

2014-09-04 Thread Andrew Or
Hi all, The JSON version of the web UI is not officially supported; I don't believe this is documented anywhere. The alternative is to set `spark.eventLog.enabled` to true before running your application. This will create JSON SparkListenerEvents with details about each task and stage as a log fi
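
As a minimal sketch, the two settings involved (the log directory is an assumption; it just needs to be a location the history server can read):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("event-log-demo")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")
    // After the application finishes, the history server can rebuild its web UI
    // from the JSON SparkListenerEvents written to that directory.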

Re: Is "cluster manager" same as "master"?

2014-09-04 Thread Andrew Or
Correct. For standalone mode, Master is your cluster manager. Spark also supports other cluster managers such as Yarn and Mesos. -Andrew 2014-09-04 5:52 GMT-07:00 Ruebenacker, Oliver A < oliver.ruebenac...@altisource.com>: > > > Hello, > > > > Is “cluster manager” mentioned here >

Re: advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Gerard Maas
Johnny, Currently, probably the easiest (and most performant) way to integrate Spark and Cassandra is using the spark-cassandra-connector [1]. Given an rdd, saving it to Cassandra is as easy as: rdd.saveToCassandra(keyspace, table, Seq(columns)) We tried many 'hand crafted' options to interact w
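
A small sketch following the call quoted above (keyspace, table and column names are placeholders, and the exact signature varies a little between connector versions):

    import com.datastax.spark.connector._

    // Assumes a table test.kv(key text, value int) already exists in Cassandra.
    val rdd = sc.parallelize(Seq(("key1", 1), ("key2", 2)))
    rdd.saveToCassandra("test", "kv", Seq("key", "value"))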

Re: Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-04 Thread Denny Lee
Ahh got it - I knew I was missing something - appreciate the clarification! :) On September 4, 2014 at 10:27:44, Cheng Lian (lian.cs@gmail.com) wrote: You may configure listening host and port in the same way as HiveServer2 of Hive, namely: via environment variables HIVE_SERVER2_THRIFT_B

Re: Support R in Spark

2014-09-04 Thread Shivaram Venkataraman
Thanks Kui. SparkR is a pretty young project, but there are a bunch of things we are working on. One of the main features is to expose a data frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will be integrating this with Spark's MLLib. At a high-level this will allow R users to use

Re: SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread Michael Armbrust
Try doing coalesce(1) on the rdd before insert into. On Thu, Sep 4, 2014 at 10:40 AM, DanteSama wrote: > It seems that running insertInto on an SchemaRDD with a ParquetRelation > creates an individual file for each item in the RDD. Sometimes, it has > multiple rows in one file, and sometimes it

SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread DanteSama
It seems that running insertInto on a SchemaRDD with a ParquetRelation creates an individual file for each item in the RDD. Sometimes, it has multiple rows in one file, and sometimes it only writes the column headers. My question is, is it possible to have it write the entire RDD as 1 file, but s

Re: Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-04 Thread Cheng Lian
You may configure listening host and port in the same way as HiveServer2 of Hive, namely:
- via environment variables: HIVE_SERVER2_THRIFT_BIND_HOST, HIVE_SERVER2_THRIFT_PORT
- via system properties: hive.server2.thrift.bind.host, hive.server2.thrift.port
Fo
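
In concrete terms, the environment-variable route looks roughly like this (host and port values are placeholders):

    export HIVE_SERVER2_THRIFT_BIND_HOST=thrift-host.example.com
    export HIVE_SERVER2_THRIFT_PORT=10001
    ./sbin/start-thriftserver.sh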

Re: Spark Streaming into HBase

2014-09-04 Thread kpeng1
Tathagata, Thanks for all the help. It looks like the blah method doesn't need to be wrapped around a serializable object, but the main streaming calls do. I am currently running everything from spark-shell so I did not have a main function and object to wrap the streaming, map, and foreach call

Re: 2 python installations cause PySpark on Yarn problem

2014-09-04 Thread Andrew Or
Since you're using YARN, you may also need to set SPARK_YARN_USER_ENV to "PYSPARK_PYTHON=/your/desired/python/on/slave/nodes". 2014-09-04 9:59 GMT-07:00 Davies Liu : > Hey Oleg, > > In pyspark, you MUST have the same version of Python in all the > machines of the cluster, > which means when you
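
Put together, a rough sketch of both settings (the Anaconda path is hypothetical; the point is that the driver and the YARN containers resolve the same Python):

    export PYSPARK_PYTHON=/opt/anaconda/bin/python
    export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/opt/anaconda/bin/python"
    ./bin/pyspark --master yarn-client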

Re: Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-04 Thread Gurvinder Singh
I want to add that there is a regression when using pyspark to read data from HDFS. Its performance during map tasks has dropped by roughly half (approx. 1 -> 0.5x). I have tested 1.0.2 and the performance was fine, but the 1.1 release candidate has this issue. I tested by setting the following properties to make

Re: 2 python installations cause PySpark on Yarn problem

2014-09-04 Thread Davies Liu
Hey Oleg, In pyspark, you MUST have the same version of Python in all the machines of the cluster, which means when you run `python` on these machines, all of them should be the same version ( 2.6 or 2.7). With PYSPARK_PYTHON, you can run pyspark with a specified version of Python. Also, you shou

spark streaming - saving DStream into HBASE doesn't work

2014-09-04 Thread salemi
Hi, I am using the following code to write data to HBase. I see the jobs are sent off but I never get anything in my HBase database, and Spark doesn't throw any error. How can such a problem be debugged? Is the code below correct for writing data to HBase? val conf = HBaseConfiguration.create(

Re: Object serialisation inside closures

2014-09-04 Thread Mohit Jaggi
I faced the same problem and ended up using the same approach that Sean suggested https://github.com/AyasdiOpenSource/df/blob/master/src/main/scala/com/ayasdi/df/DF.scala#L313 Option 3 also seems reasonable. It should create a CSVParser per executor. On Thu, Sep 4, 2014 at 6:58 AM, Andrianasolo

Re: EC2 - JNI crashes JVM with multi core instances

2014-09-04 Thread maxpar
Hi, What is the setup of your native library? Probably it is not thread safe? Thanks, Max -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EC2-JNI-crashes-JVM-with-multi-core-instances-tp13463p13470.html Sent from the Apache Spark User List mailing list ar

Re: spark sql results maintain order (in python)

2014-09-04 Thread Davies Liu
On Thu, Sep 4, 2014 at 3:42 AM, jamborta wrote: > hi, > > I ran into a problem with spark sql, when run a query like this "select > count(*), city, industry from table group by hour" and I would like to take > the results from the shemaRDD > > 1, I have to parse each line to get the values out of

Re: advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Mohit Jaggi
Johnny, Without knowing the domain of the problem it is hard to choose a programming language. I would suggest you ask yourself the following questions: - What if your project depends on a lot of python libraries that don't have Scala/Java counterparts? It is unlikely but possible. - What if Python

efficient zipping of lots of RDDs

2014-09-04 Thread Mohit Jaggi
Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL sugar), as column based operations seem to be most common. But row operations are not uncommon. To get rows out of columns right now I zip the column RDDs to

pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
Folks, I have been working on a pandas-like dataframe DSL on top of spark. It is written in Scala and can be used from spark-shell. The APIs have the look and feel of pandas which is a wildly popular piece of software data scientists use. The goal is to let people familiar with pandas scale their e

2 python installations cause PySpark on Yarn problem

2014-09-04 Thread Oleg Ruchovets
Hi, I am evaluating PySpark. I have HDP (Hortonworks) installed with Python 2.6.6 (I can't remove it since it is used by Hortonworks). I can successfully execute PySpark on YARN. We need to use Anaconda packages, so I installed Anaconda. Anaconda is installed with Python 2.7.7 and it is a

Setting Java properties for Standalone on Windows 7?

2014-09-04 Thread Ruebenacker, Oliver A
Hello, I'm running Spark on Windows 7 as standalone, with everything on the same machine. No Hadoop installed. My app throws exception and worker reports: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. I had the same problem earlier when deploying local. I u

EC2 - JNI crashes JVM with multi core instances

2014-09-04 Thread Iriasthor
Hi, I am trying to run a custom Spark application on a Spark standalone cluster on Amazon's EC2 infrastructure. So far I have successfully executed the application on several m1.medium instances (each with one core). However, when I try executing the very same application on some c1.medium instanc

advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Johnny Kelsey
Hi guys, We're testing out a spark/cassandra cluster, & we're very impressed with what we've seen so far. However, I'd very much like some advice from the shiny brains on the mailing list. We have a large collection of python code that we're in the process of adapting to move into spark/cassandra

re: advice on spark input development - python or scala?

2014-09-04 Thread Johnny Kelsey
Hi guys, We're testing out a spark/cassandra cluster, & we're very impressed with what we've seen so far. However, I'd very much like some advice from the shiny brains on the mailing list. We have a large collection of python code that we're in the process of adapting to move into spark/cassandra

Re: Multiple spark shell sessions

2014-09-04 Thread Dhimant
Thanks Yana, I am able to execute the application and commands via another session; I also received another port for the UI application. Thanks, Dhimant -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-spark-shell-sessions-tp13441p13459.html Sent from the Ap

RE: Object serialisation inside closures

2014-09-04 Thread Andrianasolo Fanilo
Thank you for the quick answer, looks good to me. Though that brings me to another question: suppose we want to open a connection to a database, an ElasticSearch, etc... I now have two options: 1/ use .mapPartitions and set up the connection at the start of each partition, so I get a connect
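
A sketch of option 1, with a stand-in class in place of a real driver so the pattern is concrete (FakeConnection is invented purely for illustration):

    class FakeConnection {                          // stand-in for a real DB/ElasticSearch client
      def insert(s: String): Unit = println(s"insert: $s")
      def close(): Unit = ()
    }

    val records = sc.parallelize(Seq("row1", "row2", "row3"))
    val written = records.mapPartitions { iter =>
      val conn = new FakeConnection               // created inside the task: one per partition, never serialized
      val out = iter.map { r => conn.insert(r); r }.toList   // materialize before closing the connection
      conn.close()
      out.iterator
    }
    written.count()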

Re: Multiple spark shell sessions

2014-09-04 Thread Yana Kadiyska
These are just warnings from the web server. Normally your application will have a UI page on port 4040. In your case, a little after the warning it should bind just fine to another port (mine picked 4041). I'm running on 0.9.1. Do you actually see the application failing? The main thing when runnin

Re: Spark processes not doing on killing corresponding YARN application

2014-09-04 Thread didata
Thanks for asking this. I've had this issue with pyspark on YARN too, 100% of the time: I quit out of pyspark and, while my Unix shell prompt returns, a 'yarn application -list' always shows (as does the UI) that the application is still running (or at least not totally dead). When I then log onto

Re: Object serialisation inside closures

2014-09-04 Thread Yana Kadiyska
In the third case the object does not get shipped around. Each executor will create its own instance. I got bitten by this here: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-object-access-from-mapper-simple-question-tt8125.html On Thu, Sep 4, 2014 at 9:29 AM, Andrianasolo Fanil

Re: Object serialisation inside closures

2014-09-04 Thread Sean Owen
In your original version, the object is referenced by the function but it's on the driver, and so has to be serialized. This leads to an error since it's not serializable. Instead, you want to recreate the object locally on each of the remote machines. In your third version you are holding the par

[Spark Streaming] Tracking/solving 'block input not found'

2014-09-04 Thread Gerard Maas
Hello Sparkers, I'm currently running load tests on a Spark Streaming job. When the task duration increases beyond the batchDuration the job become unstable. In the logs I see tasks failed with the following message: Job aborted due to stage failure: Task 266.0:1 failed 4 times, most recent failu

Object serialisation inside closures

2014-09-04 Thread Andrianasolo Fanilo
Hello Spark fellows :) I'm a new user of Spark and Scala and have been using both for 6 months without too many problems. Here I'm looking for best practices for using non-serializable classes inside closures. I'm using Spark-0.9.0-incubating here with Hadoop 2.2. Suppose I am using OpenCSV pars

Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
Hello all, I've been trying to figure out how to add data to an existing Parquet file without having a schema. Spark has allowed me to load JSON and save it as a Parquet file but I was wondering if anyone knows how to ADD/INSERT more data. I tried using sql insert and that doesn't work. All of t

Re: Programmatically running of the Spark Jobs.

2014-09-04 Thread Vicky Kak
I don't think so. On Thu, Sep 4, 2014 at 5:36 PM, Ruebenacker, Oliver A < oliver.ruebenac...@altisource.com> wrote: > > > Hello, > > > > Can this be used as a library from within another application? > > Thanks! > > > > Best, Oliver > > > > *From:* Matt Chu [mailto:m...@kabam.com]

Is "cluster manager" same as "master"?

2014-09-04 Thread Ruebenacker, Oliver A
Hello, Is "cluster manager" mentioned here the same thing as "master" mentioned here? Thanks! Best, Oliver Oliver Ruebenacker | Sol

RE: Web UI

2014-09-04 Thread Ruebenacker, Oliver A
Hello, Thanks for the link – this is for standalone, though, and most URLs don’t work for local. I will look into deploying as standalone on a single node for testing and development. Best, Oliver From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Thursday, September 04,

RE: Programmatically running of the Spark Jobs.

2014-09-04 Thread Ruebenacker, Oliver A
Hello, Can this be used as a library from within another application? Thanks! Best, Oliver From: Matt Chu [mailto:m...@kabam.com] Sent: Thursday, September 04, 2014 2:46 AM To: Vicky Kak Cc: user Subject: Re: Programatically running of the Spark Jobs. https://github.com/spark-job

spark sql results maintain order (in python)

2014-09-04 Thread jamborta
hi, I ran into a problem with spark sql: when running a query like this "select count(*), city, industry from table group by hour" I would like to take the results from the SchemaRDD 1, I have to parse each line to get the values out of the dict (eg in order to convert it to a csv) 2, The order is

Re: RDDs

2014-09-04 Thread Kartheek.R
Thank you yuanbosoft. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p13444.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark processes not doing on killing corresponding YARN application

2014-09-04 Thread Hemanth Yamijala
Hi, I launched a spark streaming job under YARN using default configuration for Spark, using spark-submit with the master as yarn-cluster. It launched an ApplicationMaster, and 2 CoarseGrainedExecutorBackend processes. Everything ran fine, then I killed the application using yarn application -kil

Spark streaming saveAsHadoopFiles API question

2014-09-04 Thread Hemanth Yamijala
Hi, I extended the Spark streaming wordcount example to save files to Hadoop file system - just to test how that interface works. In doing so, I ran into an API problem that I hope folks here can help clarify. My goal was to see how I could save the final word counts generated in each micro-batch

Multiple spark shell sessions

2014-09-04 Thread Dhimant
Hi, I am receiving the following error while connecting to the spark server via shell if one shell is already open. How can I open multiple sessions? Does anyone know about a Workflow Engine/Job Server like Apache Oozie for Spark? / Welcome to __ / __/__ ___ _/ /__ _\

Re: Iterate over ArrayBuffer

2014-09-04 Thread Ngoc Dao
> I want to iterate over the ArrayBuffer. You should get yourself familiar with methods related to the Scala collection library: https://twitter.github.io/scala_school/collections.html Almost all of the methods take a function as their parameter. This is a very convenient feature of Scala (unlike

Re: Iterate over ArrayBuffer

2014-09-04 Thread Madabhattula Rajesh Kumar
Hi Deep, If your requirement is to read the values from an ArrayBuffer, use the code below scala> import scala.collection.mutable.ArrayBuffer import scala.collection.mutable.ArrayBuffer scala> var a = ArrayBuffer(5,3,1,4) a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4) scala>
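
And for completeness, the usual ways to iterate over it:

    import scala.collection.mutable.ArrayBuffer

    val a = ArrayBuffer(5, 3, 1, 4)
    a.foreach(println)            // side-effecting iteration
    for (x <- a) println(x)       // equivalent for-comprehension
    val doubled = a.map(_ * 2)    // transform into a new collection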

Re: error: type mismatch while assigning RDD to RDD val object

2014-09-04 Thread Sean Owen
I think this is a known problem with the shell and case classes. Have a look at JIRA. https://issues.apache.org/jira/browse/SPARK-1199 On Thu, Sep 4, 2014 at 7:56 AM, Dhimant wrote: > I am receiving following error in Spark-Shell while executing following code. > > /class LogRecrod(logLine: Stri

Re: .sparkrc for Spark shell?

2014-09-04 Thread Jianshi Huang
I see. Thanks Prashant! Jianshi On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma wrote: > Hey, > > You can use spark-shell -i sparkrc, to do this. > > Prashant Sharma > > > > > On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang > wrote: > >> To make my shell experience merrier, I need to import seve
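
For example, an init script (file name and contents are just an illustration) holding the imports to run on startup:

    // contents of a file named "sparkrc", loaded with: ./bin/spark-shell -i sparkrc
    import scala.collection.mutable.{ArrayBuffer, HashMap}
    import org.apache.spark.SparkContext._
    println("sparkrc loaded")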

Iterate over ArrayBuffer

2014-09-04 Thread Deep Pradhan
Hi, I have the following ArrayBuffer *ArrayBuffer(5,3,1,4)* Now, I want to iterate over the ArrayBuffer. What is the way to do it? Thank You

Re: memory size for caching RDD

2014-09-04 Thread 牛兆捷
OK. So can I use logic similar to what the block manager does when space fills up? 2014-09-04 15:05 GMT+08:00 Liu, Raymond : > I think there is no public API available to do this. In this case, the > best you can do might be unpersist some RDDs manually. The problem is that > this is done by RDD

Re: Web UI

2014-09-04 Thread Akhil Das
Hi, You can see this doc for all the available webUI ports. Yes, there are ways to get the data metrics in JSON format; one of them is below: http://webUI:8080/json/ O

RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
I think there is no public API available to do this. In this case, the best you can do might be to unpersist some RDDs manually. The problem is that this is done per RDD, not per block. And then, if the storage level includes disk, the data on the disk will be removed too. Best Rega
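
Manually freeing a cached RDD looks like this; the blocking flag controls whether the call waits for the blocks to actually be removed:

    val cached = sc.parallelize(1 to 1000).cache()
    cached.count()                        // materialize the cache
    cached.unpersist(blocking = false)    // drop both memory and disk copies, asynchronously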