Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Akhil Das
If something is persisted you can easily see it under the Storage tab in the web UI. Thanks Best Regards On Tue, Nov 18, 2014 at 7:26 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am trying to figure out if sorting is persisted after applying Pair RDD transformations and I am
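
A minimal illustration of this (the path is hypothetical):

    val lines = sc.textFile("/some/logs").persist()
    lines.count()  // after the first action, the cached RDD appears under the web UI's Storage tab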

Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Hello, I'm writing a process that ingests json files and saves them as parquet files. The process is as such: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val jsonRequests = sqlContext.jsonFile("/requests") val parquetRequests = sqlContext.parquetFile("/requests_parquet")
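
A rough sketch of that flow end to end in the Spark 1.1 API (the output path is invented):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val jsonRequests = sqlContext.jsonFile("/requests")       // schema is inferred from the JSON
    jsonRequests.saveAsParquetFile("/requests_parquet_new")   // write the same data as parquet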

Re: Merging Parquet Files

2014-11-19 Thread Marius Soutier
You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote: Hello, I'm writing a process that ingests json files and saves them as parquet files. The process is as such:
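
A hedged sketch of that suggestion, assuming a table named requests_parquet is already registered:

    import sqlContext._
    val newRequests = sqlContext.jsonFile("/requests")
    newRequests.insertInto("requests_parquet", overwrite = false)   // append to the existing table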

Re: Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Very cool, thank you! On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier mps@gmail.com wrote: You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote: Hello,

RE: Spark to eliminate full-table scan latency

2014-11-19 Thread bchazalet
You can serve queries over your RDD data yes, and return results to the user/client as long as your driver is alive. For example, I have built a play! application that acts as a driver (creating a spark context), loads up data from my database, organizes it and subsequently receives and processes

Why is ALS class serializable?

2014-11-19 Thread Hao Ren
Hi, When reading through ALS code, I find that: class ALS private ( private var numUserBlocks: Int, private var numProductBlocks: Int, private var rank: Int, private var iterations: Int, private var lambda: Double, private var implicitPrefs: Boolean, private var

Re: Understanding spark operation pipeline and block storage

2014-11-19 Thread Hao Ren
Does anyone have an idea on this? Thx -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-spark-operation-pipeline-and-block-storage-tp18201p19263.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Efficient way to split an input data set into different output files

2014-11-19 Thread Tom Seddon
I'm trying to set up a PySpark ETL job that takes in JSON log files and spits out fact table files for upload to Redshift. Is there an efficient way to send different event types to different outputs without having to just read the same cached RDD twice? I have my first RDD which is just a json

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
Akhil, I think Aniket uses the word persisted in a different way than what you mean. I.e. not in the RDD.persist() way. Aniket asks if running combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting is preserved.) I think the answer is no. combineByKey uses AppendOnlyMap,

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
Thanks Daniel. I can understand that the keys will not be in sorted order but what I am trying to understand is whether the functions are passed values in sorted order in a given partition. For example: sc.parallelize(1 to 8).map(i => (1, i)).sortBy(t => t._2).foldByKey(0)((a, b) => b).collect

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
Ah, so I misunderstood you too :). My reading of org/apache/spark/Aggregator.scala is that your function will always see the items in the order that they are in the input RDD. An RDD partition is always accessed as an iterator, so it will not be read out of order. On Wed, Nov 19, 2014 at 2:28
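
Aniket's probe from the message above, cleaned up. On Daniel's reading, within a partition the fold sees values in the RDD's existing order, but there is no global ordering guarantee when per-partition results are combined:

    sc.parallelize(1 to 8)
      .map(i => (1, i))
      .sortBy(t => t._2)
      .foldByKey(0)((a, b) => b)   // keeps the last value seen per key
      .collect()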

Debugging spark java application

2014-11-19 Thread Mukesh Jha
Hello experts, Is there an easy way to debug a spark java application? I'm putting debug logs in the map's function but there aren't any logs on the console. Also, can I include my custom jars while launching spark-shell and do my poc there? This might be a naive question but any help here is

GraphX bug re-opened

2014-11-19 Thread Gary Malouf
We keep running into https://issues.apache.org/jira/browse/SPARK-2823 when trying to use GraphX. The cost of repartitioning the data is really high for us (lots of network traffic) which is killing the job performance. I understand the bug was reverted to stabilize unit tests, but frankly it

cannot find scala.reflect related methods when running spark program

2014-11-19 Thread Dingfei Zhang
Hi, I wrote the simple spark code below, and hit a runtime issue where it seems that the system can't find some methods of the scala reflect library. package org.apache.spark.examples import scala.io.Source import scala.reflect._ import scala.reflect.api.JavaUniverse import scala.reflect.runtime.universe

Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread YaoPau
I joined two datasets together, and my resulting logs look like this: (975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith))) (253142991,((30058,20141112T171246,web,ATLANTA16,GA,US,Southeast),(Male,Bob,Jones)))

Re: Why is ALS class serializable?

2014-11-19 Thread Cheng Lian
When a field of an object is enclosed in a closure, the object itself is also enclosed automatically, thus the object needs to be serializable. On 11/19/14 6:39 PM, Hao Ren wrote: Hi, When reading through ALS code, I find that: class ALS private ( private var numUserBlocks: Int,
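
A generic sketch (not the actual ALS code) of why a captured field drags the whole object into the closure, and the usual local-copy fix:

    import org.apache.spark.rdd.RDD

    class Model(var rank: Int) extends Serializable {
      def scale(data: RDD[Double]): RDD[Double] =
        data.map(_ * rank)   // `rank` is a field, so `this` is captured: Model must be serializable

      def scaleWithLocalCopy(data: RDD[Double]): RDD[Double] = {
        val r = rank         // copy the field to a local val; only the Int is captured
        data.map(_ * r)
      }
    }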

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Olivier Girardot
can you please post the full source of your code and some sample data to run it on ? 2014-11-19 16:23 GMT+01:00 YaoPau jonrgr...@gmail.com: I joined two datasets together, and my resulting logs look like this: (975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith)))

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
Thanks Daniel :-). It seems to make sense and something I was hoping for. I will proceed with this assumption and report back if I see any anomalies. On Wed Nov 19 2014 at 19:30:02 Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Ah, so I misunderstood you too :). My reading of org/

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
I have for now submitted a JIRA ticket @ https://issues.apache.org/jira/browse/SPARK-4473. I will collate all my experiences (and hacks) and submit them as a feature request for public API. On Tue Nov 18 2014 at 20:35:00 andy petrella andy.petre...@gmail.com wrote: yep, we should also propose to

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
Thanks for pointing this out Mark. Had totally missed the existing JIRA items On Wed Nov 19 2014 at 21:42:19 Mark Hamstra m...@clearstorydata.com wrote: This is already being covered by SPARK-2321 and SPARK-4145. There are pull requests that are already merged or already very far along --

Re: Getting spark job progress programmatically

2014-11-19 Thread Mark Hamstra
This is already being covered by SPARK-2321 and SPARK-4145. There are pull requests that are already merged or already very far along -- e.g., https://github.com/apache/spark/pull/3009 If there is anything that needs to be added, please add it to those issues or PRs. On Wed, Nov 19, 2014 at
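
Until those PRs land, a hedged sketch of the listener-based approach people have been using to track progress from the driver:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    val tasksDone = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        tasksDone.incrementAndGet()   // poll this counter from another thread to report progress
      }
    })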

Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Hi, I'm starting with Spark and I'm just trying to understand: if I want to use Spark Streaming, should I feed it with Flume or Kafka? I think there's no official sink from Flume to Spark Streaming and it seems that Kafka fits better since it gives you reliability. Could someone give a good

Re: Spark on YARN

2014-11-19 Thread Alan Prando
Hi all! Thanks for answering! @Sean, I tried to run with 30 executor-cores, and 1 machine was still not processing. @Vanzin, I checked RM's web UI, and all nodes were detected and RUNNING. The interesting fact is that the available memory and cores of 1 node were different from the other 2, with

tableau spark sql cassandra

2014-11-19 Thread jererc
Hello! I'm working on a POC where I'm trying to get data from Cassandra into Tableau using Spark SQL. Here is the stack: - cassandra (v2.1) - spark SQL (pre-built v1.1, hadoop v2.4) - cassandra / spark sql connector (https://github.com/datastax/spark-cassandra-connector) - hive -

[SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Yana Kadiyska
Hi all, I am running HiveThriftServer2 and noticed that the process stays up even though there is no driver connected to the Spark master. I started the server via sbin/start-thriftserver and my namenodes are currently not operational. I can see from the log that there was an error on startup:

Re: Debugging spark java application

2014-11-19 Thread Akhil Das
For debugging you can refer these two threads http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-hit-breakpoints-using-IntelliJ-In-functions-used-by-an-RDD-td12754.html

Re: [SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Michael Armbrust
This is not by design. Can you please file a JIRA? On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I am running HiveThriftServer2 and noticed that the process stays up even though there is no driver connected to the Spark master. I started the server

Re: SparkSQL and Hive/Hive metastore testing - LocalHiveContext

2014-11-19 Thread Michael Armbrust
On Tue, Nov 18, 2014 at 10:34 PM, Night Wolf nightwolf...@gmail.com wrote: Is there a better way to mock this out and test Hive/metastore with SparkSQL? I would use TestHive which creates a fresh metastore each time it is invoked.
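
A hedged sketch of what using TestHive can look like (it ships in the spark-hive module):

    import org.apache.spark.sql.hive.test.TestHive

    TestHive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    TestHive.sql("SELECT COUNT(*) FROM src").collect()
    TestHive.reset()   // clears the temporary metastore between test runs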

Shuffle Intensive Job: sendMessageReliably failed because ack was not received within 60 sec

2014-11-19 Thread Gary Malouf
Has anyone else received this type of error? We are not sure what the issue is nor how to correct it to get our job to complete...

Re: Merging Parquet Files

2014-11-19 Thread Michael Armbrust
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv danielru...@gmail.com wrote: Another problem I have is that I get a lot of small json files and as a result a lot of small parquet files, I'd like to merge the json files into a few parquet files.. how I do that? You can use `coalesce` on any
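
A sketch of the coalesce suggestion (target partition count and output path are invented):

    val requests = sqlContext.jsonFile("/requests")              // many small JSON files
    requests.coalesce(4).saveAsParquetFile("/requests_merged")   // 4 larger parquet files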

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread Michael Armbrust
I am not very familiar with the JSONSerDe for Hive, but in general you should not need to manually create a schema for data that is loaded from hive. You should just be able to call saveAsParquetFile on any SchemaRDD that is returned from hctx.sql(...). I'd also suggest you check out the

Re: tableau spark sql cassandra

2014-11-19 Thread Michael Armbrust
The whole stacktrace/exception would be helpful. Hive is an optional dependency of Spark SQL, but you will need to include it if you are planning to use the thrift server to connect to Tableau. You can enable it by adding -Phive when you build Spark. You might also try asking on the cassandra

Re: Shuffle Intensive Job: sendMessageReliably failed because ack was not received within 60 sec

2014-11-19 Thread Michael Armbrust
That error can mean a whole bunch of things (and we've been working recently to make it more descriptive). Often the actual cause is in the executor logs. On Wed, Nov 19, 2014 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote: Has anyone else received this type of error? We are not sure

Re: Converting a json struct to map

2014-11-19 Thread Michael Armbrust
You can override the schema inference by passing a schema as the second argument to jsonRDD, however that's not a super elegant solution. We are considering one option to make this easier here: https://issues.apache.org/jira/browse/SPARK-4476 On Tue, Nov 18, 2014 at 11:06 PM, Akhil Das
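
A hedged sketch of passing an explicit schema to jsonRDD (the field names are invented):

    import org.apache.spark.sql._

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val parsed = sqlContext.jsonRDD(jsonLines, schema)   // jsonLines: RDD[String]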

Re: [SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Yana Kadiyska
https://issues.apache.org/jira/browse/SPARK-4497 On Wed, Nov 19, 2014 at 1:48 PM, Michael Armbrust mich...@databricks.com wrote: This is not by design. Can you please file a JIRA? On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I am running

Re: Converting a json struct to map

2014-11-19 Thread Daniel Haviv
Thank you Michael, I will try it out tomorrow. Daniel On 19 Nov 2014, at 21:07, Michael Armbrust mich...@databricks.com wrote: You can override the schema inference by passing a schema as the second argument to jsonRDD, however that's not a super elegant solution. We are considering one

Re: spark streaming and the spark shell

2014-11-19 Thread Tian Zhang
I am hitting the same issue, i.e., after running for some time, if the spark streaming job loses or times out the kafka connection, it will just start to return empty RDDs... Is there a timeline for when this issue will be fixed so that I can plan accordingly? Thanks. Tian -- View this message in

querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-19 Thread Mohammed Guller
Hi - I was curious if anyone is using the Spark SQL Thrift JDBC server with Cassandra. It would be great if you could share how you got it working. For example, what config changes have to be done in hive-site.xml, what additional jars are required, etc.? I have a Spark app that can

rack-topology.sh no such file or directory

2014-11-19 Thread Arun Luthra
I'm trying to run Spark on Yarn on a hortonworks 2.1.5 cluster. I'm getting this error: 14/11/19 13:46:34 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@#/user/Executor#- 2027837001] with ID 42 14/11/19 13:46:34 WARN

Re: Converting a json struct to map

2014-11-19 Thread Yin Huai
Oh, actually, we do not support MapType provided by the schema given to jsonRDD at the moment (my bad..). Daniel, you need to wait for the patch of 4476 (I should have one soon). Thanks, Yin On Wed, Nov 19, 2014 at 2:32 PM, Daniel Haviv danielru...@gmail.com wrote: Thank you Michael I will

Re: rack-topology.sh no such file or directory

2014-11-19 Thread Matei Zaharia
Your Hadoop configuration is set to look for this file to determine racks. Is the file present on cluster nodes? If not, look at your hdfs-site.xml and remove the setting for a rack topology script there (or it might be in core-site.xml). Matei On Nov 19, 2014, at 12:13 PM, Arun Luthra

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
As of now, you can feed Spark Streaming from both kafka and flume. Currently though there is no API to write data back to either of the two directly. I sent a PR which should eventually add something like this:

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
Btw, if you want to write to Spark Streaming from Flume -- there is a sink (it is a part of Spark, not Flume). See Approach 2 here: http://spark.apache.org/docs/latest/streaming-flume-integration.html On Wed, Nov 19, 2014 at 12:41 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: As of
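
A minimal sketch of Approach 2; the host and port are placeholders that must match the Spark Sink configured in the Flume agent:

    import org.apache.spark.streaming.flume.FlumeUtils

    val flumeStream = FlumeUtils.createPollingStream(ssc, "agent-host", 9988)
    flumeStream.map(e => new String(e.event.getBody.array())).print()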

Re: Bug in Accumulators...

2014-11-19 Thread Jake Mannix
I'm running into similar problems with accumulators failing to serialize properly. Are there any examples of accumulators being used in more complex environments than simply initializing them in the same class and then using them in a .foreach() on an RDD referenced a few lines below? From the

Re: Efficient way to split an input data set into different output files

2014-11-19 Thread Nicholas Chammas
I don't have a solution for you, but it sounds like you might want to follow this issue: SPARK-3533 https://issues.apache.org/jira/browse/SPARK-3533 - Add saveAsTextFileByKey() method to RDDs On Wed Nov 19 2014 at 6:41:11 AM Tom Seddon mr.tom.sed...@gmail.com wrote: I'm trying to set up a
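
Until then, one commonly cited Scala workaround (the Python API has no direct equivalent) is a key-based Hadoop output format; the names below are invented:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x

    class OutputByEventType extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name          // one directory per event type
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()                 // keep only the value in the output files
    }

    events.map(e => (e.eventType, e.json))   // events, eventType, json are hypothetical
      .saveAsHadoopFile("/out", classOf[String], classOf[String], classOf[OutputByEventType])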

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Question... when you say different versions, do you mean different versions of dependency files? What are the dependency files for spark? On Tue Nov 18 2014 at 5:27:18 PM Anson Abraham anson.abra...@gmail.com wrote: when cdh cluster was running, i did not set up spark role. When I did for the first

Can we make EdgeRDD and VertexRDD storage level to MEMORY_AND_DISK?

2014-11-19 Thread Harihar Nahak
Hi, I'm running out of memory when I run a GraphX program on a dataset of more than 10 GB. This was handled pretty well by normal spark operations when I used StorageLevel.MEMORY_AND_DISK. In the case of GraphX I found it only allows storing in memory, and that is because in the Graph constructor, this

How to get list of edges between two Vertex?

2014-11-19 Thread Harihar Nahak
Hi, I have a graph where more than one edge between two vertices is possible. Now I need to find out the top vertex pairs with the most calls between them. The output should look like (V1, V2, no. of edges). So I need to know how to find out the total no. of edges between just those two
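
A hedged sketch of counting parallel edges per vertex pair:

    val pairCounts = graph.edges
      .map(e => ((e.srcId, e.dstId), 1))
      .reduceByKey(_ + _)                            // ((V1, V2), number of edges)
    val top10 = pairCounts.top(10)(Ordering.by(_._2))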

Reading nested JSON data with Spark SQL

2014-11-19 Thread Simone Franzini
I have been using Spark SQL to read in JSON data, like so: val myJsonFile = sqc.jsonFile(args(myLocation)) myJsonFile.registerTempTable(myTable) sqc.sql(mySQLQuery).map { row => myFunction(row) } And then in myFunction(row) I can read the various columns with the Row.getX methods. However, this

[SQL]Proper use of spark.sql.thriftserver.scheduler.pool

2014-11-19 Thread Yana Kadiyska
Hi sparkers, I'm trying to use spark.sql.thriftserver.scheduler.pool for the first time (earlier I was stuck because of https://issues.apache.org/jira/browse/SPARK-4037) I have two pools set up: [image: Inline image 1] and would like to issue a query against the low priority pool. I am doing

Re: Reading nested JSON data with Spark SQL

2014-11-19 Thread Michael Armbrust
You can extract the nested fields in sql: SELECT field.nestedField ... If you don't do that then nested fields are represented as rows within rows and can be retrieved as follows: t.getAs[Row](0).getInt(0) Also, I would write t.getAs[Buffer[CharSequence]](12) as t.getAs[Seq[String]](12) since

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
Hi Anson, We've seen this error when incompatible classes are used in the driver and executors (e.g., same class name, but the classes are different and thus the serialized data is different). This can happen for example if you're including some 3rd party libraries in your app's jar, or changing

Re: Strategies for reading large numbers of files

2014-11-19 Thread soojin
Hi Landon, I tried this but it didn't work for me. I get Task not serializable exception: Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration How do you make org.apache.hadoop.conf.Configuration hadoopConfiguration available to tasks? -- View this message in
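
One hedged workaround: Configuration is not serializable, but it is Writable, so it can be wrapped and broadcast (process is a hypothetical function):

    import org.apache.spark.SerializableWritable

    val confBc = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))
    rdd.mapPartitions { iter =>
      val hadoopConf = confBc.value.value   // a usable Configuration on the executor
      iter.map(record => process(record, hadoopConf))
    }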

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you for your answer. I don't know if I typed the question correctly, but your answer helps me. I'm going to ask the question again to check that you understood me. I have this topology: DataSource1, ..., DataSourceN -- Kafka -- SparkStreaming -- HDFS DataSource1, ..., DataSourceN --

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Yeah, but in this case I'm not building any files. I just deployed the config files in CDH 5.2 and initiated a spark-shell to just read and output a file. On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com wrote: Hi Anson, We've seen this error when incompatible classes are used

Re: Spark on YARN

2014-11-19 Thread Sean Owen
I think your config may be the issue then. It sounds like 1 server is configured in a different YARN group that states it has far fewer resources than it actually has. On Wed, Nov 19, 2014 at 5:27 PM, Alan Prando a...@scanboo.com.br wrote: Hi all! Thanks for answering! @Sean, I tried to run with 30

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com wrote: yeah but in this case i'm not building any files. just deployed out config files in CDH5.2 and initiated a spark-shell to just read and output a file. In that case it is a little bit weird. Just to be sure, you are

Re: Reading nested JSON data with Spark SQL

2014-11-19 Thread Simone Franzini
This works great, thank you! Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini On Wed, Nov 19, 2014 at 3:40 PM, Michael Armbrust mich...@databricks.com wrote: You can extract the nested fields in sql: SELECT field.nestedField ... If you don't do that then nested fields are

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
yeah CDH distribution (1.1). On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com wrote: On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com wrote: yeah but in this case i'm not building any files. just deployed out config files in CDH5.2 and initiated a

Spark Standalone Scheduling

2014-11-19 Thread TJ Klein
Hi, I am running some Spark code on my cluster in standalone mode. However, I have noticed that the most powerful machines (32 cores, 192 GB mem) hardly get any tasks, whereas my small machines (8 cores, 128 GB mem) all get plenty of tasks. The resources are all displayed correctly in the WebUI

Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
I would use just textFile unless you actually need a guarantee that you will be seeing a whole file at time (textFile splits on new lines). RDDs are immutable, so you cannot add data to them. You can however union two RDDs, returning a new RDD that contains all the data. On Wed, Nov 19, 2014 at
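
A minimal illustration of the union point (paths invented):

    val a = sc.textFile("/data/day1")
    val b = sc.textFile("/data/day2")
    val all = a.union(b)   // a new RDD containing both; neither input is modified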

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Sorry, meant CDH 5.2 w/ Spark 1.1. On Wed, Nov 19, 2014, 17:41 Anson Abraham anson.abra...@gmail.com wrote: yeah CDH distribution (1.1). On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com wrote: On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com wrote:

PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
I have a class which is a subclass of Tuple2, and I want to use it with PairRDDFunctions. However, I seem to be limited by the invariance of T in RDD[T] (see SPARK-1296 https://issues.apache.org/jira/browse/SPARK-1296). My Scala-fu is weak: the only way I could think to make this work would be to

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Michael Armbrust
I think you should also be able to get away with casting it back and forth in this case using .asInstanceOf. On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I have a class which is a subclass of Tuple2, and I want to use it with PairRDDFunctions. However, I
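
A hedged sketch of the cast-and-rewrap approach; MyPair is a hypothetical subclass of Tuple2[String, Int]:

    import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x
    import org.apache.spark.rdd.RDD

    val asPairs = myPairs.asInstanceOf[RDD[(String, Int)]]            // myPairs: RDD[MyPair]
    val reduced = asPairs.reduceByKey(_ + _)
    val rewrapped = reduced.map { case (k, v) => new MyPair(k, v) }   // back into the subclass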

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Ritesh Kumar Singh
As Marcelo mentioned, the issue occurs mostly when incompatible classes are used by executors or drivers. Check whether the output appears in spark-shell. If yes, then most probably in your case there might be some issue with your configuration files. It will be helpful if you can paste the

Joining DStream with static file

2014-11-19 Thread YaoPau
Here is my attempt: val sparkConf = new SparkConf().setAppName("LogCounter") val ssc = new StreamingContext(sparkConf, Seconds(2)) val sc = new SparkContext() val geoData = sc.textFile("data/geoRegion.csv") .map(_.split(',')) .map(line => (line(0),

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
Casting to Tuple2 is easy, but the output of reduceByKey is presumably a new Tuple2 instance so I'll need to map those to new instances of my class. Not sure how much overhead will be added by the creation of those new instances. If I do that everywhere in my code though, it will make the code

Re: How to incrementally compile spark examples using mvn

2014-11-19 Thread Yiming (John) Zhang
Hi Sean, Thank you for your reply. I was wondering whether there is a method of reusing locally-built components without installing them? That is, if I have successfully built the spark project as a whole, how should I configure it so that I can incrementally build (only) the spark-examples

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Tobias Pfeiffer
Hi, it looks like what you are trying to use as a Tuple cannot be inferred to be a Tuple by the compiler. Try to add type declarations and maybe you will see where things fail. Tobias
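
A hedged illustration with invented pair RDDs: annotating the join result makes the inference failure surface at the declaration instead of at the _1 access:

    import org.apache.spark.rdd.RDD

    // If either input was inferred as Product with Serializable, this annotation
    // will not compile, pointing at the RDD whose element type needs fixing.
    val joined: RDD[(Long, ((String, String), (String, String)))] = visits.join(people)
    joined.map { case (id, (visit, person)) => (id, visit._1, person._2) }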

How to view logs in yarn-client mode?

2014-11-19 Thread innowireless TaeYun Kim
Hi, How can I view logs in yarn-client mode? When I insert the following line in the mapToPair function, for example: System.out.println("TEST TEST"); On local mode, it is displayed on the console. But on yarn-client mode, it does not appear anywhere. When I use the yarn resource manager web UI, the size

Re: Can we make EdgeRDD and VertexRDD storage level to MEMORY_AND_DISK?

2014-11-19 Thread Harihar Nahak
Just figured it out: using the Graph constructor you can pass the storage level for both Edge and Vertex: Graph.fromEdges(edges, defaultValue = ("", ""), StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK) Thanks to this post: https://issues.apache.org/jira/browse/SPARK-1991 - --Harihar

insertIntoTable failure deleted pre-existing _metadata file

2014-11-19 Thread Daniel Haviv
Hello, I'm loading and saving json files into an existing directory with parquet files using the insertIntoTable method. If the method fails for some reason (differences in the schema in my case), the _metadata file of the parquet dir is automatically deleted, rendering the existing parquet

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread akshayhazari
Thanks for replying. I was unable to figure out how, after I use jsonFile/jsonRDD, to load the data into a hive table. Also, I was able to save the SchemaRDD I got via hiveContext.sql(...).saveAsParquetFile(path), i.e. save the SchemaRDD as a parquet file, but when I tried to fetch data from the parquet file

Spark Streaming not working in YARN mode

2014-11-19 Thread kam lee
I created a simple Spark Streaming program - it received numbers and computed averages and sent the results to Kafka. It worked perfectly in local mode as well as standalone master/slave mode across a two-node cluster. It did not work however in yarn-client or yarn-cluster mode. The job was

Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Dai, Kevin
Hi, all Suppose I have an RDD of (K, V) tuples and I do groupBy on it with the key K. My question is how to make each groupBy result, which is (K, Iterable[V]), an RDD. BTW, can we transform it into a DStream in which each groupBy result is an RDD? Best Regards, Kevin.

Naive Bayes classification confidence

2014-11-19 Thread jatinpreet
I have been trying the Naive Bayes implementation of Spark's MLlib. During the testing phase, I wish to eliminate data with low confidence of prediction. My data set primarily consists of form-based documents like reports and application forms. They contain key-value pair type text and hence I assume
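
The 1.x NaiveBayesModel exposes its class log-priors (pi) and log-likelihoods (theta) as public fields, so a hedged sketch for gauging confidence is to compute unnormalized log-posteriors by hand:

    import org.apache.spark.mllib.classification.NaiveBayesModel

    def logPosteriors(model: NaiveBayesModel, features: Array[Double]): Array[Double] =
      model.pi.zip(model.theta).map { case (logPrior, logLikelihoods) =>
        logPrior + logLikelihoods.zip(features).map { case (t, f) => t * f }.sum
      }
    // a small gap between the top two values would indicate a low-confidence prediction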

Re: How to view logs in yarn-client mode?

2014-11-19 Thread Sandy Ryza
While the app is running, you can find logs from the YARN web UI by navigating to containers through the Nodes link. After the app has completed, you can use the YARN logs command: yarn logs -applicationId <your app ID> -Sandy On Wed, Nov 19, 2014 at 6:01 PM, innowireless TaeYun Kim

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread akshayhazari
Sorry about the confusion I created. I have just started learning this week. Silly me, I was actually writing the schema to a txt file and expecting records. This is what I was supposed to do. Also, if you could let me know about adding the data from the jsonFile/jsonRDD methods of hiveContext to hive

Re: How to apply schema to queried data from Hive before saving it as parquet file?

2014-11-19 Thread Daniel Haviv
You can save the results as a parquet file or a text file and create a hive table based on those files. Daniel On 20 Nov 2014, at 08:01, akshayhazari akshayhaz...@gmail.com wrote: Sorry about the confusion I created. I have just started learning this week. Silly me, I was actually

Re: Spark Streaming not working in YARN mode

2014-11-19 Thread Akhil Das
Make sure the executor cores are set to a value which is >= 2 while submitting the job. Thanks Best Regards On Thu, Nov 20, 2014 at 10:36 AM, kam lee cloudher...@gmail.com wrote: I created a simple Spark Streaming program - it received numbers and computed averages and sent the results to
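
The underlying rule: each receiver pins one core, so with a single core nothing is left for batch processing and the job appears to hang. A sketch of the same principle (the app name is invented):

    // at least 2 cores: one for the Kafka receiver, one for processing
    // (on YARN, the equivalent is spark-submit --executor-cores 2 or more)
    val conf = new org.apache.spark.SparkConf()
      .setAppName("StreamingAverages")
      .setMaster("local[2]")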

Re: Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Sean Owen
What's your use case? You would not generally want to make so many small RDDs. On Nov 20, 2014 6:19 AM, Dai, Kevin yun...@ebay.com wrote: Hi, all Suppose I have a RDD of (K, V) tuple and I do groupBy on it with the key K. My question is how to make each groupBy resukt whick is (K,
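
If the key set is small and known up front, a hedged alternative to materializing many tiny RDDs:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val keys = pairs.keys.distinct().collect()
    val perKey = keys.map(k => k -> pairs.filter(_._1 == k).values).toMap
    // one filter pass per key: fine for a handful of keys, expensive for many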

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Akhil Das
You can also look at Amazon's Kinesis if you don't want to handle the pain of maintaining kafka/flume infra. Thanks Best Regards On Thu, Nov 20, 2014 at 3:32 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Thank you for your answer, I don't know if I typed the question correctly. But your

Re: Joining DStream with static file

2014-11-19 Thread Akhil Das
1. You don't have to create another SparkContext; you can simply call *ssc.sparkContext*. 2. Maybe after the transformation on geoData you could do a persist, so next time it will be read from memory. Thanks Best Regards On Thu, Nov 20, 2014 at 6:43 AM, YaoPau jonrgr...@gmail.com wrote:
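
Putting both points together, a hedged rewrite of the original attempt (logStream stands in for whatever DStream of pairs is being joined):

    val geoData = ssc.sparkContext.textFile("data/geoRegion.csv")
      .map(_.split(','))
      .map(line => (line(0), line(1)))
      .cache()   // read and parsed once, reused across batches

    val joined = logStream.transform(rdd => rdd.join(geoData))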

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Guillermo Ortiz
Thank you, but I'm only considering free options. 2014-11-20 7:53 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com: You can also look at Amazon's Kinesis if you don't want to handle the pain of maintaining kafka/flume infra. Thanks Best Regards On Thu, Nov 20, 2014 at 3:32 AM,

Re: How to incrementally compile spark examples using mvn

2014-11-19 Thread Sean Owen
Why not install them? It doesn't take any work and is the only correct way to do it. mvn install is all you need. On Nov 20, 2014 2:35 AM, Yiming (John) Zhang sdi...@gmail.com wrote: Hi Sean, Thank you for your reply. I was wondering whether there is a method of reusing locally-built