Re: problem for submitting job

2015-06-29 Thread Akhil Das
You can create a SparkContext in your program and run it as a standalone application without using spark-submit. Here's something that will get you started: //Create SparkContext val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("Test") .s
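A minimal, self-contained sketch of that pattern (the master URL is taken from the snippet above; the jar path is a hypothetical placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Point the configuration at the standalone master and register the application jar
    // so executors can load your classes (path below is a placeholder).
    val sconf = new SparkConf()
      .setMaster("spark://spark-ak-master:7077")
      .setAppName("Test")
      .setJars(Seq("/path/to/your-app.jar"))

    val sc = new SparkContext(sconf)
    sc.parallelize(1 to 100).count()   // any action, just to verify the context works
    sc.stop()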

Re: spark streaming - checkpoint

2015-06-29 Thread ram kumar
On using yarn-cluster, it works fine. On Mon, Jun 29, 2015 at 12:07 PM, ram kumar wrote: > SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* > in spark-env.sh > > I think I am facing the same issue > https://issues.apache.org/jira/browse/SPARK-6203 > > > > On Mon, Jun 29, 2015 a

SparkContext and JavaSparkContext

2015-06-29 Thread Hao Ren
Hi, I am working on a legacy project using Spark Java code. I have a function which takes an sqlContext as an argument; however, I need a JavaSparkContext in that function. It seems that sqlContext.sparkContext() returns a Scala SparkContext. I did not find any API for casting a scala sparkContext t

Re: Time series data

2015-06-29 Thread tog
Hi Have you tested the Cloudera project: https://github.com/cloudera/spark-timeseries ? Let me know how you progressed on that route, as I am also interested in that topic. Cheers On 26 June 2015 at 14:07, Caio Cesar Trucolo wrote: > Hi everyone! > > I am working with multiple time series

Re: SparkContext and JavaSparkContext

2015-06-29 Thread Hao Ren
It seems that JavaSparkContext is just a wrapper around the Scala SparkContext. In JavaSparkContext, the Scala one is used to do all the work. If I pass the same Scala SparkContext to initialize a JavaSparkContext, I still manipulate the same SparkContext. Sorry for spamming. Hao On Mon, Jun 29, 2015 at
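For reference, a minimal sketch of what that wrapping looks like in practice; JavaSparkContext can be built directly from an existing Scala SparkContext, so no cast is needed (sqlContext is assumed to already exist):

    import org.apache.spark.api.java.JavaSparkContext

    val scalaSc = sqlContext.sparkContext        // the underlying Scala SparkContext
    val jsc = new JavaSparkContext(scalaSc)      // wrap it for the Java-friendly API
    // jsc.sc returns the very same Scala SparkContext, so both views share one context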

Re: kmeans broadcast

2015-06-29 Thread Himanshu Mehra
Hi Haviv, have you tried sc.broadcast(model)? The broadcast method is a member of the SparkContext class. Thanks Himanshu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/kmeans-broadcast-tp23511p23526.html Sent from the Apache Spark User List mailing list arch
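A minimal sketch of that suggestion, assuming a trained KMeansModel named model and an RDD[Vector] named data:

    // Broadcast the trained model once; read it back via .value inside closures.
    val modelBC = sc.broadcast(model)
    val predictions = data.map(point => modelBC.value.predict(point))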

got "java.lang.reflect.UndeclaredThrowableException" when running multiply APPs in spark

2015-06-29 Thread luohui20001
Hi there, I am running 30 apps in my Spark cluster, and some of the apps got exceptions like the one below: [root@slave3 0]# cat stderr 15/06/29 17:20:08 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 15/06/29 17:20:09 WARN util.NativeCodeLoader: Unable to

Re: Union of many RDDs taking a long time

2015-06-29 Thread Tomasz Fruboes
Hi Matt, is there a reason you need to call coalesce every loop iteration? Most likely it forces Spark to do lots of unnecessary shuffles. Also, for a really large number of inputs this approach can cause problems due to too many nested RDD.union calls. A safer approach is to call union from SparkCon
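A minimal sketch of the safer approach, assuming the per-iteration RDDs have been collected into a Seq named rdds:

    // One SparkContext.union call keeps the lineage shallow, unlike chaining
    // rdd1.union(rdd2).union(rdd3)... inside a loop.
    val combined = sc.union(rdds)
    val result = combined.coalesce(100)   // if needed, coalesce once at the end, not per iteration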

Re: Scala problem when using g.vertices.map "not a member of type parameter"

2015-06-29 Thread Robineast
I can't see an obvious problem. Could you post the full minimal code that reproduces the problem? Also, which version of Spark and Scala are you using? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Scala-problem-when-using-g-vertices-map-not-a-member-of-type-

Spark Streaming-Receiver drops data

2015-06-29 Thread summerdaway
Hi, I'm using spark streaming to process data. I do a simple flatMap on each record as follows package bb; import java.io.*; import java.net.*; import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException; import java.net.URI; import java.util.List; import java.util.Array

RE: spark-submit in deployment mode with the "--jars" option

2015-06-29 Thread Hisham Mohamed
Hi Akhil, Thanks for your reply. Here it is Launch Command: "/usr/lib/jvm/java-7-oracle/jre/bin/java" "-cp" "/etc/spark/:/opt/spark/lib/spark-assembly-1.4.0-hadoop2.3.0.jar:/opt/spark-1.4.0-bin-hadoop2.3/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.4.0-bin -hadoop2.3/lib/datanucleus-api-jdo-3.2

Serialization Exception

2015-06-29 Thread Spark Enthusiast
For prototyping purposes, I created a test program injecting dependencies using Spring. Nothing fancy. This is just a re-write of KafkaDirectWordCount. When I run this, I get the following exception: Exception in thread "main" org.apache.spark.SparkException: Task not serializable     at org.

load Java properties file in Spark

2015-06-29 Thread diplomatic Guru
I want to store the Spark application arguments such as input file and output file in a Java properties file and pass that file to the Spark driver. I'm using spark-submit for submitting the job but couldn't find a parameter to pass the properties file. Have you got any suggestions?

Directory creation failed leads to job fail (should it?)

2015-06-29 Thread maxdml
Hi there, I have some traces from my master and some workers where for some reason, the ./work directory of an application can not be created on the workers. There is also an issue with the master's temp directory creation. master logs: http://pastebin.com/v3NCzm0u worker's logs: http://pastebin.

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread ayan guha
No, Spark can not do that as it does not replicate partitions (so no retry on a different worker). It seems your cluster is not provisioned with correct permissions. I would suggest automating node provisioning. On Mon, Jun 29, 2015 at 11:04 PM, maxdml wrote: > Hi there, > > I have some traces fr

RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-29 Thread prajod.vettiyattil
Hi, Any {fan-out -> process in parallel -> fan-in -> aggregate} pattern of data flow can conceptually be Map-Reduce (MR, as it is done in Hadoop). Apart from the bigger list of map, reduce, sort, filter, pipe, join, combine, ... functions, which are many times more efficient and productive for de

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
I'd like to toss out another idea that doesn't involve a complete end-to-end Kerberos implementation. Essentially, have the driver authenticate to Kerberos, instantiate a Hadoop file system, and serialize/cache it for the executors to use instead of them having to instantiate their own. - Dri

Re: Dependency Injection with Spark Java

2015-06-29 Thread Michal Čizmazia
Currently, I am considering using Guava Suppliers for delayed initialization in workers: Supplier supplier = (Serializable & Supplier) () -> new T(); Supplier singleton = Suppliers.memoize(supplier); On 26 June 2015 at 13:17, Igor Berman wrote: > asked myself same question today...actually de
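A common Scala-side equivalent of that delayed initialization, shown only for comparison (HeavyClient is a hypothetical non-serializable resource; rdd is an assumed RDD[String]):

    // Stand-in for e.g. a DB or HTTP client that cannot be serialized from the driver.
    class HeavyClient { def process(s: String): String = s.toUpperCase }

    // A singleton object is loaded once per executor JVM; the lazy val is created
    // on first use on the worker instead of being shipped from the driver.
    object WorkerResources {
      lazy val client = new HeavyClient()
    }

    val processed = rdd.mapPartitions { iter =>
      val c = WorkerResources.client
      iter.map(c.process)
    }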

Re: Programming with java on spark

2015-06-29 Thread 付雅丹
Hi, Akhil. Thank you for your reply. I tried what you suggested, but it raises the following error. The source code is: JavaPairRDD distFile=sc.hadoopFile( "hdfs://cMaster:9000/wcinput/data.txt", DataInputFormat.class,LongWritable.class,Text.class); while DataInputFormat class is defined as this: cl

Re: How to recover in case user errors in streaming

2015-06-29 Thread Amit Assudani
Thanks TD, this helps. Looking forward to a fix where the framework handles batch failures via callback methods. This would avoid having to write try/catch around every transformation / action. Regards, Amit From: Tathagata Das mailto:t...@databricks.com>> Date: Saturday, June 27, 2015 at

Re: spark streaming with kafka reset offset

2015-06-29 Thread Cody Koeninger
3. You need to use your own method, because you need to set up your job. Read the checkpoint documentation. 4. Yes, if you want to checkpoint, you need to specify a url to store the checkpoint at (s3 or hdfs). Yes, for the direct stream checkpoint it's just offsets, not all the messages. On Sun
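A minimal sketch of the getOrCreate pattern being referred to (checkpoint URL, batch interval, and sparkConf are assumptions; the Kafka stream setup is elided):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/me/checkpoints"   // assumed durable location

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sparkConf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // build the direct Kafka stream and all transformations here
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()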

Re: Fine control with sc.sequenceFile

2015-06-29 Thread Koert Kuipers
see also: https://github.com/apache/spark/pull/6848 On Mon, Jun 29, 2015 at 12:48 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", > "67108864") > > sc.sequenceFile(getMostRecentDirectory(tablePath, _.startsWith("_")).get > + "/*", classO

Re: spark streaming with kafka reset offset

2015-06-29 Thread ayan guha
Hi Let me take a shot at your questions. (I am sure people like Cody and TD will correct me if I am wrong.) 0. This is an exact copy from the similar question in the mail thread from Akhil D: Since you set local[4] you will have 4 threads for your computation, and since you are having 2 receivers, you are le

Schema for type is not supported

2015-06-29 Thread Sander van Dijk
Hey all, I try to make a DataFrame by inspection (using Spark 1.4.0), but run into a parameter of my case class not being supported. Minimal example: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import com.vividsolutions.jts.geom.Coordinate case class Fo

Re: load Java properties file in Spark

2015-06-29 Thread ayan guha
You may want to store the properties file either locally or, if you intend to launch it in YARN mode, then in HDFS (as you do not know which node will become your AM). On Mon, Jun 29, 2015 at 10:51 PM, diplomatic Guru wrote: > I want to store the Spark application arguments such as input file, output >
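On the driver side, reading such a file is plain java.util.Properties work; a minimal sketch with a hypothetical path and keys:

    import java.io.FileInputStream
    import java.util.Properties

    val props = new Properties()
    val in = new FileInputStream("/path/to/app.properties")   // placeholder path
    try props.load(in) finally in.close()

    val inputPath  = props.getProperty("input.path")    // placeholder keys
    val outputPath = props.getProperty("output.path")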

Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi, I'm trying to run Spark without Hadoop, where the data would be read and written to local disk. For this I have a few questions - 1. Which download do I need to use? In the download options I don't see any binary download which does not need Hadoop. Is the only way to do this to download the sourc

Re: spark streaming with kafka reset offset

2015-06-29 Thread Shushant Arora
1. Here you are basically creating 2 receivers and asking each of them to consume 3 kafka partitions each. - In 1.2 we have high level consumers, so how can we restrict the number of Kafka partitions to consume from? Say I have 300 Kafka partitions in a Kafka topic, and as above I gave 2 receivers and 3 ka

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Steve Loughran
On 29 Jun 2015, at 14:18, Dave Ariens mailto:dari...@blackberry.com>> wrote: I'd like to toss out another idea that doesn't involve a complete end-to-end Kerberos implementation. Essentially, have the driver authenticate to Kerberos, instantiate a Hadoop file system, and serialize/cache it f

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread ayan guha
Hi You really do not need a Hadoop installation. You can download a pre-built version with any Hadoop and unzip it and you are good to go. Yes, it may complain while launching master and workers; safely ignore them. The only problem is while writing to a directory. Of course you will not be able to us

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
Thanks, Steve--I should have tested out this theory before spamming the list. I haven't been able to get anything working after testing this theory out. I'll hit up the Spark dev mailing list and try to garner enough interest to get some Jira's cut. I really appreciate everyone's feedback, tha

SPARK REMOTE DEBUG

2015-06-29 Thread Pietro Gentile
Hi all, What is the best way to remotely debug, with breakpoints, spark apps? Thanks in advance, Best regards! Pietro

Spark shell crumbles after memory is full

2015-06-29 Thread hbogert
I'm running a query from the BigDataBenchmark, query 1B to be precise. When running this with Spark (1.3.1)+ mesos(0.21) in coarse grained mode with 5 mesos slave, through a spark shell, all is well. However rerunning the query a few times: scala> sqlContext.sql("SELECT pageURL, pageRank FROM

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Ted Yu
Sourav: Please see https://spark.apache.org/docs/latest/spark-standalone.html Cheers On Mon, Jun 29, 2015 at 7:33 AM, ayan guha wrote: > Hi > > You really donot need hadoop installation. You can dowsload a pre-built > version with any hadoop and unzip it and you are good to go. Yes it may > com

Re: How to recover in case user errors in streaming

2015-06-29 Thread Amit Assudani
Also, how do you suggest catching exceptions while using with connector API like, saveAsNewAPIHadoopFiles ? From: amit assudani mailto:aassud...@impetus.com>> Date: Monday, June 29, 2015 at 9:55 AM To: Tathagata Das mailto:t...@databricks.com>> Cc: Cody Koeninger mailto:c...@koeninger.org>>, "us

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread Max Demoulin
The underlying issue is a filesystem corruption on the workers. In the case where I use HDFS, with a sufficient number of replicas, would Spark try to launch a task on another node where a block replica is present? Thanks :-) -- Henri Maxime Demoulin 2015-06-29 9:10 GMT-04:00 ayan guha : > No

Re: SPARK REMOTE DEBUG

2015-06-29 Thread Ted Yu
Please read: http://search-hadoop.com/m/q3RTtig4WhHTdMa1 FYI On Mon, Jun 29, 2015 at 8:37 AM, Pietro Gentile < pietro.gentile89.develo...@gmail.com> wrote: > Hi all, > > What is the best way to remotely debug, with breakpoints, spark apps? > > > Thanks in advance, > Best regards! > > Pietro >
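The usual approach (a sketch, not taken from the linked thread): start the driver JVM with a JDWP agent and attach the IDE's remote debugger to that port.

    spark-submit \
      --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
      --class com.example.MyApp my-app.jar   # class and jar are placeholders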

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread ayan guha
It's a scheduler question. Spark will retry the task on the same worker. From Spark's standpoint, data is not replicated because Spark provides fault tolerance by lineage, not by replication. On 30 Jun 2015 01:50, "Max Demoulin" wrote: > The underlying issue is a filesystem corruption on the worker

Re: problem for submitting job

2015-06-29 Thread Akhil Das
Cool. On 29 Jun 2015 21:10, "郭谦" wrote: > Akhil Das, > > You give me a new idea to solve the problem. > > Vova provides me a way to solve the problem just before > > Vova Shelgunov > > Sample code for submitting job from any other java app, e.g. servlet: > > http://pastebin.com/X1S28ivJ > > I app

Re: Spark shell crumbles after memory is full

2015-06-29 Thread ayan guha
When you call collect, you are bringing the whole dataset back to driver memory. On 30 Jun 2015 01:43, "hbogert" wrote: > I'm running a query from the BigDataBenchmark, query 1B to be precise. > > When running this with Spark (1.3.1)+ mesos(0.21) in coarse grained mode > with 5 mesos slave, through a

Re: Spark shell crumbles after memory is full

2015-06-29 Thread Mark Hamstra
No. He is collecting the results of the SQL query, not the whole dataset. The REPL does retain references to prior results, so it's not really the best tool to be using when you want no-longer-needed results to be automatically garbage collected. On Mon, Jun 29, 2015 at 9:13 AM, ayan guha wrote:

[SparkR] Missing Spark APIs in R

2015-06-29 Thread Pradeep Bashyal
Hello, I noticed that some of the spark-core APIs are not available with version 1.4.0 release of SparkR. For example textFile(), flatMap() etc. The code seems to be there but is not exported in NAMESPACE. They were all available as part of the AmpLab Extras previously. I wasn't able to find any e

Re: Directory creation failed leads to job fail (should it?)

2015-06-29 Thread Max Demoulin
I see. Thank you for your help! -- Henri Maxime Demoulin 2015-06-29 11:57 GMT-04:00 ayan guha : > It's a scheduler question. Spark will retry the task on the same worker. > From spark standpoint data is not replicated because spark provides fault > tolerance but lineage not by replication. > On

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Actually, Hadoop InputFormats can still be used to read and write from "file://", "s3n://", and similar schemes. You just won't be able to read/write to HDFS without installing Hadoop and setting up an HDFS cluster. To summarize: Sourav, you can use any of the prebuilt packages (i.e. anything othe

Load Multiple DB Table - Spark SQL

2015-06-29 Thread Ashish Soni
Hi All, What is the best possible way to load multiple data tables using Spark SQL? Map options = new HashMap<>(); options.put("driver", MYSQLDR); options.put("url", MYSQL_CN_URL); options.put("dbtable", "(select * from courses)"); *Can I add multiple tables to the options map? options.put("dbtable1"

Re: Spark shell crumbles after memory is full

2015-06-29 Thread Hans van den Bogert
Would there be a way to force the 'old' data out? Because at this point I'll have to restart the shell every couple of queries to get meaningful timings which are comparable to spark-submit. On Jun 29, 2015 6:20 PM, "Mark Hamstra" wrote: > No. He is collecting the results of the SQL query, not

Re: Load Multiple DB Table - Spark SQL

2015-06-29 Thread Ted Yu
To my knowledge this is not supported. On Mon, Jun 29, 2015 at 10:47 AM, Ashish Soni wrote: > > Hi All , > > What is the best possible way to load multiple data tables using spark sql > > Map options = new HashMap<>(); > options.put("driver", MYSQLDR); > options.put("url", MYSQL_CN_URL); > optio
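Since each options map takes a single dbtable, a workable sketch is one DataFrame per table (URL, driver, and table names below are placeholders; Spark 1.4 JDBC reader):

    val courses = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/school")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "courses")
      .load()

    val students = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/school")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "students")
      .load()

    // Register both so they can be joined with plain SQL.
    courses.registerTempTable("courses")
    students.registerTempTable("students")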

Re: [SparkR] Missing Spark APIs in R

2015-06-29 Thread Shivaram Venkataraman
The RDD API is pretty complex and we are not yet sure we want to export all those methods in the SparkR API. We are working towards exposing a more limited API in upcoming versions. You can find some more details in the recent Spark Summit talk at https://spark-summit.org/2015/events/sparkr-the-pas

SparkSQL built in functions

2015-06-29 Thread Bob Corsaro
I'm having trouble using "select pow(col) from table". It seems the function is not registered for SparkSQL. Is this on purpose or an oversight? I'm using pyspark.

s3 bucket access/read file

2015-06-29 Thread didi
Hi *Can't read a text file from S3 to create an RDD* after setting the configuration val hadoopConf=sparkContext.hadoopConfiguration; hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") hadoopConf.set("fs.s3.awsAccessKeyId",yourAccessKey) hadoopConf.set("fs.s3.awsSecretAcc

Re: s3 bucket access/read file

2015-06-29 Thread spark user
Pls check your ACL properties. On Monday, June 29, 2015 11:29 AM, didi wrote: Hi *Cant read text file from s3 to create RDD * after setting the configuration val hadoopConf=sparkContext.hadoopConfiguration; hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSys

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey, Thanks for your inputs. Probably I'm getting the error as I'm trying to read a CSV file from a local file using the com.databricks.spark.csv package. Probably this package has a hard-coded dependency on Hadoop as it is trying to read the input format from HadoopRDD. Can you please confirm? Here is wha

Applying functions over certain count of tuples .

2015-06-29 Thread anshu shukla
I want to apply some logic on the basis of a FIXED count of tuples in each RDD. *Suppose we emit one RDD for every 5 tuples of the previous RDD.* -- Thanks & Regards, Anshu Shukla

Re: Applying functions over certain count of tuples .

2015-06-29 Thread Richard Marscher
Hi, not sure what the context is but I think you can do something similar with mapPartitions: rdd.mapPartitions { iterator => iterator.grouped(5).map { tupleGroup => emitOneRddForGroup(tupleGroup) } } The edge case is when the final grouping doesn't have exactly 5 items, if that matters. On
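Filled out slightly, under the assumption that rdd is an RDD[Int] and that "emitting one record per group" can be stood in for by a simple per-group sum:

    // Batch each partition's elements into groups of 5 and emit one value per group.
    // The last group in a partition may hold fewer than 5 elements.
    val perGroup = rdd.mapPartitions { iterator =>
      iterator.grouped(5).map(group => group.sum)
    }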

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Hi Sourav, The error seems to be caused by the fact that your URL starts with "file://" instead of "file:///". Also, I believe the current version of the package for Spark 1.4 with Scala 2.11 should be "com.databricks:spark-csv_2.11:1.1.0". -Jey On Mon, Jun 29, 2015 at 12:23 PM, Sourav Mazumder

is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-06-29 Thread Axel Dahl
In pyspark, when I convert from rdds to dataframes it looks like the rdd is being materialized/collected/repartitioned before it's converted to a dataframe. Just wondering if there's any guidelines for doing this conversion and whether it's best to do it early to get the performance benefits of da

Re: SparkSQL built in functions

2015-06-29 Thread Salih Oztop
Hi Bob, I tested your scenario with Spark 1.3 and I assumed you did not miss the second parameter of pow(x,y). from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.jsonFile("/vagrant/people.json") # Displays the content of the DataFrame to stdout df.show() # These are all fine

Re: SparkSQL built in functions

2015-06-29 Thread Bob Corsaro
1.4, and I did set the second parameter. The DSL works fine but trying it with SQL doesn't. On Mon, Jun 29, 2015, 4:32 PM Salih Oztop wrote: > Hi Bob, > I tested your scenario with Spark 1.3 and I assumed you did not miss the > second parameter of pow(x,y) > > from pyspark.sql import SQLContext

Re: SparkSQL built in functions

2015-06-29 Thread Krishna Sankar
Interesting. Looking at the definitions, sql.functions.pow is defined only for (col,col). Just as an experiment, create a column with value 2 and see if that works. Cheers On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro wrote: > 1.4 and I did set the second parameter. The DSL works fine but trying

Job failed but there is no proper reason

2015-06-29 Thread ๏̯͡๏
I have a join + leftOuterJoin + reduceByKey. The join operation worked but the left outer join failed. There are millions of log lines, and when I did this ~ dvasthimal$ cat ~/Desktop/errors | grep " ERROR " | grep -v "client.TransportResponseHandler" | grep -v "shuffle.RetryingBlockFetcher" | grep -

Re: How to recover in case user errors in streaming

2015-06-29 Thread Tathagata Das
I recommend writing using dstream.foreachRDD, and then rdd.saveAsNewAPIHadoopFile inside try catch. See the implementation of dstream.saveAsNewAPIHadoopFiles https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L716 On
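A minimal sketch of that suggestion, assuming dstream is a DStream of (Text, Text) pairs and using a placeholder output path:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    dstream.foreachRDD { (rdd, time) =>
      try {
        rdd.saveAsNewAPIHadoopFile(
          s"hdfs:///output/prefix-${time.milliseconds}",
          classOf[Text], classOf[Text],
          classOf[TextOutputFormat[Text, Text]])
      } catch {
        case e: Exception =>
          // log and decide whether to retry or skip this batch
          println(s"Batch at $time failed: ${e.getMessage}")
      }
    }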

Checkpoint support?

2015-06-29 Thread ๏̯͡๏
My job has multiple stages; each time a stage fails I have to restart the entire app. I understand Spark restarts failed tasks. However, is there a way to restart a Spark app from the failed stage? -- Deepak

Checkpoint FS failure or connectivity issue

2015-06-29 Thread Amit Assudani
Hi All, While using checkpoints (using HDFS), if connectivity to the hadoop cluster is lost for a while and gets restored in some time, what happens to the running streaming job? Is it always assumed that the connection to the checkpoint FS (in this case HDFS) would ALWAYS be HA and would never fail for

Re: SparkSQL built in functions

2015-06-29 Thread Salih Oztop
Hi, tested with Spark 1.4. We need to import pow, otherwise it uses the Python version of pow, I guess. >>> from pyspark.sql.functions import pow >>> df.select(pow(df.age, df.age)).show() 15/06/29 22:36:05 INFO Ta... | POWER(age, age) | | null | | 2.05891132094649E44 |

breeze.linalg.DenseMatrix not found

2015-06-29 Thread AlexG
I'm trying to compute the eigendecomposition of a matrix in a portion of my code, using mllib.linalg.EigenValueDecomposition (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/EigenValueDecomposition.scala ) as follows: val tol = 1e-10 val maxIter

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey, Not much luck. If I use the class com.databricks:spark-csv_2.11:1.1.0 or com.databricks.spark.csv_2.11.1.1.0 I get a class not found error. With com.databricks.spark.csv I don't get the class not found error, but I still get the previous error even after using file:/// in the URI. Regar

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-29 Thread Srikanth
My error was related to Scala version. Upon further reading, I realized that it takes some effort to get Spark working with Scala 2.11. I've reverted to using 2.10 and moved past that error. Now I hit the issue you mentioned. Waiting for 1.4.1. Srikanth On Fri, Jun 26, 2015 at 9:10 AM, Roberto Co

Re: breeze.linalg.DenseMatrix not found

2015-06-29 Thread AlexG
I get the same error even when I define covOperator not to use a matrix at all: def covOperator(v : BDV[Double]) :BDV[Double] = { v } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/breeze-linalg-DenseMatrix-not-found-tp23537p23538.html Sent from the Apach

Need clarification on spark on cluster set up instruction

2015-06-29 Thread manish ranjan
Hi All, here goes my first question. Here is my use case: I have 1 TB of data I want to process on EC2 using Spark. I have uploaded the data to an EBS volume. The instructions on the Amazon EC2 setup explain: "*If your application needs to access large datasets, the fastest way to do that is to load them from

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
The format is still "com.databricks.spark.csv", but the parameter passed to spark-shell is "--packages com.databricks:spark-csv_2.11:1.1.0". On Mon, Jun 29, 2015 at 2:59 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > HI Jey, > > Not much of luck. > > If I use the class com.databricks
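Putting the two pieces together, a hedged sketch of the shell session (local path and header option are assumptions):

    // started with: spark-shell --packages com.databricks:spark-csv_2.11:1.1.0
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")            // assumes the CSV has a header row
      .load("file:///home/me/data.csv")    // three slashes for a local absolute path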

spark streaming HDFS file issue

2015-06-29 Thread ravi tella
I am running a spark streaming example from the Learning Spark book with one change. The change I made was for streaming a file from HDFS: val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input") I ran the application a number of times and every time dropped a new file in the input dir

Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Hello, It is my understanding that shuffles are written to disk and that they act as checkpoints. I wonder if this is true only within a job, or across jobs. Please note that I use the words job and stage carefully here. 1. Can a shuffle created during JobN be used to skip many stages from JobN+1

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Ah, for #3, maybe this is what *rdd.checkpoint *does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD Thomas On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber wrote: > Hello, > > It is my understanding that shuffle are written on disk and that they act > as chec

Re: How to recover in case user errors in streaming

2015-06-29 Thread Sea
Hi TD, In my code, I write like this: dstream.foreachRDD { rdd => try { } catch { } } It will still throw the exception, and the driver will be killed... I need to catch exceptions in rdd.foreachPartition just like this, so I need to retry by myself... dstream.foreach

Re: Shuffle files lifecycle

2015-06-29 Thread Silvio Fiorito
Regarding 1 and 2, yes shuffle output is stored on the worker local disks and will be reused across jobs as long as they’re available. You can identify when they’re used by seeing skipped stages in the job UI. They are periodically cleaned up based on available space of the configured spark.loca

Re: spark streaming HDFS file issue

2015-06-29 Thread bit1...@163.com
What do you mean by "new file", do you upload an already existing file onto HDFS or create a new one locally and then upload it to HDFS? bit1...@163.com From: ravi tella Date: 2015-06-30 09:59 To: user Subject: spark streaming HDFS file issue I am running a spark streaming example from learnin

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Thanks Silvio. On Mon, Jun 29, 2015 at 7:41 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Regarding 1 and 2, yes shuffle output is stored on the worker local > disks and will be reused across jobs as long as they’re available. You can > identify when they’re used by seeing skipp

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-29 Thread Thomas Gerber
It seems the root cause of the delay was the sheer size of the DAG for those jobs, which are towards the end of a long series of jobs. To reduce it, you can probably try to checkpoint (rdd.checkpoint) some previous RDDs. That will: 1. save the RDD on disk 2. remove all references to the parents of
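A minimal sketch of that checkpointing step (directory, input RDD, and transformation are placeholders):

    sc.setCheckpointDir("hdfs:///user/me/checkpoints")

    val stage = previousRdd.map(transform).cache()   // cache so the checkpoint write doesn't recompute
    stage.checkpoint()                               // marks the RDD; written out on the next action
    stage.count()                                    // forces materialization and truncates the lineage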

Re: Checkpoint FS failure or connectivity issue

2015-06-29 Thread Tathagata Das
Yes, the observation is correct. That connectivity is assumed to be HA. On Mon, Jun 29, 2015 at 2:34 PM, Amit Assudani wrote: > Hi All, > > While using Checkpoints ( using HDFS ), if connectivity to hadoop > cluster is lost for a while and gets restored in some time, what happens to > the run

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Sourav Mazumder
Hi Jey, This solves the class not found problem. Thanks. But still the input format is not yet resolved. It looks like it is still trying to create a HadoopRDD; I don't know why. The error message goes like - java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.Refl

Re: Job failed but there is no proper reason

2015-06-29 Thread Matthew Jones
Order the tasks by status and see if there are any with status failed. On Mon, 29 Jun 2015 at 2:26 pm ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Attached pic shows the error that is displayed in Spark UI. > > > On Mon, Jun 29, 2015 at 2:22 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > >> I have a join + leftOuterJoin + reduceByKey.

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
All InputFormats will use HadoopRDD or NewHadoopRDD. Do you use "file:///" instead of "file://"? On Mon, Jun 29, 2015 at 8:40 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > Hi Jey, > > This solves the class not found problem. Thanks. > > But still the inputs format is not yet resolve

Re: Job failed but there is no proper reason

2015-06-29 Thread ๏̯͡๏
There are many with failed. Attached pic shows exception traces. On Mon, Jun 29, 2015 at 9:07 PM, Matthew Jones wrote: > Order the tasks by status and see if there are any with status failed. > On Mon, 29 Jun 2015 at 2:26 pm ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > >> Attached pic shows the error that is display

Re: Job failed but there is no proper reason

2015-06-29 Thread Matthew Jones
I can't see any failed tasks in the attached pic. On Mon, Jun 29, 2015 at 9:31 PM ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > There are many with failed. Attached pic shows exception traces. > > On Mon, Jun 29, 2015 at 9:07 PM, Matthew Jones wrote: > >> Order the tasks by status and see if there are any with status

Spark 1.4.0: read.df() causes excessive IO

2015-06-29 Thread Exie
Hi Folks, I just stepped up from 1.3.1 to 1.4.0, the most notable difference for me so far is the data frame reader/writer. Previously: val myData = hiveContext.load("s3n://someBucket/somePath/","parquet") Now: val myData = hiveContext.read.parquet("s3n://someBucket/somePath") Using the ori

Error while installing spark

2015-06-29 Thread Chintan Bhatt
Facing the following error message while performing sbt/sbt assembly: Error occurred during initialization of VM Could not reserve enough space for object heap Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. Please give me a solution for it. --

Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread SLiZn Liu
Hey Spark Users, I'm writing a demo with Spark and HBase. What I've done is package a **fat jar**: place dependencies in `build.sbt`, and use `sbt assembly` to package **all dependencies** into one big jar. The rest of the work is to copy the fat jar to the Spark master node and then launch it with `spark-submit`.

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread Burak Yavuz
You can pass `--packages your:comma-separated:maven-dependencies` to spark submit if you have Spark 1.3 or greater. Best regards, Burak On Mon, Jun 29, 2015 at 10:46 PM, SLiZn Liu wrote: > Hey Spark Users, > > I'm writing a demo with Spark and HBase. What I've done is packaging a > **fat jar**:
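For example (a sketch; the class, jar, and HBase coordinates below are placeholders in the usual groupId:artifactId:version form):

    spark-submit --master spark://master:7077 \
      --packages org.apache.hbase:hbase-client:1.0.1,org.apache.hbase:hbase-common:1.0.1 \
      --class com.example.HBaseDemo your-app.jar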

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread SLiZn Liu
Hi Burak, Is the `--packages` flag only available for Maven coordinates, with no sbt support? On Tue, Jun 30, 2015 at 2:26 PM Burak Yavuz wrote: > You can pass `--packages your:comma-separated:maven-dependencies` to spark > submit if you have Spark 1.3 or greater. > > Best regards, > Burak > > On Mon, Jun 29, 2015 a