Updating a graph based on changed input

2015-09-23 Thread aparasur
Hi, I am fairly new to Spark GraphX. I am using GraphX to create a graph derived from data in the range of 500 GB. The inputs used to create this large graph come from a set of files with predefined space-separated constructs. I also understand that each time, the graph will be constructed

Can DataFrames with different schema be joined efficiently

2015-09-23 Thread MrJew
Hello, I'm using Spark Streaming to handle a fairly big data flow. I'm solving a problem where we are inferring the type from the data (we need more specific data types than what JSON provides), and quite often there is a small difference between the schemas that we get. Saving to parquet files

Re: Cosine LSH Join

2015-09-23 Thread Nick Pentreath
Looks interesting - I've been trying out a few of the ANN / LSH packages on spark-packages.org and elsewhere e.g. http://spark-packages.org/package/tdebatty/spark-knn-graphs and https://github.com/marufaytekin/lsh-spark How does this compare? Perhaps you could put it up on spark-packages to get

unsubscribe

2015-09-23 Thread Ntale Lukama

Re: JdbcRDD Constructor

2015-09-23 Thread Rishitesh Mishra
Which version of Spark are you using? I can get correct results using JdbcRDD. In fact there is a test suite precisely for this (JdbcRDDSuite). I changed it according to your input and got correct results from this test suite. On Wed, Sep 23, 2015 at 11:00 AM, satish chandra j

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Tathagata, Simple batch jobs do work. The cluster has a good set of resources and a limited input volume on the given Kafka topic. The job works on the small 3-node standalone-configured cluster I have set up for testing. Regards, Bryan Jeffrey -Original Message- From: "Tathagata Das"

Re: JdbcRDD Constructor

2015-09-23 Thread Rishitesh Mishra
I am using Spark 1.5. I always get count = 100, irrespective of num partitions. On Wed, Sep 23, 2015 at 5:00 PM, satish chandra j wrote: > HI, > Currently using Spark 1.2.2, could you please let me know correct results > output count which you got it by using

Re: spark on mesos gets killed by cgroups for too much memory

2015-09-23 Thread Dick Davies
I haven't seen that much memory overhead; I think my default is 512 MB (just a small test stack) on Spark 1.4.x, and I can run simple Monte Carlo simulations without the 'spike' of RAM usage when they deploy. I'd assume something you're using is grabbing a lot of VM up front - one option you might

Re: Py4j issue with Python Kafka Module

2015-09-23 Thread ayan guha
Thanks guys. On Wed, Sep 23, 2015 at 3:54 PM, Tathagata Das wrote: > SPARK_CLASSPATH is I believe deprecated right now. So I am not surprised > that there is some difference in the code paths. > > On Tue, Sep 22, 2015 at 9:45 PM, Saisai Shao >

Re: JdbcRDD Constructor

2015-09-23 Thread satish chandra j
Hi, could anybody provide inputs if they have come across a similar issue? @Rishitesh, could you provide any sample code for using JdbcRDDSuite? Regards, Satish Chandra On Wed, Sep 23, 2015 at 5:14 PM, Rishitesh Mishra wrote: > I am using Spark 1.5. I always get count =

Re: JdbcRDD Constructor

2015-09-23 Thread satish chandra j
Hi, I am currently using Spark 1.2.2. Could you please let me know the correct output count which you got by using JdbcRDDSuite? Regards, Satish Chandra On Wed, Sep 23, 2015 at 4:02 PM, Rishitesh Mishra wrote: > Which version of Spark you are using ?? I can get

RE: Why is 1 executor overworked and other sit idle?

2015-09-23 Thread Richard Eggert
Reading from Cassandra and mapping to CSV are likely getting divided among executors. Reading from Cassandra is relatively cheap and mapping to CSV is trivial, but coalescing to a single partition is fairly expensive: it funnels the processing to a single executor, and writing out

Cosine LSH Join

2015-09-23 Thread Demir
We've just open-sourced an LSH implementation on Spark. We're using this internally to find top-K neighbors after a matrix factorization. We hope that this might be of use to others: https://github.com/soundcloud/cosine-lsh-join-spark For those wondering: LSH is a technique to quickly

Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
I'd like the Spark application to stop gracefully when it receives a kill signal, so I added this code: sys.ShutdownHookThread { println("Gracefully stopping Spark Streaming Application") ssc.stop(stopSparkContext = true, stopGracefully = true) println("Application

How to turn off Jetty Http stack errors on Spark web

2015-09-23 Thread Rafal Grzymkowski
Hi, is it possible to disable the Jetty stack trace shown with errors on Spark master:8080? When I trigger an HTTP 500 server error, anyone can read the details. I tried the options available in log4j.properties but it doesn't help. Any hint? Thank you for the answer. MyCo

Re: unsubscribe

2015-09-23 Thread Akhil Das
To unsubscribe, you need to send an email to user-unsubscr...@spark.apache.org as described here http://spark.apache.org/community.html Thanks Best Regards On Wed, Sep 23, 2015 at 1:23 AM, Stuart Layton wrote: > > > -- > Stuart Layton >

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Uthayan Suthakar
Thank you Tathagata for your response. It makes sense to use MEMORY_AND_DISK. But sometimes when I start the job it does not cache everything at the start; it only caches 90%. The LRU scheme should only take effect after a while, when the data is not in use, so why is it failing to cache the data at

Re: Has anyone used the Twitter API for location filtering?

2015-09-23 Thread Akhil Das
I just tried it and very few tweets have the .getPlace and .getGeoLocation data available. [image: Inline image 1] I guess this is more of an issue with the Twitter API. Thanks Best Regards On Tue, Sep 22, 2015 at 11:35 PM, Jo Sunad wrote: > Thanks Akhil, but I

Re: K Means Explanation

2015-09-23 Thread Sabarish Sasidharan
You can't obtain that from the model. But you can always ask the model to predict the cluster center for your vectors by calling predict(). Regards Sab On Wed, Sep 23, 2015 at 7:24 PM, Tapan Sharma wrote: > Hi All, > > In the KMeans example provided under mllib, it

Re: How to subtract two RDDs with same size

2015-09-23 Thread Sujit Pal
Hi Zhiliang, How about doing something like this? val rdd3 = rdd1.zip(rdd2).map(p => p._1.zip(p._2).map(z => z._1 - z._2)) The first zip will join the two RDDs and produce an RDD of (Array[Float], Array[Float]) pairs. On each pair, we zip the two Array[Float] components together to form an
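
A minimal sketch of that approach, assuming rdd1 and rdd2 are RDD[Array[Float]] with identical ordering and partitioning:

    import org.apache.spark.rdd.RDD

    // Pair up corresponding rows, then subtract matching positions within each row.
    val rdd3: RDD[Array[Float]] = rdd1.zip(rdd2).map { case (row1, row2) =>
      row1.zip(row2).map { case (x, y) => x - y }
    }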

Re: Calling a method parallel

2015-09-23 Thread Sujit Pal
Hi Tapan, Perhaps this may work? It takes a range of 0..100 and creates an RDD out of them, then calls X(i) on each. The X(i) should be executed on the workers in parallel. Scala: val results = sc.parallelize(0 until 100).map(idx => X(idx)) Python: results =

Re: Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-23 Thread Cody Koeninger
TD can correct me on this, but I believe checkpointing is done after a set of jobs is submitted, not after they are completed. If you fail while processing the jobs, starting over from that checkpoint should put you in the correct state. In any case, are you actually observing a loss of messages

Re: How to control spark.sql.shuffle.partitions per query

2015-09-23 Thread Ted Yu
Please take a look at the following for example: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala Search for spark.sql.shuffle.partitions and SQLConf.SHUFFLE_PARTITIONS.key FYI On Wed, Sep 23, 2015 at 12:42 AM, tridib
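
A minimal sketch of switching the setting between queries on the same SQLContext (table and column names are illustrative):

    // Large join: use many shuffle partitions.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
    val bigJoin = sqlContext.sql("SELECT * FROM big_a JOIN big_b ON big_a.id = big_b.id")

    // Smaller query afterwards: drop back to the default.
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")
    val smallAgg = sqlContext.sql("SELECT key, count(*) FROM small_t GROUP BY key")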

Calling a method parallel

2015-09-23 Thread Tapan Sharma
Hi All, I want to call a method X(int i) from my Spark program for different values of i, i.e. X(1), X(2), ..., X(n). Each call returns one object. Currently I am doing this sequentially. Is there any way to run these in parallel and get back the list of objects? Sorry for this

K Means Explanation

2015-09-23 Thread Tapan Sharma
Hi All, In the KMeans example provided under mllib, it traverses the outcome of KMeansModel to get the cluster centers like this: KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs, KMeans.K_MEANS_PARALLEL()); System.out.println("Cluster centers:"); for (Vector center :

Re: How to turn off Jetty Http stack errors on Spark web

2015-09-23 Thread Ted Yu
Have you read this ? http://stackoverflow.com/questions/2246074/how-do-i-hide-stack-traces-in-the-browser-using-jetty On Wed, Sep 23, 2015 at 6:56 AM, Rafal Grzymkowski wrote: > Hi, > > Is it possible to disable Jetty stack trace with errors on Spark > master:8080 ? > When I trigger

Re: How to get RDD from PairRDD<key,value> in Java

2015-09-23 Thread Ankur Srivastava
PairRdd.values is what you need. Ankur On Tue, Sep 22, 2015, 11:25 PM Zhang, Jingyu wrote: > Hi All, > > I want to extract the "value" RDD from PairRDD in Java > > Please let me know how can I get it easily. > > Thanks > > Jingyu > > > This message and its

Re: SparkContext declared as object variable

2015-09-23 Thread Akhil Das
Yes of course it works. [image: Inline image 1] Thanks Best Regards On Tue, Sep 22, 2015 at 4:53 PM, Priya Ch wrote: > Parallelzing some collection (array of strings). Infact in our product we > are reading data from kafka using KafkaUtils.createStream and

Re: unsubscribe

2015-09-23 Thread Richard Hillegas
Hi Ntale, To unsubscribe from the user list, please send a message to user-unsubscr...@spark.apache.org as described here: http://spark.apache.org/community.html#mailing-lists. Thanks, -Rick Ntale Lukama wrote on 09/23/2015 04:34:48 AM: > From: Ntale Lukama

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 14:56, Michal Čizmazia wrote: To get around the fact that flush does not work in S3, my custom WAL implementation stores a separate S3 object per each WriteAheadLog.write call. Do you see any gotchas with this approach?

Re: Calling a method parallel

2015-09-23 Thread Robineast
The following should give you what you need: val results = sc.makeRDD(1 to n).map(X(_)).collect This should return the results as an array. _ Robin East Spark GraphX in Action - Michael Malak and Robin East Manning Publications

Re: How to turn off Jetty Http stack errors on Spark web

2015-09-23 Thread Rafal Grzymkowski
Yes, I've seen it, but there are no web.xml and error.jsp files in the binary installation of Spark. To apply this solution I should probably take the Spark sources, then create the missing files and then recompile Spark. Right? I am looking for a solution to turn off error details without recompilation.

Re: KafkaProducer using Cassandra as source

2015-09-23 Thread kali.tumm...@gmail.com
Guys sorry I figured it out. val x=p.collect().mkString("\n").replace("[","").replace("]","").replace(",","~") Full Code:- package com.examples /** * Created by kalit_000 on 22/09/2015. */ import kafka.producer.KeyedMessage import kafka.producer.Producer import kafka.producer.ProducerConfig

create table in hive from spark-sql

2015-09-23 Thread Mohit Singh
Probably a noob question. But I am trying to create a hive table using spark-sql. Here is what I am trying to do: hc = HiveContext(sc) hdf = hc.parquetFile(output_path) data_types = hdf.dtypes schema = "(" + " ,".join(map(lambda x: x[0] + " " + x[1], data_types)) +")" hc.sql(" CREATE TABLE IF

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Deenar Toraskar
Rafal Check this out https://spark.apache.org/docs/latest/security.html Regards Deenar On 23 September 2015 at 19:13, Rafal Grzymkowski wrote: > Hi, > > I want to enable basic Http authentication for the spark web UI (without > recompilation need for Spark). > I see there is

Re: Cosine LSH Join

2015-09-23 Thread Nick Pentreath
Not sure of performance but DIMSUM only handles "column similarity" and scales to maybe 100k columns. Item-item similarity e.g. in MF models often requires millions of items (would be millions of columns in DIMSUM). So one needs an LSH type approach or brute force via Cartesian product (but

Re: JdbcRDD Constructor

2015-09-23 Thread Deenar Toraskar
Satish, can you post the SQL query you are using? The SQL query must have 2 placeholders, and both of them should form an inclusive range (<= and >=), e.g. select title, author from books where ? <= id and id <= ? Are you doing this? Deenar On 23 September 2015 at 20:18, Deenar Toraskar <
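
A minimal sketch of a JdbcRDD built around such a query (the JDBC URL, bounds and partition count are placeholders):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val books = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:...", "user", "password"),
      "SELECT title, author FROM books WHERE ? <= id AND id <= ?",
      1,    // lower bound, substituted for the first ?
      100,  // upper bound, substituted for the second ?
      3,    // number of partitions
      rs => (rs.getString("title"), rs.getString("author")))

    println(books.count())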

Re: WAL on S3

2015-09-23 Thread Michal Čizmazia
Thanks Steve! FYI: S3 now supports GET-after-PUT consistency for new objects in all regions, including US Standard https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/

Re: Cosine LSH Join

2015-09-23 Thread Charlie Hack
This is great! Pretty sure I have a use for it involving entity resolution of text records. How does this compare to the DIMSUM similarity join implementation in MLlib performance-wise, out of curiosity? Thanks, Charlie On Wednesday, Sep 23, 2015 at 09:25, Nick

Provide sampling ratio while loading json in spark version > 1.4.0

2015-09-23 Thread Udit Mehta
Hi, in earlier versions of Spark (< 1.4.0), we were able to specify the sampling ratio while using *sqlContext.jsonFile* or *sqlContext.jsonRDD* so that we don't inspect each and every element while inferring the schema. I see that the use of these methods is deprecated in the newer Spark version

How to turn on basic authentication for the Spark Web

2015-09-23 Thread Rafal Grzymkowski
Hi, I want to enable basic HTTP authentication for the Spark web UI (without needing to recompile Spark). I see there is a 'spark.ui.filters' option but don't know how to use it. I found the possibility of using the kerberos param but it's not an option for me. What should I set there to use a secret token

Re: Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Tathagata Das
Yes, since 1.4.0 it shuts down the StreamingContext (non-gracefully) from a shutdown hook. You can make it shut down gracefully in that hook by setting the SparkConf property "spark.streaming.stopGracefullyOnShutdown" to "true". Note to self: document this in the programming guide. On Wed, Sep 23, 2015 at
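
A minimal sketch of that configuration (application name and batch interval are arbitrary):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-app")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(300))
    // No custom sys.ShutdownHookThread is needed; the hook Spark installs
    // will now stop the StreamingContext gracefully on a kill signal.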

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Tathagata Das
There could be multiple reasons for caching stopping at 90% - 1. not enough aggregate space in the cluster - increase cluster memory 2. data is skewed among executors, so one executor is trying to cache too much while others are idle - repartition the data using RDD.repartition to force an even distribution. The

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Rafal Grzymkowski
I know this Spark security page, but the information there is not sufficient. Has anyone made it work? Those basic servlet filters for ui.filters

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Deenar Toraskar
Check this out http://lambda.fortytools.com/post/26977061125/servlet-filter-for-http-basic-auth or https://gist.github.com/neolitec/8953607 for examples of filters implementing basic authentication. Implement one of these and set them in the spark.ui.filters property. Deenar On 23 September 2015
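
A minimal, untested sketch of such a filter in Scala; the package, class name and hard-coded credentials are hypothetical:

    package com.example.filters

    import javax.servlet._
    import javax.servlet.http.{HttpServletRequest, HttpServletResponse}
    import javax.xml.bind.DatatypeConverter

    class BasicAuthFilter extends Filter {
      // Expected "Authorization" header value, computed once at init time.
      private var expected: String = _

      override def init(conf: FilterConfig): Unit = {
        // Credentials are hard-coded for brevity; filter init parameters or
        // environment variables would be a better home for them.
        val user = "admin"
        val password = "secret"
        expected = "Basic " + DatatypeConverter.printBase64Binary((user + ":" + password).getBytes("UTF-8"))
      }

      override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
        val request = req.asInstanceOf[HttpServletRequest]
        val response = res.asInstanceOf[HttpServletResponse]
        val header = request.getHeader("Authorization")
        if (header != null && header == expected) {
          chain.doFilter(req, res) // credentials match, continue to the UI
        } else {
          response.setHeader("WWW-Authenticate", "Basic realm=\"Spark UI\"")
          response.sendError(HttpServletResponse.SC_UNAUTHORIZED)
        }
      }

      override def destroy(): Unit = {}
    }

The compiled class would then go on the driver classpath and be referenced by name, e.g. spark.ui.filters=com.example.filters.BasicAuthFilter.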

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Tathagata Das
If the RDD is not constantly in use, then the LRU scheme in each executor can kick out some of the partitions from memory. If you want to avoid recomputing in such cases, you could persist with StorageLevel.MEMORY_AND_DISK, where the partitions will be dropped to disk when kicked out of memory. That
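
A minimal sketch of that persistence level (myRdd stands in for whatever RDD is reused across jobs):

    import org.apache.spark.storage.StorageLevel

    // Partitions evicted by the LRU scheme spill to local disk instead of being recomputed.
    val cached = myRdd.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()  // materializes the cache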

Re: How to get RDD from PairRDD<key,value> in Java

2015-09-23 Thread Andy Huang
use .values() which will return an RDD of just values On Wed, Sep 23, 2015 at 4:24 PM, Zhang, Jingyu wrote: > Hi All, > > I want to extract the "value" RDD from PairRDD in Java > > Please let me know how can I get it easily. > > Thanks > > Jingyu > > > This

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-23 Thread Petr Novak
You can implement your own class (mimicking a case class) that supports more than 22 fields. It is something like: class MyRecord(val val1: String, val val2: String, ... more than 22, in this case e.g. 26) extends Product with Serializable { def canEqual(that: Any): Boolean = that.isInstanceOf[MyRecord] def
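
A minimal sketch completing the pattern; only three fields are written out here, but the same shape extends to 26 (or 37) fields, since Product itself has no 22-field limit:

    class MyRecord(val val1: String, val val2: String, val val3: String)
        extends Product with Serializable {

      override def canEqual(that: Any): Boolean = that.isInstanceOf[MyRecord]
      override def productArity: Int = 3  // 26 (or 37) in the real record
      override def productElement(n: Int): Any = n match {
        case 0 => val1
        case 1 => val2
        case 2 => val3
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }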

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-23 Thread Petr Novak
If you need to understand what is the magic Product then google up Algebraic Data Types and learn it together with what is Sum type. One option is http://www.stephanboyer.com/post/18/algebraic-data-types Enjoy, Petr On Wed, Sep 23, 2015 at 9:07 AM, Petr Novak wrote: > I'm

Is it possible to merged delayed batches in streaming?

2015-09-23 Thread Bin Wang
I'm using Spark Streaming and there may be some delays between batches. I'd like to know whether it is possible to merge delayed batches into one batch for processing. For example, the interval is set to 5 min but the first batch takes 1 hour, so there are many delayed batches. At the end of processing

How to subtract two RDDs with same size

2015-09-23 Thread Zhiliang Zhu
Hi All, There are two RDDs: rdd1 and rdd2. That is to say, rdd1 and rdd2 are similar to a DataFrame, or a Matrix, with the same number of rows and columns. I would like to get an RDD rdd3, where each element in rdd3 is the difference between rdd1 and rdd2 at the same position,

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-23 Thread tridib
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to join two 1-billion-row tables in 3 minutes. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24782.html Sent from the

Re: WAL on S3

2015-09-23 Thread Tathagata Das
Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: > Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? > > Yes. Because checkpoints are single files by itself, and does not require flush semantics to work. So S3 is fine. > Trying to answer

How to get RDD from PairRDD<key,value> in Java

2015-09-23 Thread Zhang, Jingyu
Hi All, I want to extract the "value" RDD from PairRDD in Java Please let me know how can I get it easily. Thanks Jingyu -- This message and its attachments may contain legally privileged or confidential information. It is intended solely for the named addressee. If you are not

Re: Spark as standalone or with Hadoop stack.

2015-09-23 Thread Sean Owen
Might be better for another list, but, I suspect it's more than HBase is simply much more integrated with YARN, and because it's run with other services that are as well. On Wed, Sep 23, 2015 at 12:02 AM, Jacek Laskowski wrote: > That sentence caught my attention. Could you

Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Tathagata Das
Do your simple Spark batch jobs work in the same YARN setup? Maybe YARN is not able to provide the resources you are asking for. On Tue, Sep 22, 2015 at 5:49 PM, Bryan Jeffrey wrote: > Hello. > > I have a Spark streaming job running on a cluster managed by Yarn.

Topic Modelling- LDA

2015-09-23 Thread Subshiri S
Hi, I am experimenting with Spark LDA. How do I create a topic model for prediction in Spark? How do I evaluate the topics modelled in Spark? Could you point to some examples? Regards, Subshiri

Re: Topic Modelling- LDA

2015-09-23 Thread Sameer Farooqui
Hi Subshri, You may find these 2 blog posts useful: https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-spark.html On Tue, Sep 22, 2015 at 11:54 PM, Subshiri S

Re: How to subtract two RDDs with same size

2015-09-23 Thread Zhiliang Zhu
There is a matrix add API; one might map each row element of rdd2 to its negative, then take rdd1 and rdd2 and call add? Or some other ways ... On Wednesday, September 23, 2015 3:11 PM, Zhiliang Zhu wrote: Hi All, There are two RDDs :  RDD rdd1, and

Re: Streaming Receiver Imbalance Problem

2015-09-23 Thread SLiZn Liu
The imbalance was caused by the stuck partition; after 10s of hours the receiving rate went down. But the second ERROR log I mentioned in the first mail now occurs in most tasks (I didn't count, but it keeps flooding my terminal) and jeopardizes the job, as every batch takes 2 min (2-15 seconds before)

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-23 Thread satish chandra j
Hi Andy, so I believe that if I opt for the programmatically-built schema approach, then it would not have any restriction such as "case class not allowing more than 22 arguments", as I need to define a schema of around 37 arguments. Regards, Satish Chandra On Wed, Sep 23, 2015 at 9:50 AM,

How to control spark.sql.shuffle.partitions per query

2015-09-23 Thread tridib
I am having a GC issue with the default value of spark.sql.shuffle.partitions (200). When I increase it to 2000, the shuffle join works fine. I want to use different values for spark.sql.shuffle.partitions, depending on data volume, for different queries which are fired from the same SparkSQL context. Thanks

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-23 Thread Andy Huang
Yes and I would recommend it because it can be made generic and reusable too On Wed, Sep 23, 2015 at 5:37 PM, satish chandra j wrote: > HI Andy, > So I believe if I opt pro grammatically building the schema approach, than > it would not have have any restriction as
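
A minimal sketch of the programmatically-built schema approach (column names, types and the input path are assumed; sc and sqlContext come from the usual shell/context setup):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Build a 37-column schema without any case class.
    val columnNames = (1 to 37).map(i => s"col$i")
    val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

    // Parse each input line into a Row with 37 values and attach the schema.
    val rowRDD = sc.textFile("input.csv").map(_.split(",")).map(fields => Row.fromSeq(fields))
    val df = sqlContext.createDataFrame(rowRDD, schema)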

Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-23 Thread Petr Novak
Hi, I have 2 streams and checkpointing with code based on documentation. One stream is transforming data from Kafka and saves them to Parquet file. The other stream uses the same stream and does updateStateByKey to compute some aggregations. There is no gracefulShutdown. Both use about this code

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 07:10, Tathagata Das wrote: Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Yes. Because

Re: Is it possible to merged delayed batches in streaming?

2015-09-23 Thread Tathagata Das
It's not possible. And it's actually fundamentally challenging to do so in the general case, because it becomes hard to reason about the processing semantics, especially when there are per-batch aggregations. On Wed, Sep 23, 2015 at 12:17 AM, Bin Wang wrote: > I'm using Spark

Re: spark + parquet + schema name and metadata

2015-09-23 Thread Borisa Zivkovic
Hi, thanks a lot for this! I will try it out to see if this works ok. I am planning to use "stable" metadata - so those will be same across all parquet files inside directory hierarchy... On Tue, 22 Sep 2015 at 18:54 Cheng Lian wrote: > Michael reminded me that

Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Anfernee Xu
Hi Spark experts, I'm coming across these terminologies and having some confusions, could you please help me understand them better? For instance I have implemented a Hadoop InputFormat to load my external data in Spark, in turn my custom InputFormat will create a bunch of InputSplit's, my

How to obtain the key in updateStateByKey

2015-09-23 Thread swetha
Hi, How to obtain the current key in updateStateBykey ? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-obtain-the-key-in-updateStateByKey-tp24792.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

LogisticRegression models consumes all driver memory

2015-09-23 Thread Eugene Zhulenev
We are running Apache Spark 1.5.0 (latest code from 1.5 branch) We are running 2-3 LogisticRegression models in parallel (we'd love to run 10-20 actually), they are not really big at all, maybe 1-2 million rows in each model. Cluster itself, and all executors look good. Enough free memory and no

Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Tathagata Das
CC'ing Hari, who may have a better sense of what's going on. -- Forwarded message -- From: Bryan Date: Wed, Sep 23, 2015 at 3:43 AM Subject: RE: Yarn Shutting Down Spark Processing To: Tathagata Das Cc: user

Re: KafkaProducer using Cassandra as source

2015-09-23 Thread Todd Nist
Hi Kali, If you do not mind sending JSON, you could do something like this, using json4s: val rows = p.collect() map ( row => TestTable(row.getString(0), row.getString(1)) ) val json = parse(write(rows)) producer.send(new KeyedMessage[String, String]("trade", writePretty(json))) // or for
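
A self-contained sketch of the json4s piece (the TestTable case class and values are illustrative; the Kafka producer setup is omitted):

    import org.json4s.DefaultFormats
    import org.json4s.jackson.Serialization.writePretty

    case class TestTable(symbol: String, price: String)
    implicit val formats = DefaultFormats

    // Serialize the collected rows to a JSON string before handing them to the producer.
    val rows = Seq(TestTable("ABC", "10.5"), TestTable("XYZ", "42.0"))
    val json = writePretty(rows)
    // producer.send(new KeyedMessage[String, String]("trade", json))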

Join over many small files

2015-09-23 Thread Tracewski, Lukasz
Hi all, I would like to ask you for advice on how to efficiently perform a join operation in Spark with tens of thousands of tiny files. A single file has a few KB and ~50 rows. In another scenario they might have 200 KB and 2000 rows. To give you an impression of what they look like: File 01 ID |

Re: Spark as standalone or with Hadoop stack.

2015-09-23 Thread Ted Yu
HDFS on Mesos framework is still being developed. What I said previously reflected current deployment practice. Things may change in the future. On Tue, Sep 22, 2015 at 4:02 PM, Jacek Laskowski wrote: > On Tue, Sep 22, 2015 at 10:03 PM, Ted Yu wrote: > >

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Also - I double checked - we're setting the master to "yarn-cluster" -Original Message- From: "Tathagata Das" Sent: 9/23/2015 2:38 PM To: "Bryan" Cc: "user" ; "Hari Shreedharan" Subject:

Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Marcelo Vanzin
Did you look at your application's logs (using the "yarn logs" command?). That error means your application is failing to create a SparkContext. So either you have a bug in your code, or there will be some error in the log pointing at the actual reason for the failure. On Tue, Sep 22, 2015 at

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Marcelo, The error below is from the application logs. The spark streaming context is initialized and actively processing data when yarn claims that the context is not initialized. There are a number of errors, but they're all associated with the ssc shutting down. Regards, Bryan Jeffrey

Re: Java Heap Space Error

2015-09-23 Thread Yusuf Can Gürkan
Yes, it's possible. I use S3 as the data source. My external tables are partitioned. The task below is 193/200. The job has 2 stages, and it is the 193rd task of 200 in the 2nd stage because of sql.shuffle.partitions. How can I avoid this situation? This is my query: select userid,concat_ws('

Re: Join over many small files

2015-09-23 Thread ayan guha
I think this can be a good case for using the sequence file format to pack many files into a few sequence files, with file name as key and content as value. Then read it as an RDD and produce tuples like you mentioned (key=fileno+id, value=value). After that, it is a simple map operation to generate the diff
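
A minimal sketch of that packing step, with illustrative paths, assuming pipe-separated rows with the ID in the first column; wholeTextFiles yields (fileName, fileContent) pairs that can be written out once as a sequence file and re-read cheaply afterwards:

    // One-off packing job: many tiny files -> sequence file(s) of (fileName, fileContent).
    val packed = sc.wholeTextFiles("hdfs:///data/tiny-files/*")
    packed.saveAsSequenceFile("hdfs:///data/packed")

    // Later jobs read the packed form and key each row by (file, id) for the join.
    val rows = sc.sequenceFile[String, String]("hdfs:///data/packed").flatMap {
      case (fileName, content) =>
        content.split("\n").map(line => (fileName + "_" + line.split("\\|")(0).trim, line))
    }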

Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Marcelo Vanzin
But that's not the complete application log. You say the streaming context is initialized, but can you show that in the logs? There's something happening that is causing the SparkContext to not be registered with the YARN backend, and that's why your application is being killed. If you can share

Re: Using Spark for portfolio manager app

2015-09-23 Thread ALEX K
Thuy, if you decide to go with HBase for external storage, consider using a lightweight SQL layer such as Apache Phoenix. It has a Spark plugin and a JDBC driver, and throughput is pretty good even for a heavy market data feed (make sure to use batched

Re: Creating BlockMatrix with java API

2015-09-23 Thread Pulasthi Supun Wickramasinghe
Hi YiZhi, actually I was not able to try it out to see if it was working. I sent the previous reply assuming that Sabarish's solution would work :). Sorry if there was any confusion. Best Regards, Pulasthi On Wed, Sep 23, 2015 at 6:47 AM, YiZhi Liu wrote: > Hi Pulasthi, >

reduceByKeyAndWindow confusion

2015-09-23 Thread srungarapu vamsi
I create a stream from Kafka as below: val kafkaDStream = KafkaUtils.createDirectStream[String,KafkaGenericEvent,StringDecoder,KafkaGenericEventsDecoder](ssc, kafkaConf, Set(topics)) .window(Minutes(WINDOW_DURATION),Minutes(SLIDER_DURATION)) I have a map ("intToStringList") which is a

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
AHA! I figured it out, but it required some tedious remote debugging of the Spark ApplicationMaster. (But now I understand the Spark codebase a little better than before, so I guess I'm not too put out. =P) Here's what's happening... I am setting spark.dynamicAllocation.minExecutors=1 but am not

Re: SparkR for accumulo

2015-09-23 Thread madhvi.gupta
Ok, thanks. Thanks and Regards Madhvi Gupta On Thursday 24 September 2015 08:12 AM, Sun, Rui wrote: No. It is possible you create a helper function which can creat accumulo data RDDs in Scala or Java (maybe put such code in a JAR, add using --jar on the command line to start SparkR to use

Re: SparkR for accumulo

2015-09-23 Thread madhvi.gupta
Hi, is there any other way to proceed to create an RRDD from a source RDD other than a text RDD? Or to use any other format of data stored in HDFS in SparkR? Also, please elaborate on the kind of step missing in SparkR for this. Thanks and Regards Madhvi Gupta On Thursday 24 September 2015

How to fix some WARN when submit job on spark 1.5 YARN

2015-09-23 Thread r7raul1...@163.com
1 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 2 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 3 WARN Unable to load native-hadoop library for your platform r7raul1...@163.com

RE: SparkR for accumulo

2015-09-23 Thread Sun, Rui
No. It is possible to create a helper function which can create Accumulo data RDDs in Scala or Java (maybe put such code in a JAR and add it using --jars on the command line when starting SparkR?), and in SparkR you can use the private functions like callJMethod to call it, and the created RDD

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
You want to reduce the number of partitions to around the number of executors * cores. You have so many tasks/partitions that it puts a lot of pressure on treeReduce in LoR. Let me know if this helps. Sincerely, DB Tsai -- Blog:
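
A minimal sketch of that adjustment (cluster numbers are placeholders; rawTraining stands in for the training DataFrame with the usual label/features columns):

    import org.apache.spark.ml.classification.LogisticRegression

    val numExecutors = 16       // placeholder: executors in the cluster
    val coresPerExecutor = 4    // placeholder: cores per executor

    // Shrink the partition count to roughly executors * cores before training.
    val training = rawTraining.coalesce(numExecutors * coresPerExecutor)
    val model = new LogisticRegression().setMaxIter(100).fit(training)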

KMeans Model fails to run

2015-09-23 Thread Soong, Eddie
Hi, why am I getting this error, which prevents my KMeans clustering algorithm from working inside Spark? I'm trying to run a sample Scala model found on the Databricks website on my Cloudera Spark 1-node local VM. For completeness, the Scala program is as follows: Thx import

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Could you paste some of your code for diagnosis? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev

Re: How to obtain the key in updateStateByKey

2015-09-23 Thread Ted Yu
def updateStateByKey[S: ClassTag]( updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], updateFunc is given an iterator; you can access the key with _1 on each tuple from the iterator. On Wed, Sep 23, 2015 at 3:01 PM, swetha wrote: > Hi, > > How to obtain the
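
A minimal sketch of that variant, assuming ssc is the StreamingContext and pairDStream is a DStream[(String, Int)] with a running-count state:

    import org.apache.spark.HashPartitioner

    val counts = pairDStream.updateStateByKey[Int](
      (iter: Iterator[(String, Seq[Int], Option[Int])]) => iter.map {
        case (key, newValues, state) =>
          // The key is in scope here, unlike in the simple (Seq[V], Option[S]) variant.
          (key, state.getOrElse(0) + newValues.sum)
      },
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      rememberPartitioner = true)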

[POWERED BY] Please add our organization

2015-09-23 Thread barmaley
Name: Frontline Systems Inc. URL: www.solver.com Description: • We built an interface between Microsoft Excel and Apache Spark - bringing Big Data from the clusters to Excel enabling tools ranging from simple charts and Power View dashboards to add-ins for machine learning and predictive

caching DataFrames

2015-09-23 Thread Zhang, Jingyu
I have DataFrames A and B. A has columns a11, a12, a21, a22; B has columns b11, b12, b21, b22. I persist them in cache: 1. A.cache(), 2. B.cache(). Then I persist the subsets in cache later: 3. DataFrame A1 (a11,a12).cache() 4. DataFrame B1 (b11,b12).cache() 5. DataFrame AB1

[POWERED BY] Please add our organization

2015-09-23 Thread Oleg Shirokikh
Name: Frontline Systems Inc. URL: www.solver.com Description: * We built an interface between Microsoft Excel and Apache Spark - bringing Big Data from the clusters to Excel enabling tools ranging from simple charts and Power View dashboards to add-ins for machine

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Sandy Ryza
Hi Anfernee, that's correct: each InputSplit will map to exactly one Spark partition. On YARN, each Spark executor maps to a single YARN container. Each executor can run multiple tasks over its lifetime, both in parallel and sequentially. If you enable dynamic allocation, after the stage

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Your code looks correct for me. How many # of features do you have in this training? How many tasks are running in the job? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D

Re: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
in ./apps/mesos-0.22.1/sbin/mesos-daemon.sh #!/usr/bin/env bash prefix=/apps/mesos-0.22.1 exec_prefix=/apps/mesos-0.22.1 deploy_dir=${prefix}/etc/mesos # Increase the default number of open file descriptors. ulimit -n 8192 Sincerely, DB Tsai

Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
Hi, Recently, we ran into this notorious exception while doing large shuffle in mesos at Netflix. We ensure that `ulimit -n` is a very large number, but still have the issue. It turns out that mesos overrides the `ulimit -n` to a small number causing the problem. It's very non-trivial to debug

CrossValidator speed - for loop on each parameter map?

2015-09-23 Thread julia
I'm using CrossValidator in PySpark (Spark 1.4.1). I've seen in the Estimator class that all 'fit' calls are done sequentially. You can check the _fit method in the CrossValidator class for the current implementation: https://spark.apache.org/docs/1.4.1/api/python/_modules/pyspark/ml/tuning.html In the
