Regarding sliding window example from Databricks for DStream

2016-01-11 Thread Cassa L
Hi, I'm trying to work with the sliding window example given by Databricks: https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html It works fine as expected. My question is: how do I determine when the last phase of the slider has been reached? I

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-11 Thread Michael Armbrust
> > Also, while extracting a value into Dataset using as[U] method, how could > I specify a custom encoder/translation to case class (where I don't have > the same column-name mapping or same data-type mapping)? > There is no public API yet for defining your own encoders. You change the column

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Cheng Lian
Hey Gavin, Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote the files? In particular, you mentioned you saw "3000 jobs" fail. Were you writing each Parquet file with an individual job? (Usually people use

Re: Too many tasks killed the scheduler

2016-01-11 Thread Gavin Yue
Thank you for the suggestion. I tried df.coalesce(1000).write.parquet() and yes, the Parquet file number drops to 1000, but the partition count of the Parquet data is still 5000+. When I read the Parquet and do a count, it still runs 5000+ tasks. So I guess I need to do a repartition here to drop
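For reference, a minimal sketch of the difference being discussed (paths are placeholders, assuming an existing sqlContext): coalesce only merges output files at write time, while repartition shuffles so the data is actually re-bucketed when read back.

  // write with a full shuffle so the re-read DataFrame has roughly 1000 partitions
  df.repartition(1000).write.parquet("hdfs:///out/events")

  val reread = sqlContext.read.parquet("hdfs:///out/events")
  // typically tracks the number of files when each file is below the split size
  println(reread.rdd.partitions.length)

  // alternatively, coalesce after reading to cut the task count of the count()
  println(sqlContext.read.parquet("hdfs:///out/events").coalesce(1000).count())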

Re: XML column not supported in Database

2016-01-11 Thread Reynold Xin
Can you file a JIRA ticket? Thanks. The URL is issues.apache.org/jira/browse/SPARK On Mon, Jan 11, 2016 at 1:44 AM, Gaini Rajeshwar < raja.rajeshwar2...@gmail.com> wrote: > Hi All, > > I am using PostgreSQL database. I am using the following jdbc call to > access a customer table (*customer_id

Re: GroupBy on DataFrame taking too much time

2016-01-11 Thread Xingchi Wang
The error happened at "Lost task 0.0 in stage 0.0". I think it is not a "groupBy" problem; it is an issue with the SQL read of the "customer" table. Please check the JDBC link and whether the data is loaded successfully. Thanks Xingchi 2016-01-11 15:43 GMT+08:00 Gaini Rajeshwar : >

Re: GroupBy on DataFrame taking too much time

2016-01-11 Thread Gaini Rajeshwar
There is no problem with the SQL read. When I do the following it works fine. *val dataframe1 = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres", "dbtable" -> "customer"))* *dataframe1.filter("country = 'BA'").show()* On Mon, Jan 11,

Re: Getting an error while submitting spark jar

2016-01-11 Thread Gaini Rajeshwar
Hi Sree, I think it has to be *--class mllib.perf.TestRunner* instead of *--class mllib.perf.TesRunner* On Mon, Jan 11, 2016 at 1:19 PM, Sree Eedupuganti wrote: > The way how i submitting jar > > hadoop@localhost:/usr/local/hadoop/spark$ ./bin/spark-submit \ > > --class

Logger overridden when using JavaSparkContext

2016-01-11 Thread Max Schmidt
Hi there, we're having a strange problem here using Spark in a Java application that uses the JavaSparkContext: we are using java.util.logging.* for logging in our application with 2 handlers (ConsoleHandler + FileHandler): {{{ .handlers=java.util.logging.ConsoleHandler, java.util.logging.FileHandler

Re: Too many tasks killed the scheduler

2016-01-11 Thread Gavin Yue
Here is more info. The job is stuck at: INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks Then got the error: Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout So I increased

Getting kafka offsets at beginning of spark streaming application

2016-01-11 Thread Abhishek Anand
Hi, Is there a way so that I can fetch the offsets from which Spark Streaming starts reading from Kafka when my application starts? What I am trying to do is create an initial RDD with offsets at a particular time passed as input from the command line, and the offsets from where my spark

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-11 Thread Arkadiusz Bicz
Hi, There is some documentation at https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html and you can also check out the tests of DatasetSuite in the Spark sources. BR, Arkadiusz Bicz On Mon, Jan 11, 2016 at 5:37 AM, Muthu Jayakumar

XML column not supported in Database

2016-01-11 Thread Gaini Rajeshwar
Hi All, I am using PostgreSQL database. I am using the following jdbc call to access a customer table (*customer_id int, event text, country text, content xml)* in my database. *val dataframe1 = sqlContext.load("jdbc", Map("url" ->

libsnappyjava.so: failed to map segment from shared object

2016-01-11 Thread yatinla
I'm trying to get pyspark running on a shared web host. I can get into the pyspark shell but whenever I run a simple command like sc.parallelize([1,2,3,4]).sum() I get an error that seems to stem from some kind of permission issue with libsnappyjava.so: Caused by: java.lang.UnsatisfiedLinkError:

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Jörn Franke
You can look at Ignite as an HDFS cache or for storing RDDs. > On 11 Jan 2016, at 21:14, Dmitry Goldenberg wrote: > > We have a bunch of Spark jobs deployed and a few large resource files such as > e.g. a dictionary for lookups or a statistical model. > > Right now,

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Gavin Yue
Has anyone used Ignite in a production system? On Mon, Jan 11, 2016 at 11:44 PM, Jörn Franke wrote: > You can look at ignite as a HDFS cache or for storing rdds. > > > On 11 Jan 2016, at 21:14, Dmitry Goldenberg > wrote: > > > > We have a bunch

Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-11 Thread charles li
cache is just persist with the default storage level, and it is lazy [ nothing is actually cached ] until the first time the RDD is computed. On Tue, Jan 12, 2016 at 5:13 AM, ponkin wrote: > Hi, > > Here is my use case : > I have kafka topic. The job is fairly simple - it reads topic and
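A minimal illustration of that laziness, assuming r is the KafkaRDD from the original post and the output paths are placeholders:

  val cached = r.cache()                    // same as persist(StorageLevel.MEMORY_ONLY); nothing is read yet
  cached.count()                            // the first action computes the RDD and fills the cache
  cached.saveAsTextFile("hdfs:///out/a")    // these reuse the cached partitions
  cached.saveAsTextFile("hdfs:///out/b")    // instead of re-reading the Kafka topic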

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Sabarish Sasidharan
One option could be to store them as blobs in a cache like Redis and then read + broadcast them from the driver. Or you could store them in HDFS and read + broadcast from the driver. Regards Sab On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg wrote: > We have a
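A rough sketch of the HDFS + broadcast option mentioned above (the path, delimiter, and the lookup RDD "records" are placeholders, assuming an existing sc):

  // load the resource once on the driver...
  val dictionary: Map[String, String] =
    sc.textFile("hdfs:///shared/lookup.tsv")
      .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
      .collect()
      .toMap

  // ...then ship one read-only copy to each executor
  val dictBc = sc.broadcast(dictionary)
  val enriched = records.map(r => (r, dictBc.value.getOrElse(r, "unknown")))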

Re: Put all elements of RDD into array

2016-01-11 Thread Daniel Valdivia
The ArrayBuffer did the trick! Thanks a lot! I'm learning Scala through spark so these details are still new to me Sent from my iPhone > On Jan 11, 2016, at 5:18 PM, Jakob Odersky wrote: > > Hey, > I just reread your question and saw I overlooked some crucial

Various ways to use --jars? Some undocumented ways?

2016-01-11 Thread jiml
(Sorry to "repost" I originally answered/replied to an older question but my part was not expanding) Question is: Looking for all the ways to specify a set of jars using --jars on spark-submit? I know this is old but I am about to submit a proposed docs change on --jars, and I had an issue with

Fwd: how to submit multiple jar files when using spark-submit script in shell?

2016-01-11 Thread UMESH CHAUDHARY
Could you build a fat jar by including all your dependencies along with your application? See here and here

Unshaded google guava classes in spark-network-common jar

2016-01-11 Thread Jake Yoon
I found unshaded Google Guava classes used internally in spark-network-common while working with ElasticSearch. The following link discusses a duplicate-dependency conflict caused by the Guava classes and how I solved the build conflict issue.

Re: libsnappyjava.so: failed to map segment from shared object

2016-01-11 Thread Josh Rosen
This is due to the snappy-java library; I think that you'll have to configure either java.io.tmpdir or org.xerial.snappy.tempdir; see https://github.com/xerial/snappy-java/blob/1198363176ad671d933fdaf0938b8b9e609c0d8a/src/main/java/org/xerial/snappy/SnappyLoader.java#L335 On Mon, Jan 11, 2016
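A hedged example of passing those properties when launching the shell (the temp path is a placeholder; it should be writable and not mounted noexec):

  pyspark \
    --driver-java-options "-Djava.io.tmpdir=/home/youruser/tmp -Dorg.xerial.snappy.tempdir=/home/youruser/tmp" \
    --conf "spark.executor.extraJavaOptions=-Dorg.xerial.snappy.tempdir=/home/youruser/tmp"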

Re: how to submit multiple jar files when using spark-submit script in shell?

2016-01-11 Thread jiml
The question is: looking for all the ways to specify a set of jars using --jars on spark-submit. I know this is old but I am about to submit a proposed docs change on --jars, and I had an issue with --jars today. When this user submitted the following command line, is that a proper way to reference a
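For reference, a sketch of the documented form (all paths and class names are placeholders): --jars takes a comma-separated list, and the application jar plus its arguments come last.

  ./bin/spark-submit \
    --class com.example.MyApp \
    --master spark://master:7077 \
    --jars /opt/libs/dep-one.jar,/opt/libs/dep-two.jar \
    /opt/apps/my-app.jar arg1 arg2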

Re: XML column not supported in Database

2016-01-11 Thread Gaini Rajeshwar
Hi Reynold, I did create an issue in JIRA. It is SPARK-12764 On Tue, Jan 12, 2016 at 4:55 AM, Reynold Xin wrote: > Can you file a JIRA ticket? Thanks. > > The URL is issues.apache.org/jira/browse/SPARK > > On Mon, Jan 11,

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-11 Thread Prabhu Joseph
Umesh, A running task is a thread within the executor process. We need to take a stack trace of the executor process. The executor will be running on a NodeManager machine as a container. The YARN RM UI's running-jobs page will have the host details of where the executor is running. Log in to that NodeManager
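A sketch of the usual steps on that NodeManager host (assuming JDK tools are on the PATH; the pid is a placeholder):

  # the executor JVM runs as CoarseGrainedExecutorBackend inside the YARN container
  jps -lm | grep CoarseGrainedExecutorBackend
  # take a few stack traces of that pid to see what the slow task's thread is doing
  jstack <executor-pid> > executor-stack-1.txt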

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-11 Thread Muthu Jayakumar
Hello Michael, Thank you for the suggestion. This should do the trick for column names. But how could I transform a column's value type? Do I have to use a UDF? If I use a UDF, then the other question I have pertains to the map step in Dataset, where I am running into an error when I
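For the value-type question, a hedged sketch of two common options in 1.6 (DataFrame df and column names are hypothetical): a UDF for arbitrary conversions, or a plain cast when the conversion is simple.

  import org.apache.spark.sql.functions.{col, udf}

  // option 1: a UDF for arbitrary conversions
  val parseLong = udf((s: String) => s.toLong)
  val viaUdf = df.withColumn("user_id_long", parseLong(col("user_id")))

  // option 2: a cast is enough for simple numeric/string conversions
  val viaCast = df.withColumn("user_id_long", col("user_id").cast("long"))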

[KafkaRDD]: rdd.cache() does not seem to work

2016-01-11 Thread ponkin
Hi, Here is my use case : I have kafka topic. The job is fairly simple - it reads topic and save data to several hdfs paths. I create rdd with the following code val r = KafkaUtils.createRDD[Array[Byte],Array[Byte],DefaultDecoder,DefaultDecoder](context,kafkaParams,range) Then I am trying to

Re: Too many tasks killed the scheduler

2016-01-11 Thread Shixiong(Ryan) Zhu
Could you use "coalesce" to reduce the number of partitions? Shixiong Zhu On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue wrote: > Here is more info. > > The job stuck at: > INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks > > Then got the error: > Caused

Re: Windows driver cannot run job on Linux cluster

2016-01-11 Thread Andrew Wooster
I am running spark-1.6.0-bin-hadoop2.6/ There is no stack trace because there is no exception; the driver just seems to be waiting. Here is driver DEBUG log: 16/01/11 16:31:51 INFO SparkContext: Running Spark version 1.6.0 16/01/11 16:31:51 DEBUG MutableMetricsFactory: field

Windows driver cannot run job on Linux cluster

2016-01-11 Thread Andrew Wooster
I have a very simple program that runs fine on my Linux server that runs Spark master and worker in standalone mode. public class SimpleSpark { public int sum () { SparkConf conf = new SparkConf() .setAppName("Magellan") .setMaster("spark://

Re: Read from AWS s3 without having to hard-code sensitive keys

2016-01-11 Thread Jonathan Kelly
Yes, IAM roles are actually required now for EMR. If you use Spark on EMR (vs. just EC2), you get S3 configuration for free (it goes by the name EMRFS), and it will use your IAM role for communicating with S3. Here is the corresponding documentation:

Re: Windows driver cannot run job on Linux cluster

2016-01-11 Thread Ted Yu
Which release of Spark are you using ? Can you pastebin stack trace of executor(s) so that we can have some more clue ? Thanks On Mon, Jan 11, 2016 at 1:10 PM, Andrew Wooster wrote: > I have a very simple program that runs fine on my Linux server that runs > Spark

Read from AWS s3 without having to hard-code sensitive keys

2016-01-11 Thread Krishna Rao
Hi all, Is there a method for reading from s3 without having to hard-code keys? The only 2 ways I've found both require this: 1. Set conf in code e.g.: sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "") sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "") 2. Set keys in URL, e.g.:

Deploying model built in SparkR

2016-01-11 Thread Chandan Verma
Hi All, Has anyone here deployed a model produced in SparkR, or could you at least help me with the steps for deployment? Regards, Chandan Verma Sent from my Sony Xperia™ smartphone

Spark integration with HCatalog (specifically regarding partitions)

2016-01-11 Thread Elliot West
Hello, I am in the process of evaluating Spark (1.5.2) for a wide range of use cases. In particular I'm keen to understand the depth of the integration with HCatalog (aka the Hive Metastore). I am very encouraged when browsing the source contained within the org.apache.spark.sql.hive package. My

Re: Unable to compile from source

2016-01-11 Thread Gaini Rajeshwar
Hey Hareesh, Thanks. That solved the issue. Thanks, Rajeshwar Gaini. On Fri, Jan 8, 2016 at 5:20 PM, hareesh makam wrote: > Are you behind a proxy? > > Or > > Try disabling the SSL check while building. > > >

Manipulate Twitter Stream Filter on runtime

2016-01-11 Thread Filli Alem
Hi, I am trying to implement Twitter stream processing where I want to change the filtered keywords at runtime. I implemented the Twitter stream with a custom receiver, which works fine. I'm stuck with the runtime alteration now. Any ideas? Thanks Alem

Re: Deploying model built in SparkR

2016-01-11 Thread Yanbo Liang
Hi Chandan, Could you tell us what you mean by deploying the model? Using the model to make predictions from R? Thanks Yanbo 2016-01-11 20:40 GMT+08:00 Chandan Verma : > Hi All, > > Does any one over here has deployed a model produced in SparkR or atleast > help me with

Long running jobs in CDH

2016-01-11 Thread Jan Holmberg
Hi, any preferences for how to run constantly running (streaming) jobs in CDH? Oozie? Command line? Something else? cheers, -jan

Re: Create a n x n graph given only the vertices no

2016-01-11 Thread praveen S
Yes, I was looking for something of that sort. Thank you. Actually I was looking for a way to connect nodes based on the properties of the nodes. I have a set of nodes and I know the condition under which I can create an edge. On 11 Jan 2016 14:06, "Robin East" wrote: > Do you

Re: broadcast params to workers at the very beginning

2016-01-11 Thread Yanbo Liang
Hi, The parameters should be broadcast again after you update them on the driver side; then you can get the updated version on the worker side. Thanks Yanbo 2016-01-09 23:12 GMT+08:00 octavian.ganea : > Hi, > > In my app, I have a Params scala object that keeps all the specific
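A minimal sketch of that re-broadcast pattern (the parameter map and the RDD "data" are placeholders, assuming an existing sc):

  var params = Map("threshold" -> 0.5)
  var paramsBc = sc.broadcast(params)
  data.filter(x => x > paramsBc.value("threshold")).count()

  // after changing the value on the driver, drop the old copies and broadcast again
  paramsBc.unpersist()
  params = Map("threshold" -> 0.8)
  paramsBc = sc.broadcast(params)
  data.filter(x => x > paramsBc.value("threshold")).count()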

Trying to understand dynamic resource allocation

2016-01-11 Thread Yiannis Gkoufas
Hi, I am exploring a bit the dynamic resource allocation provided by the Standalone Cluster Mode and I was wondering whether this behavior I am experiencing is expected. In my configuration I have 3 slaves with 24 cores each. I have in my spark-defaults.conf: spark.shuffle.service.enabled true
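For reference, the configuration block being discussed usually looks something like the sketch below in spark-defaults.conf (values are placeholders; whether it takes effect outside YARN is the open question in this thread):

  spark.shuffle.service.enabled                 true
  spark.dynamicAllocation.enabled               true
  spark.dynamicAllocation.minExecutors          1
  spark.dynamicAllocation.maxExecutors          9
  spark.dynamicAllocation.executorIdleTimeout   60s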

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-11 Thread Umesh Kacha
Hi Prabhu, thanks for the response. How do I find the PID of a slow-running task? The task is running on a YARN cluster node. When I try to see the PID of a running task using my user, I see some 7-8 digit number instead of the user running the process. Any idea why Spark shows this number instead of displaying the user? On

Re: StandardScaler in spark.ml.feature requires vector input?

2016-01-11 Thread Yanbo Liang
Hi Kristina, The input column of StandardScaler must be of vector type, because it is usually used for feature scaling before model training, and the feature column is a vector in most cases. If you only want to standardize a numeric column, you can wrap it in a vector and feed it into
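A hedged sketch of that wrapping using VectorAssembler (the DataFrame and column names are hypothetical):

  import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

  // wrap the numeric column in a one-element vector column
  val assembler = new VectorAssembler()
    .setInputCols(Array("amount"))
    .setOutputCol("amountVec")
  val withVec = assembler.transform(df)

  // then standardize the vector column
  val scaler = new StandardScaler()
    .setInputCol("amountVec")
    .setOutputCol("amountScaled")
    .setWithMean(true)
    .setWithStd(true)
  val scaled = scaler.fit(withVec).transform(withVec)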

Re: Logger overridden when using JavaSparkContext

2016-01-11 Thread Max Schmidt
I checked the handlers of my rootLogger (java.util.logging.Logger.getLogger("")) which were a Console and a FileHandler. After the JavaSparkContext was created, the rootLogger only contained an 'org.slf4j.bridge.SLF4JBridgeHandler'. On 11.01.2016 at 10:56, Max Schmidt wrote: > Hi there, > >

Re: Getting kafka offsets at beginning of spark streaming application

2016-01-11 Thread kundan kumar
Hi Cody, My use case is as follows: my application dies at time X and I write the offsets to a DB. Now when my application starts at time Y (a few minutes later), Spark Streaming reads the latest offsets using the createDirectStream method. Now here I want to get the exact offset that

Re: Spark on Apache Ingnite?

2016-01-11 Thread RodrigoB
Although I haven't worked explicitly with either, they do seem to differ in design and consequently in usage scenarios. Ignite is claimed to be a pure in-memory distributed database. With Ignite, updating existing keys is self-managed, compared with Tachyon. In Tachyon, once a

Re: Trying to understand dynamic resource allocation

2016-01-11 Thread Dean Wampler
It works on Mesos, too. I'm not sure about Standalone mode. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com On Mon,

Spark SQL "partition stride"?

2016-01-11 Thread Keith Freeman
The spark docs section for "JDBC to Other Databases" (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) describes the partitioning as "... Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in
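For reference, a sketch of the call being discussed (URL, table, and bounds are placeholders, assuming an existing sqlContext): Spark issues numPartitions queries, each covering a stride of (upperBound - lowerBound) / numPartitions on the partition column, and rows outside the bounds still land in the first or last partition.

  import java.util.Properties

  val props = new Properties()
  props.setProperty("user", "dbuser")
  props.setProperty("password", "dbpass")

  val df = sqlContext.read.jdbc(
    "jdbc:postgresql://dbhost/mydb",  // url
    "customer",                       // table
    "customer_id",                    // partition column
    1L,                               // lowerBound
    1000000L,                         // upperBound
    10,                               // numPartitions
    props)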

Re: Logger overridden when using JavaSparkContext

2016-01-11 Thread Max Schmidt
Okay, I solved this problem... It was my own fault: I had configured the root logger for java.util.logging. An explicit name for the handler/level solved it. On 2016-01-11 12:33, Max Schmidt wrote: I checked the handlers of my rootLogger (java.util.logging.Logger.getLogger("")) which were a
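A hedged sketch of a logging.properties that keeps the handlers on a named logger rather than the root logger (the logger name is hypothetical):

  # attach handlers to the application's own logger, not to the root logger
  com.example.myapp.handlers = java.util.logging.ConsoleHandler, java.util.logging.FileHandler
  com.example.myapp.level = INFO
  com.example.myapp.useParentHandlers = false
  java.util.logging.FileHandler.pattern = %h/myapp-%u.log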

Re: [discuss] dropping Python 2.6 support

2016-01-11 Thread David Chin
FWIW, RHEL 6 still uses Python 2.6, although 2.7.8 and 3.3.2 are available through Red Hat Software Collections. See: https://www.softwarecollections.org/en/ I run an academic compute cluster on RHEL 6. We do, however, provide Python 2.7.x and 3.5.x via modulefiles. On Tue, Jan 5, 2016 at 8:45

Design query regarding dataframe usecase

2016-01-11 Thread Kapil Malik
Hi, We have an analytics use case where we are collecting user click logs. The data can be considered hierarchical, with 3 types of logs - User (attributes like userId, emailId) - Session (attributes like sessionId, device, OS, browser, city etc.) - - PageView (attributes like url, referrer,

Re: Getting kafka offsets at beginning of spark streaming application

2016-01-11 Thread Cody Koeninger
You can use HasOffsetRanges to get the offsets from the rdd, see http://spark.apache.org/docs/latest/streaming-kafka-integration.html Although if you're already saving the offsets to a DB, why not just use that as the starting point of your application? On Mon, Jan 11, 2016 at 11:00 AM, kundan
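A minimal sketch of the HasOffsetRanges pattern from that page (processing and the DB write are left as placeholders):

  import org.apache.spark.streaming.kafka.HasOffsetRanges

  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // process the rdd, then persist the ranges so a restart can resume from them
    offsetRanges.foreach { o =>
      println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
    }
  }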

Re: Getting kafka offsets at beginning of spark streaming application

2016-01-11 Thread Cody Koeninger
I'm not 100% sure what you're asking. If you're asking if it's possible to start a stream at a particular set of offsets, yes, one of the createDirectStream methods takes a map from topicpartition to starting offset. If you're asking if it's possible to query Kafka for the offset corresponding
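A hedged sketch of that createDirectStream variant (topic name, partition numbers, and offsets are placeholders, assuming an existing ssc and kafkaParams):

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  // offsets previously saved to the DB
  val fromOffsets = Map(
    TopicAndPartition("mytopic", 0) -> 12345L,
    TopicAndPartition("mytopic", 1) -> 67890L)

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))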

Re: Spark on Apache Ingnite?

2016-01-11 Thread Koert Kuipers
Where is Ignite's resilience/fault-tolerance design documented? I cannot find it. I would generally stay away from it if fault tolerance is an afterthought. On Mon, Jan 11, 2016 at 10:31 AM, RodrigoB wrote: > Although I haven't work explicitly with either, they do

Re: Trying to understand dynamic resource allocation

2016-01-11 Thread Nick Peterson
My understanding is that dynamic allocation is only enabled for Spark-on-Yarn. Those settings likely have no impact in standalone mode. Nick On Mon, Jan 11, 2016, 5:10 AM Yiannis Gkoufas wrote: > Hi, > > I am exploring a bit the dynamic resource allocation provided by the

Re: GroupBy on DataFrame taking too much time

2016-01-11 Thread Todd Nist
Hi Rajeshwar Gaini, dbtable can be any valid SQL query; simply define it as a subquery, something like: val query = "(SELECT country, count(*) FROM customer group by country) as X" val df1 = sqlContext.read .format("jdbc") .option("url", url) .option("user", username)
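A fuller sketch of the same pattern as the truncated snippet above, with placeholder connection values:

  val query = "(SELECT country, count(*) AS cnt FROM customer GROUP BY country) AS x"
  val df1 = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/customerlogs")
    .option("user", "dbuser")
    .option("password", "dbpass")
    .option("dbtable", query)
    .load()
  df1.show()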

Re: pre-install 3-party Python package on spark cluster

2016-01-11 Thread Andy Davidson
I use https://code.google.com/p/parallel-ssh/ to upgrade all my slaves From: "taotao.li" Date: Sunday, January 10, 2016 at 9:50 PM To: "user @spark" Subject: pre-install 3-party Python package on spark cluster > I have a spark cluster, from

[no subject]

2016-01-11 Thread Daniel Imberman
Hi all, I'm looking for a way to efficiently partition an RDD, but allow the same data to exist on multiple partitions. Let's say I have a key-value RDD with keys {1,2,3,4}. I want to be able to repartition the RDD so that the partitions look like p1 = {1,2} p2 = {2,3} p3 = {3,4}

Re: Read from AWS s3 without having to hard-code sensitive keys

2016-01-11 Thread Sabarish Sasidharan
If you are on EMR, these can go into your hdfs site config. And will work with Spark on YARN by default. Regards Sab On 11-Jan-2016 5:16 pm, "Krishna Rao" wrote: > Hi all, > > Is there a method for reading from s3 without having to hard-code keys? > The only 2 ways I've

Re: pre-install 3-party Python package on spark cluster

2016-01-11 Thread Annabel Melongo
When you run spark-submit in either client or cluster mode, you can use the options --packages or --jars to automatically copy your packages to the worker machines. Thanks On Monday, January 11, 2016 12:52 PM, Andy Davidson wrote: I use

Re: partitioning RDD

2016-01-11 Thread Daniel Imberman
Hi Ted, Sorry about that. I will be more careful in future emails. The values represent buckets (so 1 would represent all values with key 1, 2 represents all values with key 2, etc.). The issue is that I want to have all values of key 1 and key 2 in partition 1, while then

Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Dmitry Goldenberg
We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model. Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments. What are some of the best practices

Re: partitioning RDD

2016-01-11 Thread Ted Yu
Hi, Please use proper subject when sending email to user@ In your example below, what do the values inside curly braces represent ? I assume not the keys since values for same key should go to the same partition. Cheers On Mon, Jan 11, 2016 at 10:51 AM, Daniel Imberman

Re: Read from AWS s3 without having to hard-code sensitive keys

2016-01-11 Thread Matei Zaharia
In production, I'd recommend using IAM roles to avoid having keys altogether. Take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html. Matei > On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan > wrote: > > If you are