Re: Problems with Local Checkpoints

2015-09-14 Thread Akhil Das
You need to set your HADOOP_HOME and make sure the winutils.exe is available in the PATH. Here's a discussion around the same issue http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path Also this JIRA
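
A minimal sketch of the workaround (assuming winutils.exe has been placed under C:\hadoop\bin; the path is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    // Point Hadoop at the directory containing bin\winutils.exe before creating the context.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local[*]"))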

Re: SparkR - Support for Other Models

2015-09-14 Thread Akhil Das
You can look in the Spark JIRA for it; if it isn't available there, you could create an issue requesting support, and hopefully it will be added in a later release. Thanks Best Regards On Thu, Sep 10, 2015 at 11:26 AM, Manish

Re: Replacing Esper with Spark Streaming?

2015-09-14 Thread Todd Nist
Stratio offers a CEP implementation based on Spark Streaming and the Siddhi CEP engine. I have not used the below, but they may be of some value to you: http://stratio.github.io/streaming-cep-engine/ https://github.com/Stratio/streaming-cep-engine HTH. -Todd On Sun, Sep 13, 2015 at 7:49 PM,

application failed on large dataset

2015-09-14 Thread 周千昊
Hi, community I am facing a strange problem: all executors stop responding, and then all of them fail with ExecutorLostFailure. When I look into the YARN logs, they are full of exceptions like 15/09/14 04:35:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning

RE: Problems with Local Checkpoints

2015-09-14 Thread Bryan
Akhil, This looks like the issue. I'll update my path to include the (soon to be added) winutils & assoc. DLLs. Thank you, Bryan -Original Message- From: "Akhil Das" Sent: ‎9/‎14/‎2015 6:46 AM To: "Bryan Jeffrey" Cc: "user"

Re: java.lang.NullPointerException with Twitter API

2015-09-14 Thread Akhil Das
Some statuses might not have a geoLocation, so you end up calling toString on null and hitting that exception; put a null check or try...catch around it to make it work. Thanks Best Regards On Fri, Sep 11, 2015 at 12:59 AM, Jo Sunad wrote: > Hello! > > I am
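
A minimal sketch of the guard (assuming a DStream[twitter4j.Status] named tweets):

    // Option(...) turns a nullable getGeoLocation into Some/None, avoiding the NPE.
    val places = tweets.flatMap(status => Option(status.getGeoLocation).map(_.toString))
    val matches = places.filter(_.contains("San Francisco"))  // hypothetical follow-on filter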

Spark Streaming Topology

2015-09-14 Thread defstat
Hi all, I would like to use Spark Streaming to manage the problem below: I have 2 InputStreams, one for one type of input (n-dimensional vectors) and one for questions about the infrastructure (explained below). I need to "break" the input first across 4 execution nodes, and produce a stream from

Fwd: Spark job failed

2015-09-14 Thread Renu Yadav
-- Forwarded message -- From: Renu Yadav Date: Mon, Sep 14, 2015 at 4:51 PM Subject: Spark job failed To: d...@spark.apache.org I am getting below error while running spark job: storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread Akhil Das
You can look into this doc regarding the connection (it's for GCE, but it should be similar). Thanks Best Regards On Thu, Sep 10, 2015 at 11:20 PM, roni wrote: > I have spark installed on an EC2 cluster. Can I

Re: Implement "LIKE" in SparkSQL

2015-09-14 Thread Jorge Sánchez
I think after you get your table as a DataFrame, you can do a filter over it, something like: val t = sqlContext.sql("select * from table t") val df = t.filter(t("a").contains(t("b"))) Let us know the results. 2015-09-12 10:45 GMT+01:00 liam : > > OK, I got another way, it
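
If plain SQL is preferred, a sketch of the same comparison (assuming a HiveContext or Spark 1.5+, where concat is available):

    // LIKE against a pattern built from another column of the same table.
    val df = sqlContext.sql("SELECT * FROM t WHERE a LIKE concat('%', b, '%')")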

Re: Spark task hangs infinitely when accessing S3

2015-09-14 Thread Akhil Das
Are you sitting behind a proxy or something? Can you look more into the executor logs? I have a strange feeling that you are blowing the memory (and possibly hitting GC etc). Thanks Best Regards On Thu, Sep 10, 2015 at 10:05 PM, Mario Pastorelli < mario.pastore...@teralytics.ch> wrote: > Dear

Twitter Streaming using Twitter Public Streaming API and Apache Spark

2015-09-14 Thread Sadaf
Hi, I want to fetch PUBLIC tweets (not particular to any account) containing a particular HASHTAG (#) (e.g. "CocaCola" in my case) from twitter. I made an APP on twitter to get the credentials, and then used the Twitter Public Streaming API. Below is the piece of code. { val config = new
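
A minimal sketch of hashtag filtering with the spark-streaming-twitter module (assuming the OAuth keys are already set as twitter4j system properties):

    import org.apache.spark.streaming.twitter.TwitterUtils
    // Pass the hashtag as a filter term; None means "use the default twitter4j authorization".
    val stream = TwitterUtils.createStream(ssc, None, Seq("#CocaCola"))
    val texts = stream.map(_.getText)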

hdfs-ha on mesos - odd bug

2015-09-14 Thread Adrian Bridgett
I'm hitting an odd issue with running spark on mesos together with HA-HDFS, with an even odder workaround. In particular I get an error that it can't find the HDFS nameservice unless I put in a _broken_ url (discovered that workaround by mistake!). core-site.xml and hdfs-site.xml are distributed

Re: Spark job failed

2015-09-14 Thread Ted Yu
Have you considered posting on vendor forum ? FYI On Mon, Sep 14, 2015 at 6:09 AM, Renu Yadav wrote: > > -- Forwarded message -- > From: Renu Yadav > Date: Mon, Sep 14, 2015 at 4:51 PM > Subject: Spark job failed > To:

Approach validation - building merged datasets - Spark SQL

2015-09-14 Thread Vajra L
Folks- I am very new to Spark and Spark SQL. Here is what I am doing in my application. Can you please validate and let me know if there is a better way? 1. Parsing XML files with nested structures, ingested into individual datasets. Created a custom input format to split XML so each node

Re: JavaRDD using Reflection

2015-09-14 Thread Ankur Srivastava
Hi Rachana I didn't get your question fully, but as the error says, you cannot perform an RDD transformation or action inside another transformation. In your example you are performing an action, "rdd2.values.count()", inside the "map" transformation. It is not allowed and in any case this will be

Where can I learn how to write udf?

2015-09-14 Thread Saif.A.Ellafi
Hi all, I am failing to find a proper guide or tutorial on how to write proper UDF functions in Scala. Appreciate the effort saving, Saif

Creating fat jar with all resources.(Spark-Java-Maven)

2015-09-14 Thread Vipul Rai
HI All, I have a spark app written in Java, which parses the incoming log using headers which are in .xml. (There are many headers, and logs come from 15-20 devices in various formats and with various separators). I am able to run it in local mode after specifying all the resources and passing it as

A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stages", ones that appear to be hanging? In other words, is there a timeout for processing of a given RDD? In the Spark GUI, I see the "kill" function for a given Stage under 'Details for Job <...>'. Is there something in Spark that
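
As far as I know there is no per-stage timeout, but speculative execution can re-launch laggard tasks (not whole stages). A sketch of the relevant settings (values are illustrative):

    val conf = new SparkConf()
      .set("spark.speculation", "true")             // enable speculative re-launch of slow tasks
      .set("spark.speculation.multiplier", "1.5")   // "slow" = 1.5x the median task duration
      .set("spark.speculation.quantile", "0.75")    // check only after 75% of tasks have finished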

JavaRDD using Reflection

2015-09-14 Thread Rachana Srivastava
Hello all, I am working on a problem that requires us to create different sets of JavaRDDs based on different input arguments. We are getting the following error when we try to use a factory to create JavaRDDs. The error message is clear, but I am wondering whether there is any workaround. Question: How to create

Re: Where can I learn how to write udf?

2015-09-14 Thread Silvio Fiorito
Hi Saif, There are 2 types of UDFs. Those used by SQL and those used by the Scala DSL. For SQL, you just register a function like so (this example is from the docs): sqlContext.udf.register("strLen", (s: String) => s.length) sqlContext.sql("select name, strLen(name) from people").show The
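
A sketch of the DSL flavor (assuming a DataFrame df with a string column "name"):

    import org.apache.spark.sql.functions.udf
    // udf() wraps a plain Scala function for use in DataFrame expressions.
    val strLen = udf((s: String) => s.length)
    df.select(df("name"), strLen(df("name"))).show()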

using existing R packages from SparkR

2015-09-14 Thread bobtreacy
I am trying to use an existing R package in SparkR. I am trying to follow the example at https://amplab-extras.github.io/SparkR-pkg/ in the section "Using existing R packages". Here is the sample from amplab-extras: generateSparse <- function(x) { # Use sparseMatrix function from the Matrix

Re: Null Value in DecimalType column of DataFrame

2015-09-14 Thread Yin Huai
btw, move it to user list. On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai wrote: > A scale of 10 means that there are 10 digits at the right of the decimal > point. If you also have precision 10, the range of your data will be [0, 1) > and casting "10.5" to DecimalType(10, 10)
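
A worked sketch of the precision/scale rule (the column name is hypothetical):

    import org.apache.spark.sql.types.DecimalType
    // DecimalType(precision, scale): precision = total digits, scale = digits right of the point.
    // DecimalType(10, 10) leaves no digits before the point, so casting "10.5" yields null;
    // DecimalType(12, 2) leaves 10 digits before the point, so "10.5" fits.
    df.select(df("amount").cast(DecimalType(12, 2)))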

Re: union streams not working for streams > 3

2015-09-14 Thread Gerard Maas
How many cores are you assigning to your spark streaming job? On Mon, Sep 14, 2015 at 10:33 PM, Василец Дмитрий wrote: > hello > I have 4 streams from kafka and streaming is not working, > without any errors or logs, > but with 3 streams everything is perfect. > Only the number of streams matters

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Ricardo Paiva
Thanks Cody. You confirmed that I'm not doing something wrong. I will keep investigating, and if I find something I'll let everybody know. Thanks again. Regards, Ricardo On Mon, Sep 14, 2015 at 6:29 PM, Cody Koeninger wrote: > Yeah, looks like you're right about being unable

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Cody Koeninger
Yeah, looks like you're right about being unable to change those. Upon further reading, even though StreamingContext.getOrCreate makes an entirely new spark conf, Checkpoint will only reload certain properties. I'm not sure if it'd be safe to include memory / cores among those properties that

How to convert dataframe to a nested StructType schema

2015-09-14 Thread Hao Wang
Hi, I created a dataframe with 4 string columns (city, state, country, zipcode). I then applied the following nested schema to it by creating a custom StructType. When I run df.take(5), it gives the exception below as expected. The question is how I can convert the Rows in the dataframe to
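
One sketch of the conversion (column names are from the message; the wrapping struct name "address" is my assumption): the flat Rows must be re-shaped to match the nested schema before re-applying it.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    // Target schema: one struct field wrapping the four flat string columns.
    val nestedSchema = StructType(Seq(
      StructField("address", StructType(Seq(
        StructField("city", StringType),
        StructField("state", StringType),
        StructField("country", StringType),
        StructField("zipcode", StringType))))))
    // Re-shape each flat Row into a nested Row, then create a DataFrame with the new schema.
    val nestedRows = df.map(r => Row(Row(r.getString(0), r.getString(1), r.getString(2), r.getString(3))))
    val nestedDf = sqlContext.createDataFrame(nestedRows, nestedSchema)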

union streams not working for streams > 3

2015-09-14 Thread Василец Дмитрий
hello I have 4 streams from kafka and streaming is not working, without any errors or logs, but with 3 streams everything is perfect. Only the number of streams matters: different triple combinations always work. Any ideas how to debug or fix it?

Re: JavaRDD using Reflection

2015-09-14 Thread Ankur Srivastava
It is not reflection that is the issue here but the use of an RDD transformation "featureKeyClassPair.map" inside "lines.mapToPair". From the code snippet you have sent it is not very clear if getFeatureScore(id,data) invokes executeFeedFeatures, but if that is the case it is not very obvious that

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread roni
Thanks Akhil. Very good article. On Mon, Sep 14, 2015 at 4:15 AM, Akhil Das wrote: > You can look into this doc regarding the connection (it's for GCE, but it should be similar). > > Thanks > Best

add external jar file to Spark shell vs. Scala Shell

2015-09-14 Thread Lan Jiang
Hi, there I ran into a problem when I try to pass an external jar file to spark-shell. I have an uber jar file that contains all the Java code I created for protobuf and all its dependencies. If I simply execute my code using the Scala shell, it works fine without error. I use -cp to pass the
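
For reference, a sketch of the usual way to hand a jar to spark-shell so it also reaches the executors (unlike plain -cp, which only affects the local JVM):

    spark-shell --jars /path/to/uber.jar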

Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hi all, I have a question regarding the ability of ML pipeline to cache intermediate results. I've posted this question on stackoverflow but got no answer, hope someone here can help me out.

Re: Spark Streaming Suggestion

2015-09-14 Thread Jörn Franke
Why did you not stay with the batch approach? To me the architecture looks very complex for the simple thing you want to achieve. Why don't you process the data in Storm already? On Tue, Sep 15, 2015 at 6:20 AM, srungarapu vamsi wrote: > I am pretty new to spark.

Re: Spark aggregateByKey Issues

2015-09-14 Thread Alexis Gillain
I'm not sure about what you want to do. You should try to sort the RDD by (yourKey, date); it ensures that all the keys are in the same partition. Your problem after that is that you want to aggregate only on yourKey, and if you change the Key of the sorted RDD you lose the partitioning. Depending on
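
A sketch of the secondary-sort pattern being described (types and names are hypothetical, not from the original message): key by a composite (key, date) but partition on the key alone, so each partition holds whole key-groups sorted by date.

    import org.apache.spark.{HashPartitioner, Partitioner}
    class KeyPartitioner(n: Int) extends Partitioner {
      private val hash = new HashPartitioner(n)
      def numPartitions: Int = n
      // Route records by the first element of the composite key only.
      def getPartition(composite: Any): Int =
        hash.getPartition(composite.asInstanceOf[(String, String)]._1)
    }
    val sorted = records                       // records: RDD[((String, String), Long)], hypothetical
      .repartitionAndSortWithinPartitions(new KeyPartitioner(100))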

Change protobuf version or any other third party library version in Spark application

2015-09-14 Thread Lan Jiang
Hi, there, I am using Spark 1.4.1. Protobuf 2.5 is included by Spark 1.4.1 by default. However, I would like to use Protobuf 3 in my spark application so that I can use some new features such as Map support. Is there any way to do that? Right now if I build a uber.jar with dependencies
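
Two settings that may help here, as a sketch (both were marked experimental around Spark 1.4, and protobuf conflicts often still require shading the dependency instead):

    val conf = new SparkConf()
      .set("spark.executor.userClassPathFirst", "true")  // prefer the application's classes on executors
      .set("spark.driver.userClassPathFirst", "true")    // and on the driver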

Spark aggregateByKey Issues

2015-09-14 Thread 毕岩
Hi: There is a case of using a reduce operation like this: I need to reduce large data made up of billions of records with a Key-Value pair. First, group by Key, and the records with the same Key need to be in order of one field called "date" in the Value.

Re: unoin streams not working for streams > 3

2015-09-14 Thread Василец Дмитрий
I use local[*]. And I have 4 cores on my laptop. On 14 Sep 2015 23:19, "Gerard Maas" wrote: > How many cores are you assigning to your spark streaming job? > > On Mon, Sep 14, 2015 at 10:33 PM, Василец Дмитрий < > pronix.serv...@gmail.com> wrote: > >> hello >> I have 4
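
If these are receiver-based streams, note that each receiver permanently occupies one core, so 4 receivers on a 4-core local[*] leave no cores for processing, which would explain the silent hang. A sketch of the usual fixes (the stream list is hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    // Give the app more threads than receivers...
    val conf = new SparkConf().setMaster("local[8]").setAppName("app")
    val ssc = new StreamingContext(conf, Seconds(10))
    // ...and/or union the receiver streams so downstream processing sees one DStream.
    val unioned = ssc.union(kafkaStreams)  // kafkaStreams: Seq[DStream[(String, String)]]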

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
Lewis, Many pipeline stages implement save/load methods, which can be used if you instantiate and call the underlying pipeline stages' `transform` methods individually (instead of using the Pipeline.setStages API). See the associated JIRAs. Pipeline

RE: Best way to merge final output part files created by Spark job

2015-09-14 Thread java8964
For text files, this merge works fine, but for binary formats like "ORC", "Parquet" or "Avro", I am not sure this will work. These kinds of formats in fact are not append-able, as they write the detailed data information either in the head or the tail part of the file. You have to use the format specified

Spark Streaming Suggestion

2015-09-14 Thread srungarapu vamsi
I am pretty new to spark. Please suggest a better model for the following use case. I have a few (about 1500) devices in the field which keep emitting about 100KB of data every minute. The nature of the data sent by the devices is just a list of numbers. As of now, we have Storm in the architecture

Setting Executor memory

2015-09-14 Thread Thomas Gerber
Hello, I was looking for guidelines on what value to set executor memory to (via spark.executor.memory for example). This seems to be important to avoid OOM during tasks, especially in no-swap environments (like AWS EMR clusters). This setting is really about the executor JVM heap. Hence, in

Re: Spark aggregateByKey Issues

2015-09-14 Thread biyan900116
Hi Alexis: Thank you for your reply. My case is that each operation on one record needs to depend on a value that is set by the operation on the previous record. So your advice is that I can use "sortByKey". "sortByKey" will put all records with the same Key in one partition. Need I

why spark and kafka always crash

2015-09-14 Thread Joanne Contact
How to prevent it?

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
You can persist the transformed DataFrames, for example:

    val data: DataFrame = ...
    val hashedData = hashingTF.transform(data)
    hashedData.cache() // to cache the DataFrame in memory

Future usage of hashedData reads from the in-memory cache now. You can also persist to disk, eg:
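
The truncated disk example presumably continued with a StorageLevel; a sketch of that form:

    import org.apache.spark.storage.StorageLevel
    hashedData.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions that don't fit in memory to disk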

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hey Feynman, Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for. What I need to cache and reuse are the intermediate outputs of transformations, not the transformers themselves. Do you know of any related dev activities or plans? Best, Lewis

Re: hdfs-ha on mesos - odd bug

2015-09-14 Thread Sam Bessalah
I don't know about the broken url. But are you running HDFS as a mesos framework? If so, is it using mesos-dns? Then you should resolve the namenode via hdfs:/// On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett wrote: > I'm hitting an odd issue with running spark on

How to clear Kafka offset in Spark streaming?

2015-09-14 Thread Bin Wang
Hi, I'm using spark streaming with kafka and I need to clear the offsets and re-compute everything. I deleted the checkpoint directory in HDFS and reset the kafka offset with "kafka-run-class kafka.tools.ImportZkOffsets". I can confirm the offset is set to 0 in kafka: ~ > kafka-run-class
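
If this is the direct stream API, note that it does not consult the ZooKeeper offsets that ImportZkOffsets writes; with the checkpoint deleted it falls back to auto.offset.reset. A sketch (broker and topic names are hypothetical):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",   // hypothetical broker
      "auto.offset.reset" -> "smallest")          // start from the earliest available offset
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))           // hypothetical topic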

Interacting with Different Versions of Hive Metastore, how to config?

2015-09-14 Thread bg_spark
spark.sql.hive.metastore.version (default: 0.13.1): Version of the Hive metastore. Available options are 0.12.0 through 1.2.1. spark.sql.hive.metastore.jars (default: builtin): Location of the jars that should be used to instantiate the HiveMetastoreClient. This property can be one of three options:

Re: Best way to merge final output part files created by Spark job

2015-09-14 Thread Gaspar Muñoz
Hi, check out the FileUtil.copyMerge function in the Hadoop API.
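
A sketch of that call (paths are illustrative; copyMerge exists in Hadoop 1.x/2.x but was removed in Hadoop 3):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    // Concatenates all part files under the source directory into one destination file.
    FileUtil.copyMerge(fs, new Path("/out/parts"), fs, new Path("/out/merged.txt"),
                       false, conf, null)  // false = keep the sources; null = no separator string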

Parse tab seperated file inc json efficent

2015-09-14 Thread matthes
I am trying to parse a tab-separated file with a JSON section in Spark 1.5 as efficiently as possible. The file looks as follows: value1<TAB>value2<TAB>{json} How can I parse all fields including the JSON fields into an RDD directly? If I use this piece of code: val jsonCol = sc.textFile("/data/input").map(l =>
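
One sketch for 1.5 (assuming exactly two plain columns before the JSON, as in the sample line):

    // Split on the first two tabs only, so any tabs inside the JSON are preserved.
    val parts = sc.textFile("/data/input").map(_.split("\t", 3))
    parts.cache()
    val jsonDf = sqlContext.read.json(parts.map(_(2)))   // Spark 1.5 can read JSON from an RDD[String]
    val plainCols = parts.map(a => (a(0), a(1)))         // the non-JSON columns, to join back if needed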

Trouble using dynamic allocation and shuffle service.

2015-09-14 Thread Philip Weaver
Hello, I am trying to use dynamic allocation which requires the shuffle service. I am running Spark on mesos. Whenever I set spark.shuffle.service.enabled=true, my Spark driver fails with an error like this: Caused by: java.net.ConnectException: Connection refused: devspark1/ 172.26.21.70:7337

Re: Trouble using dynamic allocation and shuffle service.

2015-09-14 Thread Tim Chen
Hi Philip, I've included documentation in the Spark/Mesos doc ( http://spark.apache.org/docs/latest/running-on-mesos.html), where you can start the MesosShuffleService with sbin/start-mesos-shuffle-service.sh script. The shuffle service needs to be started manually for Mesos on each slave (one

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-14 Thread Ricardo Paiva
Hi Cody, Thanks for your answer. I had already tried to change the spark submit parameters, but I double checked to reply your answer. Even changing properties file or directly on the spark-submit arguments, none of them work when the application runs from the checkpoint. It seems that

Re: JavaRDD using Reflection

2015-09-14 Thread Ajay Singal
Hello Rachana, The easiest way would be to start with creating a 'parent' JavaRDD and run different filters (based on different input arguments) to create respective 'child' JavaRDDs dynamically. Notice that the creation of these children RDDs is handled by the application driver. Hope this

Re: Trouble using dynamic allocation and shuffle service.

2015-09-14 Thread Philip Weaver
Ah, I missed that, thanks! On Mon, Sep 14, 2015 at 11:45 AM, Tim Chen wrote: > Hi Philip, > > I've included documentation in the Spark/Mesos doc ( > http://spark.apache.org/docs/latest/running-on-mesos.html), where you can > start the MesosShuffleService with

Spark Streaming application code change and stateful transformations

2015-09-14 Thread Ofir Kerker
Hi, My Spark Streaming application consumes messages (events) from Kafka every 10 seconds using the direct stream approach and aggregates these messages into hourly aggregations (to answer analytics questions like: "How many users from Paris visited page X between 8PM to 9PM") and save the data to

Re: Spark Streaming application code change and stateful transformations

2015-09-14 Thread Cody Koeninger
Solution 2 sounds better to me. You aren't always going to have graceful shutdowns. On Mon, Sep 14, 2015 at 1:49 PM, Ofir Kerker wrote: > Hi, > My Spark Streaming application consumes messages (events) from Kafka every > 10 seconds using the direct stream approach and

Re: Is there any Spark SQL reference manual?

2015-09-14 Thread vivek bhaskar
Thanks Richard, Ted. Hope we have some reference available soon. Peymen, I had a look at this link before but was looking for something with broader coverage. PS: Richard, kindly advise me on generating a BNF description of the grammar via the Derby build script. Since this may not be of

Re: Spark Streaming..Exception

2015-09-14 Thread Priya Ch
Hi All, I came across the related old conversation on the above issue (https://issues.apache.org/jira/browse/SPARK-5594). Is the issue fixed? I tried different values for spark.cleaner.ttl -> 0sec, -1sec, 2000sec; none of them worked. I also tried setting spark.streaming.unpersist -> true.

DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread petranidis
Hi all, I am new to spark and I have written a few spark programs, mostly around machine learning applications. I am trying to resolve a particular problem where there are two RDDs that should be updated by using elements of each other. More specifically, if the two pair RDDs are called A and B

Re: DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread Petros Nyfantis
Hi Sean, thanks a lot for your reply. Yes, I understand that as Scala is a functional language, maps correspond to transformations of immutable objects, but the behavior of the program seems like a deadlock, as it simply does not proceed beyond the B = B.map(A.aggregate) stage. My Spark web interface