Re: application failed on large dataset

2015-09-15 Thread 周千昊
Hi, after checking the yarn logs, all the error stacks look like the one below: 15/09/15 19:58:23 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Koert Kuipers
fair enough i was trying to say that if esper were obsolete (which i am not suggesting) then that does NOT mean espertech is dead... On Tue, Sep 15, 2015 at 10:20 PM, Thomas Bernhardt < bernhardt...@yahoo.com.invalid> wrote: > Let me say first, I'm the Esper project lead. > Esper is alive and

GraphX, graph clustering, pattern matching

2015-09-15 Thread Alex Karargyris
I am new to Spark and I was wondering if anyone would help me by pointing me in the right direction: Are there any algorithms/tutorials available on Spark's GraphX for graph clustering and pattern matching? More specifically I am interested in: a) querying a small graph against a larger graph and

Re: Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-15 Thread Alexis Gillain
You can try System.gc(), considering that checkpointing is enabled by default in graphx: https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html 2015-09-15 22:42 GMT+08:00 Ali Hadian : > Hi! > We are executing the PageRank
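A minimal sketch of the lineage-truncation side of this, for a PageRank-style loop; the input path, checkpoint directory, and 10-iteration interval are illustrative assumptions, not from the thread:

  import org.apache.spark.rdd.RDD
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // illustrative path
  val links: RDD[(String, Seq[String])] =
    sc.textFile("hdfs:///data/edges.txt")                // illustrative input
      .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
      .groupByKey().mapValues(_.toSeq).cache()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 20) {
    val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
      urls.map(u => (u, rank / urls.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    if (i % 10 == 0) {     // periodically cut the lineage
      ranks.checkpoint()
      ranks.count()        // materialize so the checkpoint actually happens
    }
  }

Checkpointing replaces the ever-growing lineage with a file on stable storage, which is what lets the old shuffle/tmp data become eligible for cleanup.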

Re: How to convert dataframe to a nested StructType schema

2015-09-15 Thread Terry Hole
Hao, For spark 1.4.1, you can try this: val rowrdd = df.rdd.map(r => Row(Row(r(3)), Row(r(0), r(1), r(2)))) val newDF = sqlContext.createDataFrame(rowrdd, yourNewSchema) Thanks! - Terry On Wed, Sep 16, 2015 at 2:10 AM, Hao Wang wrote: > Hi, > > I created a dataframe
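A slightly fuller sketch of this approach; the nested field names and the grouping of the four columns are illustrative assumptions, not from the thread:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._
  // Illustrative nested schema over the four string columns
  val nestedSchema = StructType(Seq(
    StructField("zip", StructType(Seq(
      StructField("zipcode", StringType)))),
    StructField("location", StructType(Seq(
      StructField("city", StringType),
      StructField("state", StringType),
      StructField("country", StringType))))))
  val rowrdd = df.rdd.map(r => Row(Row(r(3)), Row(r(0), r(1), r(2))))
  val newDF = sqlContext.createDataFrame(rowrdd, nestedSchema)
  newDF.printSchema()   // shows the two nested struct fields

The key point is that the Row nesting produced by the map must line up exactly, field by field, with the StructType nesting.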

Re: Avoiding SQL Injection in Spark SQL

2015-09-15 Thread V Dineshkumar
Hi, I was looking for the support of bind variables as Ruslan pointed out. I came up with a different workaround, as we cannot use dataframes in our project; we are more dependent on using SQL queries. val HC=new HiveContext(sc) val query=HC.sql("select * from eici_view where

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
If you're doing hyperparameter grid search, consider using ml.tuning.CrossValidator, which does cache the dataset. Otherwise, perhaps you can elaborate more on your particular
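A sketch of the grid-search route, assuming a simple logistic-regression pipeline; the estimator, grid values, and trainingDF are illustrative:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
  val lr = new LogisticRegression()
  val pipeline = new Pipeline().setStages(Array(lr))
  val grid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
    .build()
  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(grid)
    .setNumFolds(3)
  val cvModel = cv.fit(trainingDF)   // trainingDF: a DataFrame with label/features

Because CrossValidator evaluates the whole parameter grid internally, the per-fold datasets are reused rather than recomputed for every hyperparameter combination.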

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Jingchu Liu
Yeah I understand that on the low level we should do as you said. But since ML pipeline is a high-level API, it is pretty natural to expect the ability to recognize overlapping parameters between successive runs. (Actually, this happens A LOT when we have lots of hyper-params to search for) I can also

Difference between sparkDriver and "executor ID driver"

2015-09-15 Thread Muler
I'm running Spark in local mode and getting these two log messages that appear to be similar. I want to understand what each is doing: 1. [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 60782. 2. [main] executor.Executor

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Koert Kuipers
obsolete is not the same as dead... we have a few very large tech companies to prove that point On Tue, Sep 15, 2015 at 4:32 PM, Bertrand Dechoux wrote: > The big question would be what features of Esper you are using. Esper is a > CEP solution. I doubt that Spark Streaming

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Thomas Bernhardt
Let me say first, I'm the Esper project lead. Esper is alive and well and not at all obsolete. Esper provides event series analysis through an SQL92-standards event processing language (EPL). It lets you express situation detection logic very concisely, usually much more concisely than any

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Marcelo Vanzin
On Mon, Sep 14, 2015 at 6:55 AM, Adrian Bridgett wrote: > 15/09/14 13:00:25 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 10.1.200.245): java.lang.IllegalArgumentException: > java.net.UnknownHostException: nameservice1 > at >

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
I am happy to report that after setting spark.driver.userClassPathFirst, I can use protobuf 3 with spark-shell. Looks like the classloading issue is in the driver, not the executor. Marcelo, thank you very much for the tip! Lan > On Sep 15, 2015, at 1:40 PM, Marcelo Vanzin wrote:

[ANNOUNCE] Apache Gora 0.6.1 Release

2015-09-15 Thread lewis john mcgibbney
Hi All, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.6.1. What is Gora? Gora is a framework which provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs,

How to speed up MLlib LDA?

2015-09-15 Thread Marko Asplund
Hi, I'm trying out MLlib LDA training with 100 topics, 105 K vocabulary size and ~3.4 M documents using EMLDAOptimizer. Training the model took ~2.5 hours with MLlib, whereas Vowpal Wabbit training with the same data on the same system took ~5 minutes. I realize that there are

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Akhil Das
As of now i think it's a no. Not sure if it's a naive approach, but yes, you can have a separate program keep an eye on the webui (possibly parsing the content) and have it trigger the kill task/job once it detects a lag. (Again you will have to figure out the correct numbers before killing any

Re: Spark Streaming Suggestion

2015-09-15 Thread srungarapu vamsi
The batch approach i had implemented takes about 10 minutes to complete all the pre-computation tasks for one hour's worth of data. When i went through my code, i figured out that most of the time-consuming tasks are the ones which read data from cassandra and the places where i perform

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Adrian Bridgett
Hi Sam, in short, no, it's a traditional install as we plan to use spot instances and didn't want price spikes to kill off HDFS. We're actually doing a bit of a hybrid, using spot instances for the mesos slaves, ondemand for the mesos masters. So for the time being, putting hdfs on the

Re: why spark and kafka always crash

2015-09-15 Thread Akhil Das
Can you be more precise? Thanks Best Regards On Tue, Sep 15, 2015 at 11:28 AM, Joanne Contact wrote: > How to prevent it? > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Marcelo Vanzin
Hi, Just "spark.executor.userClassPathFirst" is not enough. You should also set "spark.driver.userClassPathFirst". Also not that I don't think this was really tested with the shell, but that should work with regular apps started using spark-submit. If that doesn't work, I'd recommend shading, as

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Dmitry Goldenberg
Thanks, Mark, will look into that... On Tue, Sep 15, 2015 at 12:33 PM, Mark Hamstra wrote: > There is the Async API ( > https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala), > which makes use of FutureAction

Re: Spark ANN

2015-09-15 Thread Ruslan Dautkhanov
Thank you Alexander. Sounds like quite a lot of good and exciting changes slated for Spark's ANN. Looking forward to it. -- Ruslan Dautkhanov On Wed, Sep 9, 2015 at 7:10 PM, Ulanov, Alexander wrote: > Thank you, Feynman, this is helpful. The paper that I linked

How to convert dataframe to a nested StructType schema

2015-09-15 Thread Hao Wang
Hi, I created a dataframe with 4 string columns (city, state, country, zipcode). I then applied the following nested schema to it by creating a custom StructType. When I run df.take(5), it gives the exception below as expected. The question is how I can convert the Rows in the dataframe to

Managing scheduling delay in Spark Streaming

2015-09-15 Thread Michal Čizmazia
Hi, I have a Reliable Custom Receiver storing messages into Spark. Is there a way to prevent my receiver from storing more messages into Spark when the Scheduling Delay reaches a certain threshold? Possible approaches: #1 Does Spark block on the Receiver.store(messages) call to prevent storing
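A sketch of the related knobs available as of Spark 1.5; the numeric value is illustrative:

  import org.apache.spark.SparkConf
  val conf = new SparkConf()
    .set("spark.streaming.receiver.maxRate", "1000")      // cap records/sec per receiver
    .set("spark.streaming.backpressure.enabled", "true")  // 1.5+: adapt rate dynamically

Neither keys off scheduling delay directly, but the backpressure mechanism adjusts the ingest rate based on how batches are keeping up, which indirectly bounds the delay.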

DStream flatMap "swallows" records

2015-09-15 Thread Jeffrey Jedele
Hi all, I've got a problem with Spark Streaming (both 1.3.1 and 1.5). Following situation: There is a component which gets a DStream of URLs. For each of these URLs, it should access it, retrieve several data elements and pass those on for further processing. The relevant code looks like this:

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Steve Loughran
On 15 Sep 2015, at 05:47, Lan Jiang wrote: Hi there, I am using Spark 1.4.1. Protobuf 2.5 is included by Spark 1.4.1 by default. However, I would like to use Protobuf 3 in my spark application so that I can use some new features such as Map

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Steve Loughran
> On 15 Sep 2015, at 08:55, Adrian Bridgett wrote: > > Hi Sam, in short, no, it's a traditional install as we plan to use spot > instances and didn't want price spikes to kill off HDFS. > > We're actually doing a bit of a hybrid, using spot instances for the mesos >

Re: Spark aggregateByKey Issues

2015-09-15 Thread biyan900116
Hi Alexis: Of course, it's very useful to me, especially about the operations after the sort is done. And, i still have one question: How do I set a decent number of partitions, if it need not be equal to the number of keys? > On Sep 15, 2015, at 3:41 PM, Alexis Gillain

Relational Log Data

2015-09-15 Thread 328d95
I am trying to read logs which have many irrelevant lines and whose lines are related by a thread number in each line. Firstly, if I read from a text file using the textFile function and then call multiple filter functions on that file, will Spark apply all of the filters in one read pass? Eg

Re: Relational Log Data

2015-09-15 Thread ayan guha
Spark functions are lazy, so none of them actually do anything until an action is encountered. And no, your code will NOT read the file multiple times. On Tue, Sep 15, 2015 at 7:33 PM, 328d95 <20500...@student.uwa.edu.au> wrote: > I am trying to read logs which have many irrelevant lines and
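A small sketch of why that holds; the file name and predicates are illustrative:

  val lines  = sc.textFile("hdfs:///logs/app.log")     // lazy: nothing is read yet
  val thread = lines.filter(_.contains("Thread-42"))   // lazy
  val errors = thread.filter(_.contains("ERROR"))      // lazy
  errors.count()   // one action, one scan: both filters run per line in a single pass

One caveat: a second action on lines or errors would scan the file again unless you call cache() on the RDD you reuse.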

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
Nope, and that's intentional. There is no guarantee that rawData did not change between intermediate calls to searchRun, so reusing a cached data1 would be incorrect. If you want data1 to be cached between multiple runs, you have a few options: * cache it first and pass it in as an argument to
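A sketch of the first option; searchRun, the hyperparameter values, and the column names are illustrative stand-ins for the original poster's code:

  val data1 = rawData.select("features", "label").cache()   // computed at most once
  val results = Seq(0.01, 0.1, 1.0).map { regParam =>
    searchRun(data1, regParam)   // every run reuses the cached data1
  }
  data1.unpersist()

Since the caller now owns data1, it also carries the responsibility of knowing that rawData hasn't changed underneath it.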

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
I used --conf spark.files.userClassPathFirst=true in the spark-shell options, and it still gave me the error: java.lang.NoSuchFieldError: unknownFields if I use protobuf 3. The output says spark.files.userClassPathFirst is deprecated and suggests using spark.executor.userClassPathFirst. I tried

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Mark Hamstra
There is the Async API ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala), which makes use of FutureAction ( https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala). You could
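A sketch of using the async API to bound a laggard job; the timeout is illustrative:

  import scala.concurrent.Await
  import scala.concurrent.duration._
  import java.util.concurrent.TimeoutException
  val action = rdd.countAsync()               // FutureAction[Long] instead of a blocking count()
  try {
    val n = Await.result(action, 10.minutes)  // wait, but only this long
    println(s"count = $n")
  } catch {
    case _: TimeoutException => action.cancel()   // kill the job's running stages
  }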

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Guru Medasani
Hi Lan, Reading the pull request below, it looks like you should be able to apply the config to both drivers and executors. I would give it a try with the spark-shell in YARN client mode. https://github.com/apache/spark/pull/3233 Yarn's config option

Getting parent RDD

2015-09-15 Thread Samya
Hi Team, I have the below situation: val ssc = ... val msgStream = ... //SparkKafkaDirectAPI val wordCountPair = TransformStream.transform(msgStream) wordCountPair.foreachRDD(rdd => try{ //Some action that causes exception }catch { case ex1 : Exception => {

Re: How to speed up MLlib LDA?

2015-09-15 Thread Feynman Liang
Hi Marko, I haven't looked into your case in much detail but one immediate thought is: have you tried the OnlineLDAOptimizer? Its implementation and resulting LDA model (LocalLDAModel) are quite different (it doesn't depend on GraphX and assumes the model fits on a single machine) so you may see
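A sketch of switching optimizers, assuming corpus is the usual RDD[(Long, Vector)] of (docId, termCounts) pairs; k matches the thread, the other values are illustrative:

  import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
  val lda = new LDA()
    .setK(100)
    .setMaxIterations(50)   // illustrative
    .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
  val model = lda.run(corpus)   // with the online optimizer this is a LocalLDAModel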

Re: application failed on large dataset

2015-09-15 Thread 周千昊
has anyone met the same problems? 周千昊 wrote on Mon, Sep 14, 2015 at 9:07 PM: > Hi, community > I am facing a strange problem: > all executors do not respond, and then all of them fail with the > ExecutorLostFailure. > when I look into yarn logs, they are full of such

RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
If you use Standalone mode, just start spark-shell like the following: spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true Yong Date: Tue, 15 Sep 2015 09:33:40 -0500 Subject: Re: Change protobuf version or any other third party library version in Spark application From:

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Iulian Dragoș
I've seen similar traces, but couldn't track down the failure completely. You are using Kerberos for your HDFS cluster, right? AFAIK Kerberos isn't supported in Mesos deployments. Can you resolve that host name (nameservice1) from the driver machine (ping nameservice1)? Can it be resolved from

Re: Random Forest MLlib

2015-09-15 Thread Yasemin Kaya
Hi Maximo, Is there a way to get precision and recall from the pipeline? In the MLlib version I get precision and recall metrics from MulticlassMetrics, but the ML pipeline gives only testErr. Thanks yasemin 2015-09-10 17:47 GMT+03:00 Yasemin Kaya : > Hi Maximo, > Thanks a lot.. > Hi

How does driver memory utilized

2015-09-15 Thread Renu
Hi, I have a query regarding driver memory: what are the tasks in which driver memory is used? Please help.

Re: Directly reading data from S3 to EC2 with PySpark

2015-09-15 Thread Cazen
Good day junHyeok, Did you set HADOOP_CONF_DIR? It seems that spark cannot find the AWS key properties. If it doesn't work after setting that, how about exporting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before running the py-spark shell? BR

Re: Spark aggregateByKey Issues

2015-09-15 Thread Alexis Gillain
That's the tricky part. If the volume of data is always the same, you can test and learn one. If the volume of data can vary, you can use the number of records in your file divided by the number of records you think can fit in memory. Anyway, the distribution of your records can still impact the
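A rough sketch of that heuristic; the per-partition capacity is exactly the number you have to estimate and tune:

  val totalRecords = rdd.count()
  val recordsPerPartition = 500000L   // illustrative guess at what fits in memory
  val numPartitions = math.max(1, (totalRecords / recordsPerPartition).toInt)
  val balanced = rdd.repartition(numPartitions)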

Re: mappartition's FlatMapFunction help

2015-09-15 Thread Ankur Srivastava
Hi, The signatures are perfect. I also tried same code on eclipse and for some reason eclipse did not import java.util.Iterator. Once I imported it, it is fine. Might be same issue with NetBeans. Thanks Ankur On Tue, Sep 15, 2015 at 10:11 AM, dinizthiagobr wrote: >

mappartition's FlatMapFunction help

2015-09-15 Thread dinizthiagobr
Can't get this one to work and I have no idea why. JavaPairRDD<String, Iterable<String>> lel = gen.groupByKey(); JavaRDD<String> partitions = lel.mapPartitions( new FlatMapFunction<Iterator<Tuple2<String, Iterable<String>>>, String>() { public Iterable<String> call(Iterator<Tuple2<String, Iterable<String>>>

RE: application failed on large dataset

2015-09-15 Thread java8964
When you saw this error, does any executor die due to whatever error? Do you check to see if any executor restarts during your job? It is hard to help you just with the stack trace. You need to tell us the whole picture when your jobs are running. Yong From: qhz...@apache.org Date: Tue, 15 Sep

Re: Dynamic Workflow Execution using Spark

2015-09-15 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtUz0cyiPjYX On Tue, Sep 15, 2015 at 1:19 PM, Ashish Soni wrote: > Hi All , > > Are there any framework which can be used to execute workflows with in > spark or Is it possible to use ML Pipeline for workflow execution but

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Bertrand Dechoux
The big question would be what features of Esper you are using. Esper is a CEP solution. I doubt that Spark Streaming can do everything Esper does without any development. Spark (Streaming) is more of a general-purpose platform. http://www.espertech.com/products/esper.php But I would be glad to be

Dynamic Workflow Execution using Spark

2015-09-15 Thread Ashish Soni
Hi All, Are there any frameworks that can be used to execute workflows within spark? Or is it possible to use ML Pipeline for workflow execution while not doing ML? Thanks, Ashish

Re: Spark Streaming Suggestion

2015-09-15 Thread ayan guha
I think you need to make up your mind about storm vs spark. Using both in this context does not make much sense to me. On 15 Sep 2015 22:54, "David Morales" wrote: > Hi there, > > This is exactly our goal in Stratio Sparkta, a real-time aggregation > engine fully developed

Using ML KMeans without hardcoded feature vector creation

2015-09-15 Thread Tóth Zoltán
Hi, I'm wondering if there is a concise way to run ML KMeans on a DataFrame if I have the features in multiple numeric columns. I.e. as in the Iris dataset: (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1) I'd like to use KMeans without recreating the DataSet
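One concise route is VectorAssembler, sketched here with the Iris column names from the question; the k value and irisDF name are illustrative:

  import org.apache.spark.ml.feature.VectorAssembler
  val assembler = new VectorAssembler()
    .setInputCols(Array("a1", "a2", "a3", "a4"))
    .setOutputCol("features")
  val withFeatures = assembler.transform(irisDF)
  // From Spark 1.5 on, the ML KMeans can consume that column directly:
  import org.apache.spark.ml.clustering.KMeans
  val model = new KMeans().setK(3).setFeaturesCol("features").fit(withFeatures)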

Re: union streams not working for streams > 3

2015-09-15 Thread Cody Koeninger
I assume you're using the receiver based stream (createStream) rather than createDirectStream? Receivers each get scheduled as if they occupy a core, so you need at least one more core than number of receivers if you want to get any work done. Try using the direct stream if you can't combine
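A sketch of the direct stream covering several topics at once, so no union is needed; broker and topic names are illustrative:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val topics = Set("topicA", "topicB", "topicC", "topicD")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)   // no receivers, so no cores are pinned by them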

Re: union streams not working for streams > 3

2015-09-15 Thread Василец Дмитрий
thanks. I will try. On Tue, Sep 15, 2015 at 4:19 PM, Cody Koeninger wrote: > I assume you're using the receiver based stream (createStream) rather than > createDirectStream? > > Receivers each get scheduled as if they occupy a core, so you need at > least one more core than

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
Steve, Thanks for the input. You are absolutely right. When I use protobuf 2.6.1, I also ran into method-not-defined errors. You suggest using the Maven shading strategy, but I have already built the uber jar to package all my custom classes and their dependencies, including protobuf 3. The problem is

spark performance - executor computing time

2015-09-15 Thread patcharee
Hi, I was running a job (on Spark 1.5 + Yarn + java 8). In a stage that does lookup (org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:873)) there was an executor whose computing time was > 6 times the median. This executor had almost the same shuffle read size and

How does driver memory utilized

2015-09-15 Thread Renu Yadav
Hi, I have a query regarding driver memory: what are the tasks in which driver memory is used? Please help.

Re: Worker Machine running out of disk for Long running Streaming process

2015-09-15 Thread gaurav sharma
Hi TD, Sorry for the late reply. I implemented your suggestion, but unfortunately it didn't help me; i am still able to see very old shuffle files, because of which my long running spark job ultimately gets terminated. Below is what i did. //This is the spark-submit job public class

Re: Directly reading data from S3 to EC2 with PySpark

2015-09-15 Thread ayan guha
Also you can set hadoop conf through the jsc.hadoopConf property. Do a dir(sc) to see the exact property name. On 15 Sep 2015 22:43, "Gourav Sengupta" wrote: > Hi, > > If you start your EC2 nodes with correct roles (default in most cases > depending on your needs) you should

Prevent spark from serializing some objects

2015-09-15 Thread lev
Hello, As I understand it, using the method bar will result in serializing the Foo instance to the cluster: class Foo() { val x = 5 def bar(rdd: RDD[Int]): RDD[Int] = { rdd.map(_*x) } } and since the Foo instance might be very big, it might cause a performance hit. I
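A common workaround, sketched: copy the field into a local val so the closure captures just the value instead of the enclosing instance (this mirrors the Foo from the question):

  import org.apache.spark.rdd.RDD
  class Foo {
    val x = 5
    def bar(rdd: RDD[Int]): RDD[Int] = {
      val localX = x        // captured by value; Foo itself stays out of the closure
      rdd.map(_ * localX)
    }
  }

Alternatively, marking the enclosing class Serializable and the heavy fields @transient achieves a similar effect when the method can't be restructured.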

Re: Directly reading data from S3 to EC2 with PySpark

2015-09-15 Thread Gourav Sengupta
Hi, If you start your EC2 nodes with correct roles (default in most cases depending on your needs) you should be able to work on S3 and all other AWS resources without giving any keys. I have been doing that for some time now and I have not faced any issues yet. Regards, Gourav On Tue, Sep

RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
It is a bad idea to change the major version of protobuf, as it most likely won't work. But if you really want to give it a try, set "user classpath first", so the protobuf 3 coming with your jar will be used. The setting depends on your deployment mode; check this for the parameter:

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Adrian Bridgett
Thanks Steve - we are already taking the safe route - putting NN and datanodes on the central mesos-masters which are on demand. Later (much later!) we _may_ put some datanodes on spot instances (and using several spot instance types as the spikes seem to only affect one type - worst case we

Re: How to speed up MLlib LDA?

2015-09-15 Thread Marko Asplund
While doing some more testing I noticed that loading the persisted model from disk (~2 minutes) as well as querying LDA model topic distributions (~4 seconds for one document) are quite slow operations. Our application is querying LDA model topic distribution (for one doc at a time) as part of

Re: Spark Streaming Suggestion

2015-09-15 Thread David Morales
Hi there, This is exactly our goal in Stratio Sparkta, a real-time aggregation engine fully developed with spark streaming (and fully open source). Take a look at: - the docs: http://docs.stratio.com/modules/sparkta/development/ - the repository: https://github.com/Stratio/sparkta -

Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-15 Thread Ali Hadian
Hi! We are executing the PageRank example from the Spark java examples package on a very large input graph. The code is available here (Spark's github repo). During the execution, the framework generates a huge amount of intermediate data per iteration (i.e. the contribs RDD). The