Re: yarn-cluster mode throwing NullPointerException

2015-10-12 Thread Venkatakrishnan Sowrirajan
Hi Rachana, are you by any chance doing something like this in your code? "sparkConf.setMaster("yarn-cluster");" Setting the master to "yarn-cluster" programmatically via setMaster is not supported. I think you are hitting this bug --> https://issues.apache.org/jira/browse/SPARK-7504. This was fixed in Spark 1.4.0,
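A minimal sketch of the workaround being suggested, assuming the master is supplied on the spark-submit command line rather than hard-coded (the class, app and jar names below are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object YarnClusterApp {
      def main(args: Array[String]): Unit = {
        // No setMaster("yarn-cluster") here; the master comes from spark-submit.
        val conf = new SparkConf().setAppName("YarnClusterApp")
        val sc = new SparkContext(conf)
        sc.parallelize(1 to 10).count()
        sc.stop()
      }
    }
    // Submitted with something like:
    //   spark-submit --master yarn-cluster --class YarnClusterApp app.jar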

RE: Streaming Application Unable to get Stream from Kafka

2015-10-12 Thread Prateek .
Hi Terry, Thanks a lot. It was a resource problem; Spark was able to get only one thread. It’s working fine now with local[*]. Cheers, Prateek From: Terry Hoo [mailto:hujie.ea...@gmail.com] Sent: Saturday, October 10, 2015 9:51 AM To: Prateek . Cc:

Define new stage in pipeline

2015-10-12 Thread Nethaji Chandrasiri
Hi, Are there any samples to learn how to define a new stage in a pipeline, like HashingTF, using Java? Thanks -- *Nethaji Chandrasiri* *Software Engineering* *Intern; WSO2, Inc.; http://wso2.com * Mobile : +94 (0) 779171059 Email :

Re: Spark handling parallel requests

2015-10-12 Thread Akhil Das
Instead of pushing your requests to the socket, why don't you push them to a Kafka or any other message queue and use spark streaming to process them? Thanks Best Regards On Mon, Oct 5, 2015 at 6:46 PM, wrote: > Hi , > i am using Scala , doing a socket
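A minimal sketch of the suggested approach, assuming the spark-streaming-kafka_2.10 artifact is on the classpath; the broker address, topic name, and processing logic are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object RequestProcessor {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RequestProcessor")
        val ssc = new StreamingContext(conf, Seconds(2))

        // Requests are pushed to a Kafka topic instead of a raw socket.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val requests = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("requests"))

        // Each message value is one request; replace println with real handling.
        requests.map(_._2).foreachRDD(rdd => rdd.foreach(println))

        ssc.start()
        ssc.awaitTermination()
      }
    }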

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-12 Thread Xiao Li
Hi Lloyd, Are both runs cold, or warm? Memory/cache hits and misses could be a big factor if your application is IO intensive. You need to monitor your system to understand what your bottleneck is. Good luck, Xiao Li

Spark retrying task indefinitely

2015-10-12 Thread Amit Singh Hora
I am running spark locally to understand how countByValueAndWindow works val Array(brokers, topics) = Array("192.XX.X.XX:9092", "test1") // Create context with 2 second batch interval val sparkConf = new
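A minimal sketch of countByValueAndWindow, assuming a local run; a socket source stands in for the Kafka source in the original post, and the window sizes are arbitrary:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CountByValueAndWindowDemo {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("CountByValueAndWindowDemo").setMaster("local[2]")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        // Windowed counting keeps inverse-reduce state, so a checkpoint directory is required.
        ssc.checkpoint("/tmp/checkpoint")

        val lines = ssc.socketTextStream("localhost", 9999)
        // Count occurrences of each distinct line over a 10 second window, sliding every 2 seconds.
        val counts = lines.countByValueAndWindow(Seconds(10), Seconds(2))
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }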

Re: Configuring Spark for reduceByKey on on massive data sets

2015-10-12 Thread hotdog
Hi Daniel, did you solve your problem? I met the same problem when running massive data through reduceByKey on YARN. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p25023.html Sent from the

Re: Best practices to call small spark jobs as part of REST api

2015-10-12 Thread Xiao Li
The design mainly depends on your use cases. You have to think about the requirements and rank them. For example, if your application cares about response time and can tolerate reading stale data, using a NoSQL database as a middleware is a good option. Good Luck, Xiao Li 2015-10-11 21:00

Re: Spark handling parallel requests

2015-10-12 Thread Xiao Li
Hi Tarek, It is hard to answer your question. Are these requests similar? Are you caching results or intermediate results in your application? Or is your throughput requirement very high? Are you throttling the number of concurrent requests? ... As Akhil said, Kafka might help in your

TaskMemoryManager. cleanUpAllAllocatedMemory -> Memory leaks ???

2015-10-12 Thread Lei Wu
Dear all, I'm reading source code of TaskMemoryManager.java and I got stuck in the last function, that is cleanUpAllAllocatedMemory. What confuses me is the comments for this function : "A non-zero return value can be used to detect memory leaks". And from Executor.scala where this function is

Data skipped while writing Spark Streaming output to HDFS

2015-10-12 Thread Sathiskumar
I'm running a Spark Streaming application every 10 seconds; its job is to consume data from Kafka, transform it, and store it into HDFS based on the key, i.e. one file per unique key. I'm using Hadoop's saveAsHadoopFile() API to store the output, and I see that a file gets generated for every

Re: SQLContext within foreachRDD

2015-10-12 Thread Daniel Haviv
Just wanted to make sure. Thanks. Daniel On Mon, Oct 12, 2015 at 1:07 PM, Adrian Tanase wrote: > Not really, unless you’re doing something wrong (e.g. Call collect or > similar). > > In the foreach loop you’re typically registering a temp table, by > converting an RDD to

Re: SQLContext within foreachRDD

2015-10-12 Thread Adrian Tanase
Not really, unless you’re doing something wrong (e.g. Call collect or similar). In the foreach loop you’re typically registering a temp table, by converting an RDD to data frame. All the subsequent queries are executed in parallel on the workers. I haven’t built production apps with this
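A minimal sketch of the pattern described above, assuming Spark 1.5-era APIs; the Event case class, table name, and query are placeholders. The foreachRDD body runs on the driver, but the SQL it issues is still executed in parallel on the workers:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    case class Event(id: String, value: Int)

    def process(events: DStream[Event], sqlContext: SQLContext): Unit = {
      import sqlContext.implicits._
      events.foreachRDD { rdd =>
        // Convert the RDD to a DataFrame and register it as a temp table.
        val df = rdd.toDF()
        df.registerTempTable("events")
        sqlContext.sql("SELECT id, SUM(value) FROM events GROUP BY id").show()
      }
    }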

SQLContext within foreachRDD

2015-10-12 Thread Daniel Haviv
Hi, As things that run inside foreachRDD run at the driver, does that mean that if we use SQLContext inside foreachRDD the data is sent back to the driver and only then the query is executed or is it executed at the executors? Thank you. Daniel

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread YiZhi Liu
Hi Joseph, Thank you for clarifying the motivation that you setup a different API for ml pipelines, it sounds great. But I still think we could extract some common parts of the training & inference procedures for ml and mllib. In ml.classification.LogisticRegression, you simply transform the

how to use SharedSparkContext

2015-10-12 Thread Fengdong Yu
Hi, How do I add a dependency in build.sbt if I want to use SharedSparkContext? I've added spark-core, but it doesn't work (cannot find SharedSparkContext).
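A hedged guess at what is missing: SharedSparkContext lives in spark-core's test sources, which are published under a separate "tests" classifier, so the regular spark-core jar alone does not provide it. Something like the following build.sbt fragment should pull it in (the version is an assumption, and a scalatest dependency may also be needed):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.1",
      // Test-jar containing SharedSparkContext and other test helpers.
      "org.apache.spark" %% "spark-core" % "1.5.1" % "test" classifier "tests"
    )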

Re: "dynamically" sort a large collection?

2015-10-12 Thread Adrian Tanase
I think you’re looking for the flatMap (or flatMapValues) operator – you can do something like sortedRdd.flatMapValues(v => if (v % 2 == 0) Some(v / 2) else None). Then you need to sort again. -adrian From: Yifan LI Date: Monday, October 12, 2015 at 1:03 PM To: spark users Subject:

"dynamically" sort a large collection?

2015-10-12 Thread Yifan LI
Hey, I need to scan a large "key-value" collection as below: 1) sort it on an attribute of “value” 2) scan it one by one, from element with largest value 2.1) if the current element matches a pre-defined condition, its value will be reduced and the element will be inserted back to collection.

Re: "dynamically" sort a large collection?

2015-10-12 Thread Yifan LI
Hey Adrian, Thanks for your fast reply. :) Actually the “pre-condition” is not fixed in real application, e.g. it would change based on counting of previous unmatched elements. So I need to use iterator operator, rather than flatMap-like operators… Besides, do you have any idea on how to avoid

Re: "dynamically" sort a large collection?

2015-10-12 Thread Yifan LI
Shiwei, yes, you might be right. Thanks. :) Best, Yifan LI > On 12 Oct 2015, at 12:55, 郭士伟 wrote: > > I think this is not a problem Spark can solve effectively, because RDDs are > immutable. Every time you want to change an RDD, you create a new one, and > sort again.

Re: TaskMemoryManager. cleanUpAllAllocatedMemory -> Memory leaks ???

2015-10-12 Thread Ted Yu
Please note the block where cleanUpAllAllocatedMemory() is called: } finally { val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory() if (freedMemory > 0) { I think the intention is that allocated memory should have been freed by the time we reach the finally

Creating Custom Receiver for Spark Streaming

2015-10-12 Thread Something Something
Is it safe to assume that Spark will always create a single instance of Custom Receiver? Or would it create multiple instances on each node in a cluster? Wondering if I need to worry about receiving the same message on different nodes etc. Please help. Thanks.

Re: pagination spark sq

2015-10-12 Thread Richard Hillegas
Hi Ravi, If you build Spark with Hive support, then your sqlContext variable will be an instance of HiveContext and you will enjoy the full capabilities of the Hive query language rather than the more limited capabilities of Spark SQL. However, even Hive QL does not support the OFFSET clause, at
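A hedged sketch of one common workaround for the missing OFFSET clause, assuming a HiveContext (window functions require it in Spark 1.5) and a registered table; the table and column names are illustrative only. Rows are numbered with a window function and then filtered to the desired page:

    val page = sqlContext.sql(
      """SELECT FirstName, LastName, JobTitle FROM (
        |  SELECT FirstName, LastName, JobTitle,
        |         row_number() OVER (ORDER BY LastName) AS rn
        |  FROM employees
        |) t
        |WHERE rn BETWEEN 41 AND 50""".stripMargin)
    page.show()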

Re: What is the abstraction for a Worker process in Spark code

2015-10-12 Thread Shixiong Zhu
Which mode are you using? For standalone, it's org.apache.spark.deploy.worker.Worker. For Yarn and Mesos, Spark just submits its request to them and they will schedule processes for Spark. Best Regards, Shixiong Zhu 2015-10-12 20:12 GMT+08:00 Muhammad Haseeb Javed <11besemja...@seecs.edu.pk> :

read from hive tables and write back to hive

2015-10-12 Thread Hafiz Mujadid
Hi! How can I read/write data from/to Hive? Is it necessary to compile Spark with the hive profile to interact with Hive? Which Maven dependencies are required to interact with Hive? I could not find good documentation to follow step by step to get working with Hive. Thanks -- View this
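A minimal sketch of the usual approach, assuming a Spark build with -Phive (or the spark-hive_2.10 artifact on the classpath), a hive-site.xml visible to the driver, and tables that already exist in the metastore; the table names are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveReadWrite"))
    val hiveContext = new HiveContext(sc)

    // Read from an existing Hive table into a DataFrame.
    val df = hiveContext.sql("SELECT * FROM source_table")

    // Write back to Hive by registering a temp table and inserting from it.
    df.registerTempTable("staging")
    hiveContext.sql("INSERT INTO TABLE target_table SELECT * FROM staging")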

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi, if you can help it would be great, as I am stuck and don't know how to remove the compilation error in callUdf when we pass three parameters: the function name as a string, the column name as col, and the lit function. Please guide. On Oct 11, 2015 1:05 AM, "Umesh Kacha" wrote: > Hi any idea? how do

Is there any better way of writing this code

2015-10-12 Thread kali.tumm...@gmail.com
Hi All, just wondering, is there any better way of writing the code below? I am new to Spark and I feel what I wrote is pretty simple, basic, and straightforward; is there any better way of writing it using the functional paradigm? val QuoteRDD=quotefile.map(x => x.split("\\|")). filter(line

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-12 Thread Akhil Das
Can you look a bit deeper in the executor logs? It could be filling up the memory and getting killed. Thanks Best Regards On Mon, Oct 5, 2015 at 8:55 PM, Lan Jiang wrote: > I am still facing this issue. Executor dies due to > > org.apache.avro.AvroRuntimeException:

Re: Is there any better way of writing this code

2015-10-12 Thread Sean Owen
A few small-scale code style tips which might improve readability: You can in some cases write something like _.split("...") instead of x => x.split("...") A series of if-else conditions on x(15) can be replaced with "x(15) match { case "B" => ... }" .isEmpty may be more readable than == "" With
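A small sketch applying the tips above to a row split on "|"; the field index comes from the original code, while the labels and the function shape are illustrative only:

    import org.apache.spark.rdd.RDD

    def parseQuotes(quotefile: RDD[String]): RDD[(String, Array[String])] =
      quotefile
        .map(_.split("\\|"))                    // _.split instead of x => x.split
        .filter(fields => !fields(15).isEmpty)  // .isEmpty instead of == ""
        .map { fields =>
          fields(15) match {                    // pattern match instead of an if-else chain
            case "B" => ("BUY", fields)
            case "S" => ("SELL", fields)
            case _   => ("OTHER", fields)
          }
        }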

Re: Spark retrying task indefinitely

2015-10-12 Thread Adrian Tanase
To answer your question specifically - you can bump the value of spark.streaming.kafka.maxRetries (see the configuration guide: http://spark.apache.org/docs/latest/configuration.html). That being said, you should avoid this by adding some validation to your deserialization / parse code. A quick
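A hedged sketch of the "validate while parsing" suggestion: wrap the parse in scala.util.Try so that a malformed record becomes None instead of an exception that keeps the task retrying. The record format and field names are hypothetical:

    import scala.util.Try

    def parseRecord(raw: String): Option[(String, Long)] =
      Try {
        val Array(key, value) = raw.split(",", 2)
        (key, value.trim.toLong)
      }.toOption

    // Usage on a stream of raw strings: stream.flatMap(raw => parseRecord(raw)) silently drops bad records.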

What is the abstraction for a Worker process in Spark code

2015-10-12 Thread Muhammad Haseeb Javed
I understand that each executor that is processing a Spark job is emulated in Spark code by the Executor class in Executor.scala and CoarseGrainedExecutorBackend is the abstraction which facilitates communication between an Executor and the Driver. But what is the abstraction for a Worker process

Re: How to change verbosity level and redirect verbosity to file?

2015-10-12 Thread Akhil Das
Have a look http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs Thanks Best Regards On Mon, Oct 5, 2015 at 9:42 PM, wrote: > Hi, > > I would like to read the full spark-submit log once a job

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Richard Eggert
I think the problem may be that callUDF takes a DataType indicating the return type of the UDF as its second argument. On Oct 12, 2015 9:27 AM, "Umesh Kacha" wrote: > Hi if you can help it would be great as I am stuck don't know how to > remove compilation error in callUdf

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
Umesh: Have you tried calling callUdf without the lit() parameter ? Cheers On Mon, Oct 12, 2015 at 6:27 AM, Umesh Kacha wrote: > Hi if you can help it would be great as I am stuck don't know how to > remove compilation error in callUdf when we pass three parameters

pagination spark sq

2015-10-12 Thread Ravisankar Mani
Hi everyone, Can you please share an optimized query for pagination in Spark SQL? MS SQL Server supports an "offset" clause for selecting a specific range of rows. Please find the following query: Select BusinessEntityID,[FirstName], [LastName],[JobTitle] from HumanResources.vEmployee Order By

Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-10-12 Thread Trystan Leftwich
Hi everyone, Since upgrading to spark 1.5 I've been unable to create and use UDF's when we run in thrift server mode. Our setup: We start the thrift-server running against yarn in client mode, (we've also built our own spark from github branch-1.5 with the following args, -Pyarn -Phive

Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Experts, I am facing an issue in which a Spark job runs infinitely. I start the Spark job on a 4 node cluster, and when there is no space left on one machine it runs infinitely. Has anyone come across such an issue? Is there any way to kill the job when such a thing happens? -- Thanks and

Re: Handling expirying state in UDF

2015-10-12 Thread Davies Liu
Could you try this?

    my_token = None

    def my_udf(a):
        global my_token
        if my_token is None:
            # create token
        # do something

In this way, a new token will be created for each pyspark task On Sun, Oct 11, 2015 at 5:14 PM, brightsparc wrote: > Hi, > > I have

RE: Question about GraphX connected-components

2015-10-12 Thread John Lilley
Thanks Igor, We are definitely thinking along these lines, but I am hoping to shortcut our search of the Spark/GraphX tuning parameter space to find a reasonable set of starting points. There are simultaneous questions of “what should we expect form GraphX?” and “what are the best parameters

Re: Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Ted, *Do you have monitoring put in place to detect 'no space left' scenario ?* No, I don't have any monitoring in place. *By 'way to kill job', do you mean automatic kill ?* Yes, I need some way by which my job will detect this failure and kill itself. Thanks, Saurav On Mon, Oct 12, 2015

Spark UI consuming lots of memory

2015-10-12 Thread pnpritchard
Hi, In my application, the Spark UI is consuming a lot of memory, especially the SQL tab. I have set the following configurations to reduce the memory consumption: - spark.ui.retainedJobs=20 - spark.ui.retainedStages=40 - spark.sql.ui.retainedExecutions=0 However, I still get OOM errors in the

Re: Spark job is running infinitely

2015-10-12 Thread Ted Yu
Do you have monitoring put in place to detect 'no space left' scenario ? By 'way to kill job', do you mean automatic kill ? Please include the release of Spark, command line for 'spark-submit' in your reply. Thanks On Mon, Oct 12, 2015 at 10:07 AM, Saurav Sinha wrote:

Re: SQLcontext changing String field to Long

2015-10-12 Thread shobhit gupta
Great, that helped a lot, issue is fixed now. :) Thank you very much! On Sun, Oct 11, 2015 at 12:29 PM, Yana Kadiyska wrote: > In our case, we do not actually need partition inference so the > workaround was easy -- instead of using the path as >

Does Spark use more memory than MapReduce?

2015-10-12 Thread YaoPau
I had this question come up and I'm not sure how to answer it. A user said that, for a big job, he thought it would be better to use MapReduce since it writes to disk between iterations instead of keeping the data in memory the entire time like Spark generally does. I mentioned that Spark can

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi Ted, thanks. If I don't pass the lit function, then how can I tell the percentile_approx function to give me 25% or 50%, like we do in Hive: percentile_approx(mycol, 0.25)? Regards On Mon, Oct 12, 2015 at 7:20 PM, Ted Yu wrote: > Umesh: > Have you tried calling callUdf without the

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
Using spark-shell, I did the following exercise (master branch) : SQL context available as sqlContext. scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value") df: org.apache.spark.sql.DataFrame = [id: string, value: int] scala> sqlContext.udf.register("simpleUDF", (v: Int,
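A sketch reconstructing the spark-shell exercise above, since the original message is truncated mid-registration; the body of simpleUDF and the lit(2) argument are assumptions. It assumes a Spark 1.5 spark-shell, where sqlContext.implicits._ and org.apache.spark.sql.functions._ are already in scope:

    val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
    // Register a UDF by name so it can be invoked through callUDF.
    sqlContext.udf.register("simpleUDF", (v: Int, w: Int) => v * w)
    // The extra constant argument is passed as a Column via lit().
    df.select($"id", callUDF("simpleUDF", $"value", lit(2))).show()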

Re: Spark job is running infinitely

2015-10-12 Thread Ted Yu
I would suggest you install monitoring service. 'no space left' condition would affect other services, not just Spark. For the second part, Spark experts may have answer for you. On Mon, Oct 12, 2015 at 11:09 AM, Saurav Sinha wrote: > Hi Ted, > > *Do you have

Re: Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Ted, Which monitoring service would you suggest for me. Thanks, Saurav On Mon, Oct 12, 2015 at 11:55 PM, Saurav Sinha wrote: > Hi Ted, > > Which would you suggest for monitoring service for me. > > Thanks, > Saurav > > On Mon, Oct 12, 2015 at 11:47 PM, Ted Yu

Re: Spark job is running infinitely

2015-10-12 Thread Ted Yu
Each vendor provides its own monitoring tool. There is http://ganglia.sourceforge.net/ and there is http://bosun.org/, which I haven't used. FYI On Mon, Oct 12, 2015 at 11:25 AM, Saurav Sinha wrote: > Hi Ted, > > Which would you suggest for monitoring service for me. > >

Re: Spark job is running infinitely

2015-10-12 Thread Saurav Sinha
Hi Ted, Which would you suggest for monitoring service for me. Thanks, Saurav On Mon, Oct 12, 2015 at 11:47 PM, Ted Yu wrote: > I would suggest you install monitoring service. > 'no space left' condition would affect other services, not just Spark. > > For the second

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Sorry, I forgot to mention that I am using Spark 1.4.1, as callUdf has been available since Spark 1.4.0 as per the Javadocs On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha wrote: > Hi Ted thanks much for the detailed answer and appreciate your efforts. Do > we need to register Hive UDFs? > >

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi Ted thanks much for the detailed answer and appreciate your efforts. Do we need to register Hive UDFs? sqlContext.udf.register("percentile_approx");???//is it valid? I am calling Hive UDF percentile_approx in the following manner which gives compilation error

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
SQL context available as sqlContext. scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value") df: org.apache.spark.sql.DataFrame = [id: string, value: int] scala> df.select(callUDF("percentile_approx",col("value"), lit(0.25))).show() +--+

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi Ted, thanks much. Are you saying the above code will work only in 1.5.1? I tried upgrading to 1.5.1 but I have found a potential bug: my Spark job creates hive partitions using hiveContext.sql("insert into partitions"), and when I use Spark 1.5.1 I can't see any partition files (orc files) getting created in

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
I would suggest using http://search-hadoop.com/ to find literature on the empty partitions directory problem. If there is no answer there, please start a new thread with the following information: release of Spark release of hadoop code snippet symptom Cheers On Mon, Oct 12, 2015 at 12:08 PM,

Re: Why Spark Stream job stops producing outputs after a while?

2015-10-12 Thread Uthayan Suthakar
Any suggestions? Is there any way that I could debug this issue? Cheers, Uthay On 11 October 2015 at 18:39, Uthayan Suthakar wrote: > Hello all, > > I have a Spark Streaming job that runs and produces results successfully. > However, after a few days the job stop

Re: Why Spark Stream job stops producing outputs after a while?

2015-10-12 Thread Tathagata Das
Are you sure that there are no log4j errors in the driver logs? What if you try enabling debug level? And what does the streaming UI say? On Mon, Oct 12, 2015 at 12:50 PM, Uthayan Suthakar < uthayan.sutha...@gmail.com> wrote: > Any suggestions? Is there any way that I could debug this issue? >

Re: why would a spark Job fail without throwing run-time exceptions?

2015-10-12 Thread pnpritchard
I'm not sure why spark is not showing the runtime exception in the logs. However, I think I can point out why the stage is failing. 1. "lineMapToStockPriceInfoObjectRDD.map(new stockDataFilter(_).requirementsMet.get)" The ".get" will throw a runtime exception when "requirementsMet" is None. I
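A hedged sketch of one way to avoid the ".get on None" failure described above, assuming requirementsMet returns an Option: use flatMap so rows failing the requirement are dropped instead of throwing. The case class, the condition, and the function names are stand-ins for the poster's code, which is not fully shown:

    import org.apache.spark.rdd.RDD

    case class StockPriceInfo(symbol: String, price: Double)

    // Hypothetical stand-in for the poster's requirementsMet check.
    def requirementsMet(info: StockPriceInfo): Option[StockPriceInfo] =
      if (info.price > 0) Some(info) else None

    // flatMap keeps only the Some cases; no runtime exception on None.
    def filterValid(rdd: RDD[StockPriceInfo]): RDD[StockPriceInfo] =
      rdd.flatMap(info => requirementsMet(info))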

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread DB Tsai
Hi Liu, In ML, even after extracting the data into RDD, the versions between MLib and ML are quite different. Due to legacy design, in MLlib, we use Updater for handling regularization, and this layer of abstraction also does adaptive step size which is only for SGD. In order to get it working

OutOfMemoryError OOM ByteArrayOutputStream.hugeCapacity

2015-10-12 Thread Alexander Pivovarov
I have one job which fails if I enable KryoSerializer. I use Spark 1.5.0 on emr-4.1.0. Settings: spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryoserializer.buffer.max 1024m spark.executor.memory 47924M spark.yarn.executor.memoryOverhead 5324 The

Re: Does Spark use more memory than MapReduce?

2015-10-12 Thread Jean-Baptiste Onofré
Hi, I think it depends on the storage level you use (MEMORY, DISK, or MEMORY_AND_DISK). By default, micro-batching as designed in Spark requires more memory but is much faster: when you use MapReduce, each map and reduce task has to use HDFS as the backend of the data pipeline between the tasks.
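A minimal sketch of choosing a storage level explicitly, assuming a spark-shell session where sc is available; the input path is hypothetical. cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while MEMORY_AND_DISK spills partitions to disk when memory is tight:

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///input/data.txt")
    data.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed
    println(data.count())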

Re: Why Spark Stream job stops producing outputs after a while?

2015-10-12 Thread Uthayan Suthakar
Hi Tathagata, Yes, I'm pretty sure there are no errors in driver logs and workers logs. The streaming UI appears to be showing that job is running fine. I can see the tasks are being completed. I also, see the receiver is picking up new messages (in UI). I'm running this same job twice that read

Storing object in spark streaming

2015-10-12 Thread Something Something
In my custom receiver for Spark Streaming I've got code such as this: messages.toArray().foreach(msg => { val m = msg.asInstanceOf[Message] store(m.getBody) }) Instead of 'body', which is of type 'String', I would rather pass the

Re: ClassCastException when use spark1.5.1

2015-10-12 Thread pnpritchard
I'm not sure why this would have have changed from 1.4.1 to 1.5.1, but I have seen similar exceptions in my code. It seems to me that values with SQL type "ArrayType" are stored internally as an instance of the Scala "WrappedArray" class (regardless if is was originally an instance of Scala

Dev Setup for Python/Scala Packages

2015-10-12 Thread bsowell
Hi, I'm working on a Spark package involving both Scala and Python development and I'm trying to figure out the right dev setup. This will be a private internal package, at least for now, so I'm not concerned about publishing. I've been using the sbt-spark-package plugin on the scala side,

Problem installing Spark on Windows 8

2015-10-12 Thread Marco Mistroni
Hi all, I have downloaded spark-1.5.1-bin-hadoop2.4 and extracted it on my machine, but when I go to the \bin directory and invoke spark-shell I get the following exception. Could anyone assist please? I followed the instructions in the ebook Learning Spark, but maybe the instructions are old? kr marco

Re: Storing object in spark streaming

2015-10-12 Thread Jeff Nadler
Your receiver must extend Receiver[String]. Try changing it to extend Receiver[Message]? On Mon, Oct 12, 2015 at 2:03 PM, Something Something < mailinglist...@gmail.com> wrote: > In my custom receiver for Spark Streaming I've got code such as this: > >
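A hedged sketch of that suggestion: parameterize the receiver with the message type so that store() accepts whole Message objects rather than just their String body. The Message class and the polling loop are placeholders, not the original poster's code:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    case class Message(body: String)

    class MessageReceiver extends Receiver[Message](StorageLevel.MEMORY_AND_DISK_2) {
      override def onStart(): Unit = {
        new Thread("message-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store(Message("payload"))   // store the whole object, not just its body
              Thread.sleep(1000)
            }
          }
        }.start()
      }
      override def onStop(): Unit = { }
    }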

DEBUG level log in receivers and executors

2015-10-12 Thread Spark Newbie
Hi Spark users, Is there an easy way to turn on DEBUG logs in receivers and executors? Setting sparkContext.setLogLevel seems to turn on DEBUG level only on the Driver. Thanks,

Calculate Hierarchical root as new column

2015-10-12 Thread epheatt
I am new to spark (and scala) and I need to figure out the correct approach to add a column to a dataframe that is calculated using a recursive approach based on the value of another column against data within the same dataframe. Given a simple hierarchical adjacency schema of my source dataframe

Re: DEBUG level log in receivers and executors

2015-10-12 Thread Kartik Mathur
You can create log4j.properties under your SPARK_HOME/conf and set up these properties -

    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout

Spark Streaming Latency in practice

2015-10-12 Thread xweb
What kind of latency are people achieving in production using Spark Streaming? Is it in the 1 second+ range, or have people been able to achieve latency in, say, the 250 ms range? Any best practices on achieving sub-second latency, if that is even possible? -- View this message in context:

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Keiji Yoshida
Thanks for the reply. Yes, I know 1.5.1 is available but I need to use 1.5.0 because I need to run Spark applications on Cloud Dataproc ( https://cloud.google.com/dataproc/ ) which supports only 1.5.0. On Tue, Oct 13, 2015 at 12:13 PM, Ted Yu wrote: > I got 404 as well. >

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Sean Owen
IIRC we have seen transient problems like this from Maven Central, where files were visible to some but not others or reappeared later. I can't get that file now either, but, try again later first. If it's really persistent we may have to ask what's going on with Maven Central. It's not a Spark or

Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread y
When I access the following URL, I often get a 404 error and I cannot get the POM file of "spark-streaming_2.10-1.5.0.pom". http://central.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.5.0/spark-streaming_2.10-1.5.0.pom Are there any problems inside the maven repository? Are there any

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Ted Yu
I got 404 as well. BTW 1.5.1 has been released. I was able to access: http://central.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.5.1/spark-streaming_2.10-1.5.1.pom FYI On Mon, Oct 12, 2015 at 8:09 PM, y wrote: > When I access the following URL, I often

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Keiji Yoshida
Thank you very much for your reply. I noticed this issue first yesterday. I'm hoping this issue will be solved soon. On Tue, Oct 13, 2015 at 1:57 PM, Sean Owen wrote: > IIRC we have seen transient problems like this from Maven Central, > where files were visible to some but

Re: TaskMemoryManager. cleanUpAllAllocatedMemory -> Memory leaks ???

2015-10-12 Thread Ted Yu
Please take a look at the design doc attached to SPARK-1 The answer is on page 2 of that doc. On Mon, Oct 12, 2015 at 8:55 AM, Ted Yu wrote: > Please note the block where cleanUpAllAllocatedMemory() is called: > } finally { > val freedMemory =

Re: Spark on YARN using Java 1.8 fails

2015-10-12 Thread Abhisheks
Did you get any resolution for this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-using-Java-1-8-fails-tp24925p25039.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Cannot get spark-streaming_2.10-1.5.0.pom from the maven repository

2015-10-12 Thread Ted Yu
I checked commit history of streaming/pom.xml There should be no difference between 1.5.0 and 1.5.1 You can download 1.5.1's pom.xml and rename it so that you get unblocked. On Mon, Oct 12, 2015 at 8:50 PM, Keiji Yoshida wrote: > Thanks for the reply. > > Yes, I

Re: Spark UI consuming lots of memory

2015-10-12 Thread Nicholas Pritchard
As an update, I did try disabling the ui with "spark.ui.enabled=false", but the JobListener and SQLListener still consume a lot of memory, leading to OOM error. Has anyone encountered this before? Is the only solution just to increase the driver heap size? Thanks, Nick On Mon, Oct 12, 2015 at

Re: Spark UI consuming lots of memory

2015-10-12 Thread Nicholas Pritchard
I set those configurations by passing to spark-submit script: "bin/spark-submit --conf spark.ui.retainedJobs=20 ...". I have verified that these configurations are being passed correctly because they are listed in the environments tab and also by counting the number of job/stages that are listed.

Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-12 Thread David Bess
Greetings all, Excited to be learning spark. I am working through the "Learning Spark" book and I am having trouble getting Spark installed and running. This is what I have done so far. I installed Spark from here: http://spark.apache.org/downloads.html selecting 1.5.1, prebuilt for

Re: Spark UI consuming lots of memory

2015-10-12 Thread Shixiong Zhu
In addition, you cannot turn off JobListener and SQLListener now... Best Regards, Shixiong Zhu 2015-10-13 11:59 GMT+08:00 Shixiong Zhu : > Is your query very complicated? Could you provide the output of `explain` > your query that consumes an excessive amount of memory? If

Re: Spark UI consuming lots of memory

2015-10-12 Thread Shixiong Zhu
Could you show how you set the configurations? You need to set these configurations before creating SparkContext and SQLContext. Moreover, the history server doesn't support the SQL UI, so "spark.eventLog.enabled=true" doesn't work now. Best Regards, Shixiong Zhu 2015-10-13 2:01 GMT+08:00
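A minimal sketch of setting the retention limits programmatically before the contexts are created, as suggested above; the values mirror the ones from the original post and the app name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.ui.retainedJobs", "20")
      .set("spark.ui.retainedStages", "40")
      .set("spark.sql.ui.retainedExecutions", "0")
    val sc = new SparkContext(conf)         // configurations must be in place before this
    val sqlContext = new SQLContext(sc)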

Re: Unexplained sleep time

2015-10-12 Thread Shixiong Zhu
You don't need to care about this sleep. It runs in a separate thread and usually won't affect the performance of your application. Best Regards, Shixiong Zhu 2015-10-09 6:03 GMT+08:00 yael aharon : > Hello, > I am working on improving the performance of our Spark on

Re: Spark UI consuming lots of memory

2015-10-12 Thread Shixiong Zhu
Is your query very complicated? Could you provide the output of `explain` your query that consumes an excessive amount of memory? If this is a small query, there may be a bug that leaks memory in SQLListener. Best Regards, Shixiong Zhu 2015-10-13 11:44 GMT+08:00 Nicholas Pritchard <

Re: Creating Custom Receiver for Spark Streaming

2015-10-12 Thread Shixiong Zhu
Each ReceiverInputDStream will create one Receiver. If you only use one ReceiverInputDStream, there will be only one Receiver in the cluster. But if you create multiple ReceiverInputDStreams, there will be multiple Receivers. Best Regards, Shixiong Zhu 2015-10-12 23:47 GMT+08:00 Something
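A small sketch of that point, assuming an existing StreamingContext ssc; the hosts and port are placeholders. Each receiver-based input DStream gets its own Receiver, and the streams can be unioned for downstream processing:

    val streamA = ssc.socketTextStream("host1", 9999)   // Receiver #1
    val streamB = ssc.socketTextStream("host2", 9999)   // Receiver #2
    val combined = streamA.union(streamB)
    combined.print()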

Re: Data skipped while writing Spark Streaming output to HDFS

2015-10-12 Thread Shixiong Zhu
Could you print the content of RDD to check if there are multiple values for a key in a batch? Best Regards, Shixiong Zhu 2015-10-12 18:25 GMT+08:00 Sathiskumar : > I'm running a Spark Streaming application for every 10 seconds, its job is > to > consume data from

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread YiZhi Liu
Hi Tsai, Thank you for pointing out the implementation details which I missed. Yes I saw several jira issues with the intercept, regularization and standardization, I just didn't realize it made such a big impact. Thanks again. 2015-10-13 4:32 GMT+08:00 DB Tsai : > Hi Liu, > >

How to add transformations as pipeline Stages ?

2015-10-12 Thread Nethaji Chandrasiri
Hi, I need to add my transformations to pipeline as stages. Can you please provide me some sort of a guideline on how to do that in *java*? Thanks -- *Nethaji Chandrasiri* *Software Engineering* *Intern; WSO2, Inc.; http://wso2.com * Mobile : +94 (0) 779171059