Worker is KILLED for no reason

2015-06-15 Thread nizang
Hi, I'm using the new 1.4.0 installation and ran a job there. The job finished and everything seems fine. Yet when I enter the application, I can see that the job is marked as KILLED under Removed Executors (ExecutorID | Worker | Cores | Memory | State | Logs) for executor 0

Re: How to set up a Spark Client node?

2015-06-15 Thread ayan guha
I feel he wanted to ask about workers. In that case, please launch workers on Nodes 3, 4, 5 (and/or Nodes 8, 9, 10 etc). You need to go to each worker and start the worker daemon with the master URL:Port (typically 7077) as a parameter (so workers can talk to the master). You should be able to see 1 master and N

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Haopu Wang
Akhil, thank you for the response. I want to explore more. Suppose the application is just monitoring an HDFS folder and outputs the word count of each streaming batch back into HDFS. When I kill the application _before_ Spark takes a checkpoint, then after recovery, Spark will resume the processing

Re: Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-15 Thread shahab
Thanks Akhil, it solved the problem. best /Shahab On Fri, Jun 12, 2015 at 8:50 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Looks like your spark is not able to pick up the HADOOP_CONF. To fix this, you can actually add jets3t-0.9.0.jar to the classpath

Re: How to set up a Spark Client node?

2015-06-15 Thread Akhil Das
I'm assuming by spark-client you mean the Spark driver program. In that case you can pick any machine (say Node 7), create your driver program on it, and use spark-submit to submit it to the cluster; or, if you create the SparkContext within your driver program (specifying all the properties), then
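
For what it's worth, a minimal Scala driver sketch of what Akhil describes (object name, host name and master URL are placeholders, not from the thread): this is the kind of program you would package and run from the chosen client machine, either via spark-submit or directly, since it sets the master itself.

    import org.apache.spark.{SparkConf, SparkContext}

    object MyDriver {
      def main(args: Array[String]): Unit = {
        // point the driver at the standalone master; the URL is an assumed example
        val conf = new SparkConf()
          .setAppName("my-app")
          .setMaster("spark://node7:7077")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).sum())   // trivial job to verify the setup
        sc.stop()
      }
    }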

RE: Optimizing Streaming from Websphere MQ

2015-06-15 Thread Chaudhary, Umesh
Hi Akhil, Thanks for your response. I have 10 cores in total across my 3 machines and I am using 5-10 receivers. I have tried to test the number of records processed per second by varying the number of receivers. If I am using 10 receivers (i.e. one receiver for each core), then I am not

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this? val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0))) val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x)) val joined_data = huge_data.join(gender_data) joined_data.take(1000) It's Scala btw, the Python API should
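
The same snippet, reformatted as a runnable Scala sketch with comments; the file paths and the tab delimiter are assumptions carried over from the reply above.

    // first.csv: key in column 1, value in column 0 (as in the reply above)
    val huge_data = sc.textFile("/path/to/first.csv")
      .map(x => (x.split("\t")(1), x.split("\t")(0)))
    // second.csv: key in column 0, keep the whole line as the value
    val gender_data = sc.textFile("/path/to/second.csv")
      .map(x => (x.split("\t")(0), x))
    val joined_data = huge_data.join(gender_data)   // inner join on the key
    joined_data.take(1000)                          // grab the top M rows (M = 1000 here)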

settings from props file seem to be ignored in mesos

2015-06-15 Thread Gary Ogden
I'm loading these settings from a properties file: spark.executor.memory=256M spark.cores.max=1 spark.shuffle.consolidateFiles=true spark.task.cpus=1 spark.deploy.defaultCores=1 spark.driver.cores=1 spark.scheduler.mode=FAIR Once the job is submitted to mesos, I can go to the spark UI for that

tasks won't run on mesos when using fine grained

2015-06-15 Thread Gary Ogden
My Mesos cluster has 1.5 CPU and 17GB free. If I set: conf.set("spark.mesos.coarse", "true"); conf.set("spark.cores.max", "1"); in the SparkConf object, the job runs fine in the Mesos cluster. But if I comment out those settings above so that it defaults to fine-grained, the task never finishes.

sql.catalyst.ScalaReflection scala.reflect.internal.MissingRequirementError

2015-06-15 Thread patcharee
Hi, I use spark 0.14. I tried to create a dataframe from the RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF("column1", "column2", "column3") //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a case class. How can I fix this? Exception in
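
Not from the thread, but a common pattern that avoids this particular MissingRequirementError (it often comes from a case class defined inside a method/closure, or from mismatched Scala versions on the classpath): define the case class at top level and import the SQLContext implicits before calling toDF. The fields of Record4Dim_2 below are invented for illustration.

    import org.apache.spark.sql.SQLContext

    case class Record4Dim_2(column1: String, column2: Int, column3: Double)  // hypothetical fields

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // brings toDF into scope for RDDs of case classes

    val pairVarRDD = sc.parallelize(Seq(Record4Dim_2("a", 1, 2.0)))
    val partitionedTestDF2 = pairVarRDD.toDF("column1", "column2", "column3")
    partitionedTestDF2.show()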

*Metrics API is odd in MLLib

2015-06-15 Thread Sam

Re: UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-06-15 Thread François Garillot
You may want to have a look there if you (or the original author) are using JDK 8.0: https://issues.apache.org/jira/browse/SPARK-4193 https://issues.apache.org/jira/browse/SPARK-4543 Cheers, On Sat, Apr 4, 2015 at 10:39 PM mas mas.ha...@gmail.com wrote: Hi All, I am trying to build spark

Running spark1.4 inside intellij idea HttpServletResponse - ClassNotFoundException

2015-06-15 Thread Wwh 吴
name := "SparkLeaning" version := "1.0" scalaVersion := "2.10.4" //scalaVersion := "2.11.2" libraryDependencies ++= Seq( //"org.apache.hive" % "hive-jdbc" % "0.13.0" //"io.spray" % "spray-can" % "1.3.1", //"io.spray" % "spray-routing" % "1.3.1", "io.spray" % "spray-testkit" % "1.3.1" % "test", "io.spray" %% "spray-json" %
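
Not confirmed by this thread, but one workaround sometimes suggested when javax.servlet classes such as HttpServletResponse cannot be found while running Spark from an IDE is to add the servlet API to the sbt build explicitly; the coordinates below exist on Maven Central, and the versions are assumptions.

    // pull the servlet API in directly so the IDE classpath has it
    libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"
    // and keep spark-core at compile scope (not "provided") when running inside the IDE
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"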

Re: Join highly skewed datasets

2015-06-15 Thread Night Wolf
How far did you get? On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: We use Scoobi + MR to perform joins and we particularly use blockJoin() API of scoobi /** Perform an equijoin with another distributed list where this list is considerably smaller * than the

Re: BigDecimal problem in parquet file

2015-06-15 Thread Bipin Nag
Hi Davies, I have tried recent 1.4 and 1.5-snapshot to 1) open the parquet and save it again or 2) apply a schema to the RDD and save the dataframe as parquet, but now I get this error (right in the beginning): java.lang.OutOfMemoryError: Java heap space at

Re: Running spark1.4 inside intellij idea HttpServletResponse - ClassNotFoundException

2015-06-15 Thread Tarek Auel
Hey, I had some similar issues in the past when I used Java 8. Are you using Java 7 or 8? (It's just an idea, because I had a similar issue.) On Mon 15 Jun 2015 at 6:52 am Wwh 吴 wwyando...@hotmail.com wrote: name := "SparkLeaning" version := "1.0" scalaVersion := "2.10.4" //scalaVersion := "2.11.2"

Using queueStream

2015-06-15 Thread anshu shukla
JavaDStream<String> inputStream = ssc.queueStream(rddQueue); Can this rddQueue be dynamic in nature? If yes, how do I make it keep running until rddQueue is exhausted? Is there any other way to build rddQueue from a dynamically updated normal Queue? -- Thanks Regards, SERC-IISC Anshu Shukla
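
A minimal Scala sketch of the same idea (the thread uses the Java API): queueStream accepts a mutable queue that another thread keeps appending RDDs to, so the stream keeps running for as long as the StreamingContext runs, regardless of whether the queue is momentarily empty. The names and batch interval below are assumptions.

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val rddQueue = new mutable.Queue[RDD[String]]()
    val ssc = new StreamingContext(sc, Seconds(1))
    val inputStream = ssc.queueStream(rddQueue, oneAtATime = true)
    inputStream.count().print()
    ssc.start()

    // producer side: keep pushing new RDDs into the queue while the context runs
    for (i <- 1 to 30) {
      rddQueue.synchronized { rddQueue += sc.parallelize(Seq(s"event-$i")) }
      Thread.sleep(1000)
    }
    ssc.stop(stopSparkContext = false)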

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Todd Nist
Hi Proust, Is it possible to see the query you are running, and can you run EXPLAIN EXTENDED to show the physical plan for the query? To generate the plan you can do something like this from $SPARK_HOME/bin/beeline: 0: jdbc:hive2://localhost:10001 explain extended select * from YourTableHere;

number of partitions in join: Spark documentation misleading!

2015-06-15 Thread mrm
Hi all, I was looking for an explanation of the number of partitions for a joined RDD. The documentation of Spark 1.3.1 says that for distributed shuffle operations like reduceByKey and join, the default is the largest number of partitions in a parent RDD.
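
A small Scala sketch of both behaviors (illustrative data, not from the post): with no explicit argument and no spark.default.parallelism set, the joined RDD typically picks up the larger parent's partition count, and join also takes an explicit numPartitions argument.

    val a = sc.parallelize(Seq((1, "a"), (2, "b")), 4)   // 4 partitions
    val b = sc.parallelize(Seq((1, "x"), (2, "y")), 8)   // 8 partitions

    val joinedDefault  = a.join(b)       // usually 8 partitions (largest parent)
    val joinedExplicit = a.join(b, 16)   // partition count set explicitly

    println(joinedDefault.partitions.length)   // 8, assuming defaults
    println(joinedExplicit.partitions.length)  // 16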

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Proust GZ Feng
Thanks a lot Akhil. After trying some suggestions in the tuning guide, there seems to be no improvement at all. Below is the job detail when running locally (8 cores), which took 3 min to complete the job; we can see it is the map operation that took most of the time, looks like the mapPartitions took too long

Not getting event logs = spark 1.3.1

2015-06-15 Thread Tsai Li Ming
Hi, I have this in my spark-defaults.conf (same for hdfs): spark.eventLog.enabled true spark.eventLog.dir file:/tmp/spark-events spark.history.fs.logDirectory file:/tmp/spark-events While the app is running, there is a “.inprogress” directory. However when the job

Re: Spark standalone mode and kerberized cluster

2015-06-15 Thread Borja Garrido Bear
I tried running the job in a standalone cluster and I'm getting this: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: worker-node/0.0.0.0;

Re: Not albe to run FP-growth Example

2015-06-15 Thread masoom alam
Even the following POM is also not working: <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> <parent

Creating RDD from Iterable from groupByKey results

2015-06-15 Thread Nirav Patel
I am trying to create a new RDD based on a given PairRDD. I have a PairRDD with few keys, but each key has large (about 100k) values. I want to somehow repartition and make each `Iterable[v]` into RDD[v] so that I can further apply map, reduce, sortBy etc. effectively on those values. I am sensing
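
One possible workaround, sketched here as an assumption rather than an answer from the thread: flatten the grouped values back into (key, value) pairs and repartition, so the ~100k values per key are spread across many partitions before further map/sortBy work.

    import org.apache.spark.rdd.RDD

    // hypothetical keyed data: few keys, many values per key
    val pairRdd: RDD[(Int, String)] = sc.parallelize(1 to 1000000).map(i => (i % 3, "v" + i))

    val spread = pairRdd.groupByKey()                     // (key, Iterable[String])
      .flatMap { case (k, vs) => vs.map(v => (k, v)) }    // back to (key, value) pairs
      .repartition(200)                                   // 200 is an arbitrary example

    spread.map { case (k, v) => (k, v.length) }.take(5)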

How does one decide no of executors/cores/memory allocation?

2015-06-15 Thread shreesh
How do I decide how many partitions to break my data into, and how many executors I should have? I guess memory and cores will be allocated based on the number of executors I have. Thanks -- View this message in context:

Re: How does one decide no of executors/cores/memory allocation?

2015-06-15 Thread gaurav sharma
When you submit a job, Spark breaks it down into stages, as per the DAG. The stages run transformations or actions on the RDDs. Each RDD consists of N partitions. The number of tasks created by Spark to execute a stage is equal to the number of partitions. Every task is executed on the cores utilized
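
A tiny illustration of the tasks-equal-partitions point (the numbers are arbitrary):

    val rdd = sc.parallelize(1 to 100000, 8)   // 8 partitions
    rdd.map(_ * 2).count()                     // the map stage runs 8 tasks, one per partition
    println(rdd.partitions.length)             // 8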

ALS predictALL not completing

2015-06-15 Thread afarahat
Hello; I have a data set of about 80 million users and 12,000 items (very sparse). I can get the training part working no problem (the model has 20 factors). However, when I try using predictAll for 80 million x 10 items, the job does not complete. When I use a smaller data set, say 500k or a

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Night Wolf
Looking at the logs of the executor, looks like it fails to find the file; e.g. for task 10323.0 15/06/16 13:43:13 ERROR output.FileOutputCommitter: Hit IOException trying to rename

Re: RDD of Iterable[String]

2015-06-15 Thread nir
Have you found an answer to this? I am also looking for exactly the same solution. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-of-Iterable-String-tp15016p23329.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Night Wolf
Hi guys, Using Spark 1.4, trying to save a dataframe as a table, a really simple test, but I'm getting a bunch of NPEs; the code I'm running is very simple: qc.read.parquet("/user/sparkuser/data/staged/item_sales_basket_id.parquet").write.format("parquet").saveAsTable("is_20150617_test2") Logs of

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Yin Huai
I saw it once but I was not clear how to reproduce it. The jira I created is https://issues.apache.org/jira/browse/SPARK-7837. More information will be very helpful. Were those errors from speculative tasks or regular tasks (the first attempt of the task)? Is this error deterministic (can you

Re: flatmapping with other data

2015-06-15 Thread dizzy5112
Sorry, cut and paste error; the resulting data set I want is this: ({(101,S)=3},piece_of_data_1)) ({(101,S)=3},piece_of_data_2)) ({(101,S)=1},piece_of_data_3)) ({(109,S)=2},piece_of_data_3)) -- View this message in context:

Re: Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”

2015-06-15 Thread Deng Ching-Mallete
Hi Raj, Since the number of executor cores is equivalent to the number of tasks that can be executed in parallel in the executor, in effect the 6G of memory configured for an executor is being shared by 6 tasks, after factoring in the memory allocated for caching and task execution. I would

Help!!!Map or join one large datasets then suddenly remote Akka client disassociated

2015-06-15 Thread Jia Yu
Hi folks, Help me! I met a very weird problem. I really need some help!! Here is my situation: Case: Assign keys to two datasets (one is 96GB with 2.7 billion records and one 1.5GB with 30k records) via MapPartitions first, and

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Night Wolf
Hey Yin, Thanks for the link to the JIRA. I'll add details to it. But I'm able to reproduce it, at least in the same shell session; every time I do a write I get a random number of tasks failing on the first run with the NPE. Using dynamic allocation of executors in YARN mode. No speculative

Error using spark 1.3.0 with maven

2015-06-15 Thread Ritesh Kumar Singh
Hi, I'm getting this error while running spark as a java project using maven : 15/06/15 17:11:38 INFO SparkContext: Running Spark version 1.3.0 15/06/15 17:11:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/06/15

DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-15 Thread Mohammad Tariq
Hello list, The method *insertIntoJDBC(url: String, table: String, overwrite: Boolean)* provided by Spark DataFrame allows us to copy a DataFrame into a JDBC DB table. Similar functionality is provided by the *createJDBCTable(url: String, table: String, allowExisting: Boolean)* method. But if you
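
For reference, a hedged Scala sketch of the two methods named above (Spark 1.3/1.4-era DataFrame API; the JDBC URL, table name, and source path are placeholders). Since the write happens per partition, repartitioning the DataFrame before the call is one way to influence how many concurrent tasks do the inserting.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/path/to/input.parquet")   // hypothetical source

    val url = "jdbc:postgresql://db-host:5432/mydb?user=me&password=secret"
    df.createJDBCTable(url, "my_table", allowExisting = false)              // create table + insert
    df.repartition(8).insertIntoJDBC(url, "my_table", overwrite = false)    // append with 8 writing tasks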

Spark job throwing “java.lang.OutOfMemoryError: GC overhead limit exceeded”

2015-06-15 Thread diplomatic Guru
Hello All, I have a Spark job that throws java.lang.OutOfMemoryError: GC overhead limit exceeded. The job is trying to process a file of size 4.5G. I've tried the following Spark configuration: --num-executors 6 --executor-memory 6G --executor-cores 6 --driver-memory 3G I tried increasing more

akka configuration not found

2015-06-15 Thread Ritesh Kumar Singh
Hi, Though my project has nothing to do with akka, I'm getting this error : Exception in thread main com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version' at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124) at

Does spark performance really scale out with multiple machines?

2015-06-15 Thread Wang, Ningjun (LNG-NPV)
I am trying to measure how Spark standalone cluster performance scales out with multiple machines. I did a test of training the SVM model, which is heavy in in-memory computation. I measured the run time for Spark standalone clusters of 1 - 3 nodes; the results are as follows: 1 node: 35 minutes 2 nodes: 30.1

spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-15 Thread Serega Sheypak
Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes, each of them has a Spark worker. The problem is that Spark runs 869 tasks to read 3 lines: select bar from foo. I've tried these properties: #try to avoid 769 tasks per dummy select foo from bar query

Re: How can I use Tachyon with SPARK?

2015-06-15 Thread Himanshu Mehra
Hi June, As I understand your problem, you are running Spark 1.3 and want to use Tachyon with it. What you need to do is simply build the latest Spark and Tachyon and set some configuration in Spark. In fact Spark 1.3 has spark/core/pom.xml; you have to find the core folder in your Spark home and

Problem: Custom Receiver for getting events from a Dynamic Queue

2015-06-15 Thread anshu shukla
I have written a custom receiver for converting the tuples in the dynamic Queue/EventGen to the DStream. But I don't know why it is only processing data for some time (3-4 sec.) and then shows the Queue as empty. Any suggestions please.. --code // public class JavaCustomReceiver extends
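
A Scala sketch (not the poster's code) of the usual fix for this symptom: the receive loop should keep polling until the receiver itself is stopped, rather than exiting as soon as the queue is empty. The queue type and storage level below are assumptions.

    import java.util.concurrent.{BlockingQueue, TimeUnit}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class QueueReceiver(queue: BlockingQueue[String])
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("queue-receiver") {
          override def run(): Unit = {
            // keep running until the receiver is stopped, even if the queue is momentarily empty
            while (!isStopped()) {
              val item = queue.poll(100, TimeUnit.MILLISECONDS)
              if (item != null) store(item)
            }
          }
        }.start()
      }

      def onStop(): Unit = { /* the loop above exits once isStopped() returns true */ }
    }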

missing part of the file while using newHadoopApi

2015-06-15 Thread igor.berman
Hi, has anyone experienced a problem uploading to S3 with the s3n protocol via Spark's newHadoopApi, where the job completes successfully (there is a _SUCCESS marker), but in reality one of the parts of the file is missing? Thanks in advance. PS: we are trying s3a now (which needs an upgrade to hadoop 2.7)

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
There are a lot of variables to consider. I'm not an expert on Spark, and my ML knowledge is rudimentary at best, but here are some questions whose answers might help us to help you: - What type of Spark cluster are you running (e.g., Stand-alone, Mesos, YARN)? - What does the HTTP UI

Re: Spark application in production without HDFS

2015-06-15 Thread nsalian
Hi, Spark on YARN should help in the memory management for Spark jobs. Here is a good starting point: https://spark.apache.org/docs/latest/running-on-yarn.html YARN integrates well with HDFS and should be a good solution for a large cluster. What specific features are you looking for that HDFS

Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
I just wanted to clarify - when I said you hit your maximum level of parallelism, I meant that the default number of partitions might not be large enough to take advantage of more hardware, not that there was no way to increase your parallelism - the documentation I linked gives a few suggestions
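
A couple of the usual knobs, sketched in Scala with placeholder paths and values, for raising the partition count so additional nodes actually receive work:

    // ask for more input splits up front
    val lines = sc.textFile("hdfs:///data/input", minPartitions = 64)
    // or reshuffle an existing RDD into more partitions
    val training = lines.repartition(128)
    // shuffle operations also take a numPartitions argument, and
    // spark.default.parallelism sets the fallback for the whole job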

RE: Join between DStream and Periodically-Changing-RDD

2015-06-15 Thread Evo Eftimov
“turn (keep turning) your HDFS file (Batch RDD) into a stream of messages (outside spark streaming)” – what I meant by that was “turn the Updates to your HDFS dataset into Messages” and send them as such to spark streaming From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Monday, June

[SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-15 Thread Haopu Wang
I use the attached program to test checkpoint. It's quite simple. When I run the program second time, it will load checkpoint data, that's expected, however I see NPE in driver log. Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very much! == logs ==

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Akhil Das
Have a look here https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Spark Experts I have played with Spark several weeks, after some time testing, a reduce operation of DataFrame cost 40s on a

RE: Join between DStream and Periodically-Changing-RDD

2015-06-15 Thread Evo Eftimov
Then go for the second option I suggested - simply turn (keep turning) your HDFS file (Batch RDD) into a stream of messages (outside spark streaming) – then spark streaming consumes and aggregates the messages FOR THE RUNTIME LIFETIME of your application in some of the following ways: 1.

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Akhil Das
I think it should be fine, that's the whole point of check-pointing (in case of driver failure etc). Thanks Best Regards On Mon, Jun 15, 2015 at 6:54 AM, Haopu Wang hw...@qilinsoft.com wrote: Hi, can someone help to confirm the behavior? Thank you! -Original Message- From: Haopu

Re: Spark application in production without HDFS

2015-06-15 Thread rahulkumar-aws
Hi, if your data is not so huge you can use either Cloudera's or HDP's free stack. Cloudera Express is 100% open source and free. - Software Developer SigmoidAnalytics, Bangalore -- View this message in context:

Re: Limit Spark Shuffle Disk Usage

2015-06-15 Thread rahulkumar-aws
Check this link: https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html Hope this will solve your problem. - Software Developer Sigmoid