Save and read parquet from the same path

2015-03-04 Thread Karlson
Hi all, what would happen if I save a RDD via saveAsParquetFile to the same path that RDD is originally read from? Is that a safe thing to do in Pyspark? Thanks!

Re: Nested Case Classes (Found and Required Same)

2015-03-04 Thread Bojan Kostic
Did you find any other way around this issue? I just found out that I have a 22-column data set... And now I am searching for the best solution. Has anyone else run into this problem? Best Bojan
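For reference, the workaround usually suggested for the 22-field case class limit in Scala 2.10 is to build the schema programmatically instead of using a case class. A minimal sketch, assuming the Spark 1.3+ types API (in 1.2 the method is applySchema and the imports differ); the column names and input path are invented for illustration:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    // Build a schema with more than 22 columns without a case class.
    val schema = StructType((1 to 25).map(i => StructField(s"col$i", StringType, nullable = true)))

    // Turn each input line into a Row with the same number of fields.
    val rowRDD = sc.textFile("data.csv").map(_.split(",")).map(parts => Row(parts: _*))

    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.registerTempTable("wide_table")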

RE: Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread Cheng, Hao
I’ve tried with the latest code and it seems to work; which version are you using, Shahab? From: yana [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, March 4, 2015 8:47 PM To: shahab; user@spark.apache.org Subject: RE: Does SparkSQL support ..... having count (fieldname) in SQL statement? I think the

Re: scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Imran Rashid
This doesn't involve spark at all, I think this is entirely an issue with how scala deals w/ primitives and boxing. Often it can hide the details for you, but IMO it just leads to far more confusing errors when things don't work out. The issue here is that your map has value type Any, which
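A minimal illustration of the boxing Imran describes (plain Scala, nothing Spark-specific):

    // A scala.Double stored under type Any gets boxed to java.lang.Double.
    val m: Map[Long, Any] = Map(1L -> 2.5)
    val v: Any = m(1L)
    println(v.getClass)              // prints: class java.lang.Double
    println(v.isInstanceOf[Double])  // true: the check unboxes transparently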

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Imran Rashid
You can set the number of partitions dynamically -- it's just a parameter to a method, so you can compute it however you want, it doesn't need to be some static constant: val dataSizeEstimate = yourMagicFunctionToEstimateDataSize() val numberOfPartitions =
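A sketch of where such a computed value might go; the size estimator and the 128 MB target below are illustrative placeholders, not real APIs:

    // Derive a partition count from an application-level size estimate.
    val dataSizeEstimate: Long = yourMagicFunctionToEstimateDataSize() // placeholder
    val targetPartitionBytes = 128L * 1024 * 1024                      // assumed target size
    val numberOfPartitions = math.max(1, (dataSizeEstimate / targetPartitionBytes).toInt)
    val repartitioned = rdd.repartition(numberOfPartitions)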

Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Hello, We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers). We use spark-submit to start an application. We got the following error which leads to a failed stage: Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, most recent failure: Lost task

Re: how to save Word2VecModel

2015-03-04 Thread Xiangrui Meng
+user On Wed, Mar 4, 2015, 8:21 AM Xiangrui Meng men...@gmail.com wrote: You can use the save/load implementation in naive Bayes as reference: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala Ping me on the JIRA page to
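Until a built-in save/load exists, a hedged workaround sketch (assuming the model exposes getVectors, and with a placeholder path) is to persist the learned vectors yourself:

    // Persist the learned word vectors and restore them later (sketch).
    val vectors: Map[String, Array[Float]] = model.getVectors
    sc.parallelize(vectors.toSeq).saveAsObjectFile("/models/word2vec") // path is a placeholder

    // Reading the vectors back:
    val restored = sc.objectFile[(String, Array[Float])]("/models/word2vec").collectAsMap()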

Re: issue Running Spark Job on Yarn Cluster

2015-03-04 Thread sachin Singh
Not yet. Please let me know if you find a solution. Regards, Sachin On 4 Mar 2015 21:45, mael2210 [via Apache Spark User List] ml-node+s1001560n21909...@n3.nabble.com wrote: Hello, I am facing the exact same issue. Could you solve the problem? Kind regards

Re: Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread shahab
Thanks Cheng, my problem was a misspelling which I just fixed; unfortunately the exception message sometimes does not pinpoint the exact reason. Sorry, my bad. On Wed, Mar 4, 2015 at 5:02 PM, Cheng, Hao hao.ch...@intel.com wrote: I’ve tried with latest code, seems it works, which

Re: GraphX path traversal

2015-03-04 Thread Robin East
Actually your Pregel code works for me: import org.apache.spark._ import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"), (4L,"Four"),(5L,"Five"),(6L,"Six")) val edgelist = Array(Edge(6,5,"6 to 5"),Edge(5,4,"5 to 4"),Edge(4,3,"4 to 3"),
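For context, such a setup typically finishes by building the Graph from those arrays; a hedged sketch continuing the truncated example above (assuming the string attributes shown):

    // Parallelize the vertex and edge arrays and build the graph (sketch).
    val vertices: RDD[(VertexId, String)] = sc.parallelize(vertexlist)
    val edges: RDD[Edge[String]] = sc.parallelize(edgelist)
    val graph = Graph(vertices, edges)
    println(graph.triplets.count())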

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Anusha Shamanur
I tried. I still get the same error. 15/03/04 09:01:50 INFO parse.ParseDriver: Parsing command: select * from TableName where value like '%Restaurant%' 15/03/04 09:01:50 INFO parse.ParseDriver: Parse Completed. 15/03/04 09:01:50 INFO metastore.HiveMetaStore: 0: get_table : db=default

Passing around SparkContext with in the Driver

2015-03-04 Thread kpeng1
Hi All, I am trying to create a class that wraps functionalities that I need; some of these functions require access to the SparkContext, which I would like to pass in. I know that the SparkContext is not serializable, and I am not planning on passing it to worker nodes or anything, I just want

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Srini Karri
Hi Todd and Marcelo, Thanks for helping me. I was able to launch the history server on Windows without any issues. One problem I am running into right now: I always get the message "no completed applications found" in the history server UI. But I was able to browse through these applications from

Does anyone integrate HBASE on Spark

2015-03-04 Thread sandeep vura
Hi Sparkers, How do I integrate HBase with Spark? I'd appreciate any replies! Regards, Sandeep.v

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Follow up: We re-retried, this time after *decreasing* spark.parallelism. It was set to 16000 before (5 times the number of cores in our cluster). It is now down to 6400 (2 times the number of cores), and it got past the point where it failed before. Does the MapOutputTracker have a limit on

Spark logs in standalone clusters

2015-03-04 Thread Thomas Gerber
Hello, I was wondering where all the log files were located on a standalone cluster: 1. the executor logs are in the work directory on each slave machine (stdout/stderr) - I've noticed that GC information is in stdout, and stage information in stderr - *Could we get more

configure number of cached partition in memory on SparkSQL

2015-03-04 Thread Judy Nash
Hi, I am tuning a hive dataset on Spark SQL deployed via thrift server. How can I change the number of partitions after caching the table on thrift server? I have tried the following but still getting the same number of partitions after caching: Spark.default.parallelism

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Srini Karri
Yes. I do see files, actually I missed copying the other settings: spark.master spark:// skarri-lt05.redmond.corp.microsoft.com:7077 spark.eventLog.enabled true spark.rdd.compress true spark.storage.memoryFraction 1 spark.core.connection.ack.wait.timeout 6000

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
I meant spark.default.parallelism of course. On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber thomas.ger...@radius.com wrote: Follow up: We re-retried, this time after *decreasing* spark.parallelism. It was set to 16000 before, (5 times the number of cores in our cluster). It is now down to
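For reference, a sketch of how that setting is typically applied; the 6400 figure is the one quoted in the thread, and the app name is a placeholder:

    // Roughly 2x the total number of cores in the cluster, per the thread above.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("my-app")                              // placeholder name
      .set("spark.default.parallelism", "6400")
    // equivalently with spark-submit: --conf spark.default.parallelism=6400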

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Marcelo Vanzin
On Wed, Mar 4, 2015 at 10:08 AM, Srini Karri skarri@gmail.com wrote: spark.executor.extraClassPath D:\\Apache\\spark-1.2.1-bin-hadoop2\\spark-1.2.1-bin-hadoop2.4\\bin\\classes spark.eventLog.dir D:/Apache/spark-1.2.1-bin-hadoop2/spark-1.2.1-bin-hadoop2.4/bin/tmp/spark-events

spark sql median and standard deviation

2015-03-04 Thread tridib
Hello, Is there a built-in function for getting the median and standard deviation in Spark SQL? Currently I am converting the schemaRdd to a DoubleRdd and calling doubleRDD.stats(), but that still does not have the median. What is the most efficient way to get the median? Thanks Regards Tridib

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-04 Thread Srini Karri
Hi Marcelo, I found the problem from http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3cCAL+LEBfzzjugOoB2iFFdz_=9TQsH=DaiKY=cvydfydg3ac5...@mail.gmail.com%3e this link. The problem is that the application I am running is not generating the APPLICATION_COMPLETE file. If I add this file
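A frequent cause of a missing APPLICATION_COMPLETE marker (an assumption here, not confirmed in this thread) is the application exiting without stopping its SparkContext; a sketch:

    // Ensure the context is stopped so the event log is finalized and the
    // history server lists the run under completed applications.
    try {
      // ... job logic ...
    } finally {
      sc.stop()
    }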

Re: Save and read parquet from the same path

2015-03-04 Thread Michael Armbrust
No, this is not safe to do. On Wed, Mar 4, 2015 at 7:14 AM, Karlson ksonsp...@siberie.de wrote: Hi all, what would happen if I save a RDD via saveAsParquetFile to the same path that RDD is originally read from? Is that a safe thing to do in Pyspark? Thanks!
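A common workaround sketch (an assumption, not an official recommendation): write the result to a separate path and only swap it in afterwards, e.g. with the Hadoop FileSystem API; paths below are placeholders.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Read, transform, and write to a *different* path than the input.
    val data = sqlContext.parquetFile("/data/events")
    val updated = data                                   // ...transformations...
    updated.saveAsParquetFile("/data/events_tmp")

    // Swap the directories only after the write has succeeded.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.delete(new Path("/data/events"), true)
    fs.rename(new Path("/data/events_tmp"), new Path("/data/events"))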

Re: Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
Thanks! On Wed, Mar 4, 2015 at 3:58 PM, Michael Armbrust mich...@databricks.com wrote: It is somewhat out of date, but here is what we have so far: https://github.com/marmbrus/sql-typed On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com wrote: I am pretty sure that I

Driver disassociated

2015-03-04 Thread Thomas Gerber
Hello, sometimes, in the *middle* of a job, the job stops (status is then seen as FINISHED in the master). There isn't anything wrong in the shell/submit output. When looking at the executor logs, I see logs like this: 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker

Re: spark sql median and standard deviation

2015-03-04 Thread Ted Yu
Please take a look at DoubleRDDFunctions.scala : /** Compute the mean of this RDD's elements. */ def mean(): Double = stats().mean /** Compute the variance of this RDD's elements. */ def variance(): Double = stats().variance /** Compute the standard deviation of this RDD's elements.
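There is no built-in median in those functions; a sketch of one way to compute an exact median by sorting (expensive on large data, so treat it as a starting point rather than the recommended approach):

    import org.apache.spark.rdd.RDD

    // Exact median via a full sort and index lookup.
    def median(rdd: RDD[Double]): Double = {
      val indexed = rdd.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
      val n = indexed.count()
      if (n % 2 == 1) indexed.lookup(n / 2).head
      else (indexed.lookup(n / 2 - 1).head + indexed.lookup(n / 2).head) / 2.0
    }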

Integer column in schema RDD from parquet being considered as string

2015-03-04 Thread gtinside
Hi, I am converting a jsonRDD to parquet by saving it as a parquet file (saveAsParquetFile): cacheContext.jsonFile(file:///u1/sample.json).saveAsParquetFile(sample.parquet) I am reading the parquet file and registering it as a table: val parquet = cacheContext.parquetFile(sample_trades.parquet)

Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
I am pretty sure that I saw a presentation where SparkSQL could be executed with static analysis, however I cannot find the presentation now, nor can I find any documentation or research papers on the topic. So, I am curious if there is indeed any work going on for this topic. The two things I

Re: Spark SQL Static Analysis

2015-03-04 Thread Michael Armbrust
It is somewhat out of date, but here is what we have so far: https://github.com/marmbrus/sql-typed On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com wrote: I am pretty sure that I saw a presentation where SparkSQL could be executed with static analysis, however I cannot

Re: Does anyone integrate HBASE on Spark

2015-03-04 Thread gen tang
Hi, There are some examples in spark/example https://github.com/apache/spark/tree/master/examples and there are also some examples in spark package http://spark-packages.org/. And I find this blog http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html is quite good. Hope it
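The pattern used in the spark/examples HBaseTest example is roughly the following; a sketch, with the table name as a placeholder:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Read an HBase table as an RDD of (row key, Result) pairs.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(hbaseRDD.count())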

Re: issue Running Spark Job on Yarn Cluster

2015-03-04 Thread roni
Look at the logs: yarn logs --applicationId <applicationId>. That should give the error. On Wed, Mar 4, 2015 at 9:21 AM, sachin Singh sachin.sha...@gmail.com wrote: Not yet, Please let. Me know if you found solution, Regards Sachin On 4 Mar 2015 21:45, mael2210 [via Apache Spark User List]

Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread kpeng1
Hi All, I am currently having problem with the maven dependencies for version 1.2.0 of spark-core and spark-hive. Here are my dependencies: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.2.0</version> </dependency> <dependency>

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Marcelo Vanzin
Hi Kevin, If you're using CDH, I'd recommend using the CDH repo [1], and also the CDH version when building your app. [1] http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng kpe...@gmail.com wrote:

Re: spark master shut down suddenly

2015-03-04 Thread lisendong
I’m sorry, but how do I look at the mesos logs? Where are they? On March 4, 2015 at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can check in the mesos logs and see whats really happening. Thanks Best Regards On Wed, Mar 4, 2015 at 3:10 PM, lisendong lisend...@163.com

distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In my code I did something like the following, as suggested in the spark docs: val streams = (1 to numReceivers).map(_ => ssc.receiverStream(new MyKafkaReceiver()))

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Marcelo Vanzin
Seems like someone set up m2.mines.com as a mirror in your pom file or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is in a messed up state). On Wed, Mar 4, 2015 at 3:49 PM, kpeng1 kpe...@gmail.com wrote: Hi All, I am currently having problem with the maven dependencies for

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
Also, I was experiencing another problem which might be related: Error communicating with MapOutputTracker (see email in the ML today). I just thought I would mention it in case it is relevant. On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber thomas.ger...@radius.com wrote: 1.2.1 Also, I was

RE: Where can I find more information about the R interface forSpark?

2015-03-04 Thread Haopu Wang
Thanks, it's an active project. Will it be released with Spark 1.3.0? From: 鹰 [mailto:980548...@qq.com] Sent: Thursday, March 05, 2015 11:19 AM To: Haopu Wang; user Subject: Re: Where can I find more information about the R interface forSpark? you can

Re: RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Zhan Zhang
It uses a HashPartitioner to distribute the records to different partitions, but the key is just an integer spread evenly across the output partitions. From the code, each resulting partition will get a very similar number of records. Thanks. Zhan Zhang On Mar 4, 2015, at 3:47 PM, Du Li

Extra output from Spark run

2015-03-04 Thread cjwang
When I run Spark 1.2.1, I see these displays that weren't in the previous releases: [Stage 12:= (6 + 1) / 16] [Stage 12:(8 + 1) / 16] [Stage 12:==

Re: Where can I find more information about the R interface for Spark?

2015-03-04 Thread haopu
Do you have any update on SparkR? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-can-I-find-more-information-about-the-R-interface-for-Spark-tp155p21922.html

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Marcelo, Thanks. The one in the CDH repo fixed it :) On Wed, Mar 4, 2015 at 4:37 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Kevin, If you're using CDH, I'd recommend using the CDH repo [1], and also the CDH version when building your app. [1]

Re: Where can I find more information about the R interface forSpark?

2015-03-04 Thread 鹰
You can search for SparkR on Google or search for it on GitHub.

Re: Driver disassociated

2015-03-04 Thread Ted Yu
What release are you using ? SPARK-3923 went into 1.2.0 release. Cheers On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, sometimes, in the *middle* of a job, the job stops (status is then seen as FINISHED in the master). There isn't anything wrong in

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Ted, I am currently using CDH 5.3 distro, which has Spark 1.2.0, so I am not too sure about the compatibility issues between 1.2.0 and 1.2.1, that is why I would want to stick to 1.2.0. On Wed, Mar 4, 2015 at 4:25 PM, Ted Yu yuzhih...@gmail.com wrote: Kevin: You can try with 1.2.1 See

RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Du Li
Hi, My RDDs are created from a kafka stream. After receiving an RDD, I want to coalesce/repartition it so that the data will be processed on a set of machines in parallel, as evenly as possible. The number of processing nodes is larger than the number of receiving nodes. My question is how the
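For reference, a minimal sketch of repartitioning each received batch so processing spreads beyond the receiving nodes; the stream name and partition count are illustrative:

    // Spread each batch over more partitions than the number of receivers.
    val numPartitions = 40
    val repartitioned = kafkaStream.repartition(numPartitions)
    repartitioned.foreachRDD(rdd => /* process the batch */ ())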

how to update als in mllib?

2015-03-04 Thread lisendong
I'm using spark 1.0.0 with cloudera, but I want to use the new ALS code which supports more features, such as RDD cache level (MEMORY_ONLY), checkpointing, and so on. What is the easiest way to use the new ALS code? I only need the mllib ALS code, so maybe I don't need to update all the spark mllib

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
1.2.1 Also, I was using the following parameters, which are 10 times the default ones: spark.akka.timeout 1000 spark.akka.heartbeat.pauses 60000 spark.akka.failure-detector.threshold 3000.0 spark.akka.heartbeat.interval 10000 which should have helped *avoid* the problem if I understand

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
One really interesting thing is that when I'm using the netty-based spark.shuffle.blockTransferService, there are no OOM error messages (java.lang.OutOfMemoryError: Java heap space). Any idea why they don't show up here? I'm using Spark 1.2.1. Jianshi On Thu, Mar 5, 2015 at 1:56 PM, Jianshi Huang

Re: spark master shut down suddenly

2015-03-04 Thread Denny Lee
It depends on your setup but one of the locations is /var/log/mesos On Wed, Mar 4, 2015 at 19:11 lisendong lisend...@163.com wrote: I’m sorry, but how do I look at the mesos logs? Where are they? On March 4, 2015 at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can check in the mesos logs

Re: distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Figured it out: I need to override method preferredLocation() in MyReceiver class. On Wednesday, March 4, 2015 3:35 PM, Du Li l...@yahoo-inc.com.INVALID wrote: Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In
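A sketch of that override, reusing the receiver name from the thread; hostnames are placeholders and the receiver body is elided:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Pin each receiver instance to a preferred worker host.
    class MyKafkaReceiver(host: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      override def preferredLocation: Option[String] = Some(host)
      def onStart(): Unit = { /* connect and call store(...) */ }
      def onStop(): Unit = { }
    }

    val hosts = Seq("worker1", "worker2", "worker3", "worker4", "worker5")
    val streams = (0 until 8).map(i => ssc.receiverStream(new MyKafkaReceiver(hosts(i % hosts.size))))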

RE: distribution of receivers in spark streaming

2015-03-04 Thread Shao, Saisai
Hi Du, You could try to sleep for several seconds after creating the streaming context to let all the executors register; then all the receivers can be distributed across the nodes more evenly. Also, setting the locality is another way, as you mentioned. Thanks Jerry From: Du Li

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
There is some skew: 646 1640 SUCCESS PROCESS_LOCAL 200 / 2015/03/04 23:45:47 1.1 min 6 s 198.6 MB 21.1 GB 240.8 MB | 596 1590 SUCCESS PROCESS_LOCAL 30 / 2015/03/04 23:45:47 44 s 5 s 200.7 MB 4.8 GB 154.0 MB But I expect this kind of skewness to be quite common. Jianshi On Thu, Mar 5, 2015 at 3:48 PM,

RE: distribution of receivers in spark streaming

2015-03-04 Thread Shao, Saisai
Yes, hostname is enough. I think currently it is hard for user code to get the worker list from the standalone master. If you can get the Master object, you could get the worker list, but AFAIK it may be difficult to get this object. All you can do is manually get the worker list and

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
Yes, if one key has too many values, there is still a chance of hitting the OOM. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, March 5, 2015 3:49 PM To: Shao, Saisai Cc: Cheng, Hao; user Subject: Re: Having lots of FetchFailedException in join I see. I'm using

Re: scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, On Thu, Mar 5, 2015 at 12:20 AM, Imran Rashid iras...@cloudera.com wrote: This doesn't involve spark at all, I think this is entirely an issue with how scala deals w/ primitives and boxing. Often it can hide the details for you, but IMO it just leads to far more confusing errors when

In the HA master mode, how to identify the alive master?

2015-03-04 Thread Xuelin Cao
Hi, In our project, we use standalone dual masters + zookeeper for HA of the spark master. Now the problem is, how do we know which master is the currently alive master? We tried to read the info that the master stored in zookeeper. But we found there is no information to
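One hedged approach (an assumption about the standalone master web UI, so verify it against your deployment) is to poll each master's /json endpoint and keep the one whose status reports ALIVE; hosts below are placeholders:

    // Probe each master's web UI JSON and keep the one reporting ALIVE (sketch).
    val masters = Seq("master1:8080", "master2:8080")
    val alive = masters.find { m =>
      try scala.io.Source.fromURL(s"http://$m/json").mkString.contains("ALIVE")
      catch { case _: Throwable => false }
    }
    println(alive.getOrElse("no alive master found"))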

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Zhan Zhang
Which spark version did you use? I tried spark-1.2.1 and didn’t meet this problem. scala val m = hiveContext.sql( select * from testtable where value like '%Restaurant%') 15/03/05 02:02:30 INFO ParseDriver: Parsing command: select * from testtable where value like '%Restaurant%' 15/03/05

Re: Where can I find more information about the R interface forSpark?

2015-03-04 Thread Ted Yu
Please follow SPARK-5654 On Wed, Mar 4, 2015 at 7:22 PM, Haopu Wang hw...@qilinsoft.com wrote: Thanks, it's an active project. Will it be released with Spark 1.3.0? -- *From:* 鹰 [mailto:980548...@qq.com] *Sent:* Thursday, March 05, 2015 11:19 AM *To:*

RE: Passing around SparkContext with in the Driver

2015-03-04 Thread Kapil Malik
Replace val sqlContext = new SQLContext(sparkContext) with @transient val sqlContext = new SQLContext(sparkContext) -Original Message- From: kpeng1 [mailto:kpe...@gmail.com] Sent: 04 March 2015 23:39 To: user@spark.apache.org Subject: Passing around SparkContext with in the Driver Hi
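A sketch of the wrapper pattern with that change; the class, method, and path names are illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Mark driver-only members @transient so the wrapper never tries to ship them.
    class JobHelpers(@transient val sc: SparkContext) extends Serializable {
      @transient lazy val sqlContext = new SQLContext(sc)

      def countLines(path: String): Long = sc.textFile(path).count()
    }

    val helpers = new JobHelpers(sc)
    println(helpers.countLines("/tmp/input.txt"))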

Unable to Read/Write Avro RDD on cluster.

2015-03-04 Thread ๏̯͡๏
I am trying to read RDD avro, transform and write. I am able to run it locally fine but when i run onto cluster, i see issues with Avro. export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1 export SPARK_YARN_USER_ENV=CLASSPATH=/apache/hadoop/conf export

Re: distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Hi Jerry, Thanks for your response. Is there a way to get the list of currently registered/live workers? Even in order to provide preferredLocation, it would be safer to know which workers are active. Guess I only need to provide the hostname, right? Thanks,Du On Wednesday, March 4, 2015

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
Hi Jianshi, From my understanding, it may not be a problem of NIO or Netty; looking at your stack trace, the OOM occurred in EAOM (ExternalAppendOnlyMap). Theoretically EAOM can spill the data to disk if memory is not enough, but there might be some issues when the join key is skewed or a key

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
Hi Saisai, What's your suggested settings on monitoring shuffle? I've enabled -XX:+PrintGCDetails -XX:+PrintGCTimeStamps for GC logging. I found SPARK-3461 (Support external groupByKey using repartitionAndSortWithinPartitions) want to make groupByKey using external storage. It's still open

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
I think what you could do is to monitor through web UI to see if there’s any skew or other symptoms in shuffle write and read. For GC you could use the below configuration as you mentioned. From Spark core side, all the shuffle related operations can spill the data into disk and no need to

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
I see. I'm using core's join. The data might have some skewness (checking). I understand shuffle can spill data to disk but when consuming it, say in cogroup or groupByKey, it still needs to read the whole group elements, right? I guess OOM happened there when reading very large groups. Jianshi

using log4j2 with spark

2015-03-04 Thread Lior Chaga
Hi, Trying to run spark 1.2.1 w/ hadoop 1.0.4 on cluster and configure it to run with log4j2. Problem is that spark-assembly.jar contains log4j and slf4j classes compatible with log4j 1.2 in it, and so it detects it should use log4j 1.2 (

How to parse Json formatted Kafka message in spark streaming

2015-03-04 Thread Cui Lin
Friends, I'm trying to parse json formatted Kafka messages and then send them back to cassandra. I have two problems: 1. I got the exception below. How to check for an empty RDD? Exception in thread main java.lang.UnsupportedOperationException: empty collection at
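On the first problem, a common guard (a sketch; isEmpty() only appeared in later releases, so take(1) is the portable check) is to test each batch before acting on it; "stream" stands in for the input DStream:

    // Skip empty batches before parsing/writing.
    stream.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {
        // parse the JSON messages and write to Cassandra here
      }
    }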

Re: spark master shut down suddenly

2015-03-04 Thread Benjamin Stickel
Generally the logs are located in /var/log/mesos, but the definitive configuration can be found via the /etc/mesos-master/... configuration files. There should be a configuration file labeled log_dir. ps -ax | grep mesos should also show the configured value in its output, if it is configured.

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
I changed spark.shuffle.blockTransferService to nio and now I'm getting OOM errors, I'm doing a big join operation. 15/03/04 19:04:07 ERROR Executor: Exception in task 107.0 in stage 2.0 (TID 6207) java.lang.OutOfMemoryError: Java heap space at

Re: Parallel execution of JavaDStream/JavaPairDStream

2015-03-04 Thread Jishnu Prathap
14/06/19 15:03:36 WARN LoadSnappy: Snappy native library not loaded The problem is that the Snappy library is not loaded on the workers. This is because you would have written the System.loadLibrary call outside the map function, so it is not shipped to the workers. Regards Jishnu Prathap

Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processingmulti-line JSON?

2015-03-04 Thread Emre Sevinc
I've also tried the following: Configuration hadoopConfiguration = new Configuration(); hadoopConfiguration.set(multilinejsoninputformat.member, itemSet); JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDirectory, hadoopConfiguration, factory, false); but I

Re: spark.local.dir leads to Job cancelled because SparkContext was shut down

2015-03-04 Thread Akhil Das
When you say multiple directories, make sure those directories are available and Spark has permission to write to them. You can look at the worker logs to see the exact reason for the failure. Thanks Best Regards On Tue, Mar 3, 2015 at 6:45 PM, lisendong lisend...@163.com wrote: As

Re: Running Spark jobs via oozie

2015-03-04 Thread Felix C
We have gotten it to work... --- Original Message --- From: nitinkak001 nitinkak...@gmail.com Sent: March 3, 2015 7:46 AM To: user@spark.apache.org Subject: Re: Running Spark jobs via oozie I am also starting to work on this one. Did you get any solution to this issue?

Re: Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-04 Thread أنس الليثي
Thanks very much, I used it and it works fine for me. On 4 March 2015 at 11:56, Arush Kharbanda ar...@sigmoidanalytics.com wrote: For Java you can use the hive-jdbc connectivity jars to connect to Spark-SQL. The driver is inside the hive-jdbc Jar.

spark master shut down suddenly

2015-03-04 Thread lisendong
15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket connection and attempting reconnect 15/03/04 09:26:36 INFO ConnectionStateManager: State change: SUSPENDED 15/03/04 09:26:36 INFO

Re: spark master shut down suddenly

2015-03-04 Thread Akhil Das
You can check in the mesos logs and see whats really happening. Thanks Best Regards On Wed, Mar 4, 2015 at 3:10 PM, lisendong lisend...@163.com wrote: 15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard from server in 26679ms for sessionid 0x34bbf3313a8001b, closing

Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processingmulti-line JSON?

2015-03-04 Thread Tathagata Das
That could be a corner case bug. How do you add the 3rd party library to the class path of the driver? Through spark-submit? Could you give the command you used? TD On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com wrote: I've also tried the following: Configuration

Re: delay between removing the block manager of an executor, and marking that as lost

2015-03-04 Thread Akhil Das
You can look at the following - spark.akka.timeout - spark.akka.heartbeat.pauses from http://spark.apache.org/docs/1.2.0/configuration.html Thanks Best Regards On Tue, Mar 3, 2015 at 4:46 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, Is there any relation between removing

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Hm, what do you mean? You can control, to some extent, the number of partitions when you read the data, and can repartition if needed. You can set the default parallelism too so that it takes effect for most ops that create an RDD. One # of partitions is usually about right for all work (2x or so

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application?

Re: Is FileInputDStream returned by fileStream method a reliable receiver?

2015-03-04 Thread Tathagata Das
The file stream does not use a receiver. Maybe that was not clear in the programming guide; I am updating it for the 1.3 release right now, and I will make it more clear. And the file stream has full reliability. Read this in the programming guide.

Spark RDD Python, Numpy Shape command

2015-03-04 Thread rui li
I am a beginner to Spark, having some simple questions regarding the use of RDDs in python. Suppose I have a matrix called data_matrix. I pass it to an RDD using RDD_matrix = sc.parallelize(data_matrix) but I will have a problem if I want to know the dimensions of the matrix in Spark, because Spark

scala.Double vs java.lang.Double in RDD

2015-03-04 Thread Tobias Pfeiffer
Hi, I have a function with signature def aggFun1(rdd: RDD[(Long, (Long, Double))]): RDD[(Long, Any)] and one with def aggFun2[_Key: ClassTag, _Index](rdd: RDD[(_Key, (_Index, Double))]): RDD[(_Key, Double)] where all Double classes involved are scala.Double classes (according to

Is FileInputDStream returned by fileStream method a reliable receiver?

2015-03-04 Thread Emre Sevinc
Is FileInputDStream returned by fileStream method a reliable receiver? In the Spark Streaming Guide it says: There can be two kinds of data sources based on their *reliability*. Sources (like Kafka and Flume) allow the transferred data to be acknowledged. If the system receiving data from

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Jeff Zhang
Hi Sean, If you know a stage needs unusually high parallelism for example you can repartition further for that stage. The problem is we may not know whether high parallelism is needed, e.g. for the join operator, high parallelism may only be necessary for some datasets where lots of data can

Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processingmulti-line JSON?

2015-03-04 Thread Emre Sevinc
I'm adding this 3rd party library to my Maven pom.xml file so that it's embedded into the JAR I send to spark-submit: <dependency> <groupId>json-mapreduce</groupId> <artifactId>json-mapreduce</artifactId> <version>1.0-SNAPSHOT</version> <exclusions> <exclusion>

Re:

2015-03-04 Thread Akhil Das
You may look at https://issues.apache.org/jira/browse/SPARK-4516 Thanks Best Regards On Wed, Mar 4, 2015 at 12:25 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got this error message: 15/03/03 10:22:41 ERROR OneForOneBlockFetcher: Failed while starting block fetches

Re: Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-04 Thread Arush Kharbanda
For Java you can use the hive-jdbc connectivity jars to connect to Spark-SQL. The driver is inside the hive-jdbc Jar. http://hive.apache.org/javadocs/r0.11.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html
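For the Spark SQL Thrift server (which speaks the HiveServer2 protocol) a minimal JDBC sketch looks roughly like the following; host, port, and query are placeholders, and the driver class shown is the HiveServer2 one (org.apache.hive.jdbc.HiveDriver) rather than the older one linked above:

    import java.sql.DriverManager

    // Connect to the Thrift server over hive-jdbc and run a query (sketch).
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()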

Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Sarath Chandra
From the lines pointed to in the exception log, I figured out that my code is unable to get the spark context. To isolate the problem, I've written a small piece of code as below: import org.apache.spark.SparkConf; import org.apache.spark.SparkContext; public class Test { public static void

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Parallelism doesn't really affect the throughput as long as it's: - not less than the number of available execution slots, - ... and probably some low multiple of them to even out task size effects - not so high that the bookkeeping overhead dominates Although you may need to select different

Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread shahab
Hi, It seems that SparkSQL, even with the HiveContext, does not support SQL statements like: SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10; I get this exception: Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes:

Re: insert Hive table with RDD

2015-03-04 Thread patcharee
Hi, I guess the toDF() API is in spark 1.3, which requires a build from source code? Patcharee On 03. mars 2015 13:42, Cheng, Hao wrote: Using the SchemaRDD / DataFrame API via HiveContext Assume you're using the latest code, something probably like: val hc = new HiveContext(sc) import
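A rough sketch of that DataFrame route on Spark 1.3; the case class, data, and table name are illustrative, not from the thread:

    import org.apache.spark.sql.hive.HiveContext

    case class Record(key: Int, value: String)

    val hc = new HiveContext(sc)
    import hc.implicits._

    // Convert an RDD of case classes to a DataFrame and insert into an existing Hive table.
    val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
    rdd.toDF().insertInto("my_hive_table")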

Unable to submit spark job to mesos cluster

2015-03-04 Thread Sarath Chandra
Hi, I have a cluster running on CDH5.2.1 and I have a Mesos cluster (version 0.18.1). Through an Oozie java action I want to submit a Spark job to the mesos cluster. Before configuring it as an Oozie job I'm testing the java action from the command line and getting the exception below. While running I'm

RE: Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread yana
I think the problem is that you are using an alias in the having clause. I am not able to try just now, but see if HAVING count(*) > 2 works (i.e. don't use cnt). Original message From: shahab
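In other words, repeating the aggregate in HAVING rather than referencing the alias; a sketch using the query from the original post:

    // Repeat the aggregate expression in the HAVING clause instead of the alias.
    val result = hiveContext.sql(
      """SELECT category, count(1) AS cnt
        |FROM products
        |GROUP BY category
        |HAVING count(1) > 10""".stripMargin)
    result.collect().foreach(println)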

Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Arush Kharbanda
Why don't you formulate the query string before you pass it to the hql function (by appending strings)? Also, the hql function is deprecated; you should use sql. http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext On Wed, Mar 4, 2015 at 6:15 AM, Anusha Shamanur
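That is, something along these lines; the search term and table name come from the thread, and the variable names are illustrative:

    // Build the query as a plain string and run it through sql() instead of hql().
    val searchTerm = "Restaurant"
    val query = s"SELECT * FROM TableName WHERE value LIKE '%$searchTerm%'"
    val results = hiveContext.sql(query)
    results.collect().foreach(println)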

Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Akhil Das
Looks like you are having 2 netty jars in the classpath. Thanks Best Regards On Wed, Mar 4, 2015 at 5:14 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: From the lines pointed in the exception log, I figured out that my code is unable to get the spark context. To isolate

Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Arush Kharbanda
You can try increasing the Akka timeout in the config; you can set the following: spark.core.connection.ack.wait.timeout: 600, spark.akka.timeout: 1000 (in secs), spark.akka.frameSize: 50 On Wed, Mar 4, 2015 at 5:14 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote:

Re: Speed Benchmark

2015-03-04 Thread Guillaume Guy
Sorry for the confusion. All are running Hadoop services. Node 1 is the namenode whereas Nodes 2 and 3 are datanodes. Best, Guillaume Guy * +1 919 - 972 - 8750* On Sat, Feb 28, 2015 at 1:09 AM, Sean Owen so...@cloudera.com wrote: Is machine 1 the only one running an HDFS data node? You