Re: Running Spark in Yarn-client mode

2015-10-08 Thread Sushrut Ikhar
Hey Jean, thanks for the quick response. I am using Spark 1.4.1 pre-built with Hadoop 2.6. Yes, the YARN cluster has multiple running worker nodes. It would be a great help if you could tell me how to look at the executor logs. Regards, Sushrut Ikhar (about.me/sushrutikhar)

Build Failure

2015-10-08 Thread shahid qadri
Hi, I tried to build the latest master branch of Spark with build/mvn -DskipTests clean package. Reactor Summary: [INFO] Spark Project Parent POM ... SUCCESS [03:46 min] [INFO] Spark Project Test Tags SUCCESS [01:02 min] [INFO] Spark Project

Re: Build Failure

2015-10-08 Thread Jean-Baptiste Onofré
Hi, I just tried and it works for me (I don't have any Maven mirror on my subnet). Can you try again? Maybe it was a temporary issue accessing Maven Central. The artifact is present on central: http://repo1.maven.org/maven2/com/twitter/algebird-core_2.10/0.9.0/ Regards JB On

Spark Ganglia: ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink

2015-10-08 Thread gtanguy
I built Spark with Ganglia: $SPARK_HOME/build/sbt -Pspark-ganglia-lgpl -Phadoop-1 -Phive -Phive-thriftserver assembly ... [info] Including from cache: metrics-ganglia-3.1.0.jar ... In the master log: ERROR actor.OneForOneStrategy: org.apache.spark.metrics.sink.GangliaSink

Spark 1.5.1 standalone cluster - wrong Akka remoting config?

2015-10-08 Thread Barak Yaish
Taking my first steps with Spark, I'm facing problems submitting jobs to the cluster from application code. Digging through the logs, I noticed some periodic WARN messages in the master log: 15/10/08 13:00:00 WARN remote.ReliableDeliverySupervisor: Association with remote system

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Aniket Bhatnagar
Ohh I see. You would have to add an underscore after ProbabilityCalculator.updateCountsOfProcessGivenRole. Try: dstream.map(x => (x.keyWithTime, x)) .updateStateByKey(ProbabilityCalculator.updateCountsOfProcessGivenRole _, new HashPartitioner(3), initialProcessGivenRoleRdd) Here is an example: def
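
A minimal sketch of that suggestion (hypothetical types and update function, not the poster's actual code): the trailing underscore eta-expands the method into a function value so the three-argument overload of updateStateByKey resolves.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    // Stand-in update function: counts occurrences per key.
    def updateCounts(newValues: Seq[Int], state: Option[Long]): Option[Long] =
      Some(state.getOrElse(0L) + newValues.size)

    def withInitialState(ssc: StreamingContext, dstream: DStream[(String, Int)]) = {
      val initialStateRdd = ssc.sparkContext.parallelize(Seq(("seed-key", 0L)))
      // Note the underscore after updateCounts.
      dstream.updateStateByKey(updateCounts _, new HashPartitioner(3), initialStateRdd)
    }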

How can I read a file from HDFS in SparkR from RStudio

2015-10-08 Thread Amit Behera
Hi All, I am very new to SparkR. I am able to run sample code from the example given in this link: http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/ Then I tried to read a file from HDFS in RStudio, but was unable to read it. Below is my code.

Re: Is coalesce smart while merging partitions?

2015-10-08 Thread Iulian Dragoș
It's smart. Have a look at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L123 On Thu, Oct 8, 2015 at 4:00 AM, Cesar Flores wrote: > It is my understanding that the default behavior of coalesce function when > the user

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
Aniket, Thank you for the example - but that's not quite what I'm looking for. I've got a call to updateStateByKey that looks like the following: dstream.map(x => (x.keyWithTime, x)) .updateStateByKey(ProbabilityCalculator.updateCountsOfProcessGivenRole) def updateCountsOfProcessGivenRole(a :

Spark 1.5.1 standalone cluster - wrong Akka remoting config?

2015-10-08 Thread baraky
Taking my first steps with Spark, I'm facing problems submitting jobs to the cluster from application code. Digging through the logs, I noticed some periodic WARN messages in the master log: 15/10/08 13:00:00 WARN remote.ReliableDeliverySupervisor: Association with remote system

Re: Spark Streaming: Doing operation in Receiver vs RDD

2015-10-08 Thread Iulian Dragoș
You can have a look at http://spark.apache.org/docs/latest/streaming-programming-guide.html#receiver-reliability for details on Receiver reliability. If you go the receiver way you'll need to enable Write Ahead Logs to ensure no data loss. In Kafka direct you don't have this problem. Regarding
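
A small hedged sketch of that suggestion; the setting below is the standard receiver WAL switch (not quoted from the thread), and since WAL data goes under the checkpoint directory, checkpointing must be configured too:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("receiver-with-wal")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  // placeholder path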

Re: Optimal way to avoid processing null returns in Spark Scala

2015-10-08 Thread Iulian Dragoș
On Wed, Oct 7, 2015 at 6:42 PM, swetha wrote: Hi, > > I have the following functions that I am using for my job in Scala. If you > see the getSessionId function I am returning null sometimes. If I return > null the only way that I can avoid processing those records is

Best practices to clean up RDDs for old applications

2015-10-08 Thread Jens Rantil
Hi, I have a couple of old application RDDs under /var/lib/spark/rdd that haven't been properly cleaned up after themselves. Example: # du -shx /var/lib/spark/rdd/* 44K /var/lib/spark/rdd/liblz4-java1011984124691611873.so 48K /var/lib/spark/rdd/snappy-1.0.5-libsnappyjava.so 2.3G

Re: Spark 1.5.1 standalone cluster - wrong Akka remoting config?

2015-10-08 Thread michal.klo...@gmail.com
Try setting spark.driver.host to the actual IP or hostname of the box submitting the work. More info in the networking section at this link: http://spark.apache.org/docs/latest/configuration.html Also check the spark config for your application for these driver settings in the application web UI
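
For illustration, a sketch of how those driver settings can be pinned from application code (the IP, port, and master URL below are placeholders, not values from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("my-app")
      .set("spark.driver.host", "10.0.0.5")   // address executors use to reach the driver
      .set("spark.driver.port", "51000")      // optional: fix the port, e.g. for firewalls
    val sc = new SparkContext(conf)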

Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Theodore Vasiloudis
Hello, I was wondering if there is an easy way to launch EC2 instances which have Spark built for Scala 2.11. The only way I can think of is to prepare the sources for 2.11 as shown in the Spark build instructions ( http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211),

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
Honestly, that's what I already did - I am working to test it now. It looked like 'add an underscore' was ignoring some implicit argument that I was failing to provide. On Thu, Oct 8, 2015 at 8:34 AM, Aniket Bhatnagar wrote: > Ohh I see. You could have to add

RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi all, would this be a bug? val ws = Window.partitionBy("clrty_id").orderBy("filemonth_dtt") val nm = "repeatMe" df.select(df.col("*"), rowNumber().over(ws).cast("int").as(nm))
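
A self-contained sketch of the snippet above with the imports it needs (toy data is made up; in 1.5, window functions require a HiveContext, and sc is assumed from the shell):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rowNumber

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Toy DataFrame standing in for the poster's data.
    val df = Seq(("a", "2015-01"), ("a", "2015-02"), ("b", "2015-01"))
      .toDF("clrty_id", "filemonth_dtt")

    val ws = Window.partitionBy("clrty_id").orderBy("filemonth_dtt")
    df.select(df.col("*"), rowNumber().over(ws).cast("int").as("repeatMe")).show()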

Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Maheshakya Wijewardena
Hi, Suppose there is a data frame called goods with columns "barcode" and "items". Some of the values in the column "items" can be null. I want to get the barcode and the respective items from the table adhering to the following rules: - If "items" is null -> output 0 - else -> output "items" (
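
One way to express that rule with column functions, as a sketch (not the poster's actual query; it assumes the goods DataFrame described above):

    import org.apache.spark.sql.functions.{coalesce, col, lit, when}

    // Either form: emit 0 when "items" is null, otherwise the value itself.
    val viaWhen     = goods.select(col("barcode"), when(col("items").isNull, 0).otherwise(col("items")).as("items"))
    val viaCoalesce = goods.select(col("barcode"), coalesce(col("items"), lit(0)).as("items"))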

Re: Is coalesce smart while merging partitions?

2015-10-08 Thread Daniel Darabos
> For example does spark try to merge the small partitions first or the election of partitions to merge is random? It is quite smart as Iulian has pointed out. But it does not try to merge small partitions first. Spark doesn't know the size of partitions. (The partitions are represented as

Re: does KafkaCluster can be public ?

2015-10-08 Thread Cody Koeninger
If anyone is interested in keeping tabs on it, the jira for this is https://issues.apache.org/jira/browse/SPARK-10963 On Wed, Oct 7, 2015 at 3:16 AM, Erwan ALLAIN wrote: > Thanks guys ! > > On Wed, Oct 7, 2015 at 1:41 AM, Cody Koeninger wrote: > >>

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Aniket Bhatnagar
Is it possible for you to use EMR instead of EC2? If so, you may be able to tweak EMR bootstrap scripts to install your custom spark build. Thanks, Aniket On Thu, Oct 8, 2015 at 5:58 PM Theodore Vasiloudis < theodoros.vasilou...@gmail.com> wrote: > Hello, > > I was wondering if there is an easy

Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Tracewski, Lukasz
Hi, Many people interpret this slide from Databricks https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png as an indication that the DataFrames API is going to be the main processing unit of Spark and the sole access point to MLlib, Streaming and such. Is it true? My impression was that

JDBC thrift server

2015-10-08 Thread Younes Naguib
Hi, We've been using the JDBC thrift server for a couple of weeks now and running queries on it like a regular RDBMS. We're about to deploy it in a shared production cluster. Any advice or warnings on such a setup? YARN or Mesos? How about dynamic resource allocation in an already running thrift

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-08 Thread Sun, Rui
Can you extract the spark-submit command from the console output, run it in the shell, and see if there is any error message? From: Khandeshi, Ami [mailto:ami.khande...@fmr.com] Sent: Wednesday, October 7, 2015 9:57 PM To: Sun, Rui; Hossein Cc: akhandeshi; user@spark.apache.org Subject: RE:

Re: Parquet file size

2015-10-08 Thread Cheng Lian
How many tasks are there in the write job? Since each task may write one file for each partition, you may end up with taskNum * 31 files. Increasing SPLIT_MINSIZE does help reduce the task number. Another way to address this issue is to use DataFrame.coalesce(n) to shrink the task number to n
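
A sketch of that suggestion (the path, partition column, and n are placeholders): shrink the number of write tasks so the job produces roughly n output files per partition directory.

    df.coalesce(4)
      .write
      .partitionBy("date")            // hypothetical partition column
      .parquet("hdfs:///output/path") // placeholder path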

How to register a UDF with Any or a generic type in Spark

2015-10-08 Thread dugasani jcreddy
Hi, I have a requirement to use a UDF whose return type is not known, in Spark DataFrame SQL. The function takes either String or Boolean data types and returns 1 or 0 based on whether the input argument is True or False. If the input string is in the form of an Integer or Long

Re: sql query orc slow

2015-10-08 Thread Zhan Zhang
Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee wrote: > Hi, > > I am using spark sql 1.5 to query a hive table stored as partitioned orc > file. We have the total files is about
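
For reference, a hedged sketch of turning on ORC predicate pushdown (the setting name is the standard Spark SQL one rather than a quote from the thread, and it defaults to false in 1.5; the table and filter are placeholders):

    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
    val rows = hiveContext.sql("SELECT * FROM partitioned_orc_table WHERE event_date = '2015-10-08'")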

Error executing alternating least squares

2015-10-08 Thread haridass saisriram
Hi, I downloaded Spark 1.5.0 on Windows 7 and built it using build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package, and tried running the alternating least squares example ( http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html ) using spark-shell
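
For context, a condensed sketch of the collaborative-filtering example referenced above (the data file path is the one used in the linked docs; the parameters are illustrative, not the poster's):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("data/mllib/als/test.data").map { line =>
      val Array(user, item, rate) = line.split(',')
      Rating(user.toInt, item.toInt, rate.toDouble)
    }
    // Arguments: ratings, rank = 10, iterations = 10, lambda = 0.01
    val model = ALS.train(ratings, 10, 10, 0.01)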

Re: JDBC thrift server

2015-10-08 Thread Sathish Kumaran Vairavelu
Which version of Spark are you using? You might encounter SPARK-6882 if Kerberos is enabled. -Sathish On Thu, Oct 8, 2015 at 10:46 AM Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > We’ve been using the JDBC thrift server

Re: Using Spark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
You can't do nested operations on RDDs or DataFrames (i.e. you can't create a DataFrame from within a map function). Perhaps if you explain what you are trying to accomplish someone can suggest another way. On Thu, Oct 8, 2015 at 10:10 AM, Afshartous, Nick wrote: > >

Re: Default size of a datatype in SparkSQL

2015-10-08 Thread Michael Armbrust
It's purely for estimation, when guessing whether it's safe to do a broadcast join. We picked a random number that we thought was larger than the common case (it's better to overestimate to avoid OOM). On Wed, Oct 7, 2015 at 10:11 PM, vivek bhaskar wrote: > I want to understand
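
The related knob, for anyone tuning this (a standard Spark SQL setting, not quoted from the thread): the estimated size is compared against spark.sql.autoBroadcastJoinThreshold when deciding whether to broadcast one side of a join.

    // Raise the threshold to 50 MB (default is 10 MB); -1 disables broadcast joins.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)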

Applicative logs on Yarn

2015-10-08 Thread nibiau
Hello, I submit Spark Streaming inside YARN, and I have configured YARN to generate custom logs. It works fine and YARN aggregates the logs into HDFS very well; nevertheless, the log files are only usable via the "yarn logs" command. I would prefer to be able to navigate them via hdfs commands like a

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Which version of Spark? On Thu, Oct 8, 2015 at 7:25 AM, wrote: > Hi all, would this be a bug?? > > val ws = Window. > partitionBy("clrty_id"). > orderBy("filemonth_dtt") > > val nm = "repeatMe" >

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Jerry Lam
I just read the article by ogirardot but I don't agree. It is like saying the pandas DataFrame is the sole data structure for analyzing data in Python. Can the pandas DataFrame replace the NumPy array? The answer is simply no, from an efficiency perspective, for some computations. Unless there is a computer

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Michael Armbrust
Hmm, that looks like it should work to me. What version of Spark? What is the schema of goods? On Thu, Oct 8, 2015 at 6:13 AM, Maheshakya Wijewardena wrote: > Hi, > > Suppose there is data frame called goods with columns "barcode" and > "items". Some of the values in the

RE: Using Spark SQL mapping over an RDD

2015-10-08 Thread Afshartous, Nick
> You can't do nested operations on RDDs or DataFrames (i.e. you can't create a > DataFrame from within a map function). Perhaps if you explain what you are > trying to accomplish someone can suggest another way. The code below shows what I had in mind. For each Id, I'd like to run a query using

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi, thanks for looking into it. v1.5.1. I am really worried. I don't have Hive/Hadoop for real in the environment. Saif From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, October 08, 2015 2:57 PM To: Ellafi, Saif A. Cc: user Subject: Re: RowNumber in HiveContext returns null or

Re: Using Spark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
You are probably looking for a groupby instead: sqlContext.sql("SELECT COUNT(*) FROM ad_info GROUP BY deviceId") On Thu, Oct 8, 2015 at 10:27 AM, Afshartous, Nick wrote: > > > You can't do nested operations on RDDs or DataFrames (i.e. you can't > create a DataFrame
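
The DataFrame API equivalent of that suggestion, as a sketch (table and column names follow the thread):

    val counts = sqlContext.table("ad_info").groupBy("deviceId").count()
    counts.show()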

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Michael Armbrust
Don't worry, the ability to work with domain objects and lambda functions is not going to go away. However, we are looking at ways to leverage Tungsten's improved performance when processing structured data. More details can be found here: https://issues.apache.org/jira/browse/SPARK- On

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
It turns out this does not happen in local[32] mode. It only happens when submitting to a standalone cluster. Don't have YARN/Mesos to compare. Will keep diagnosing. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October 08, 2015 3:01 PM To:

Re: Applicative logs on Yarn

2015-10-08 Thread Ted Yu
This question seems better suited for u...@hadoop.apache.org FYI On Thu, Oct 8, 2015 at 10:37 AM, wrote: > Hello, > I submit spark streaming inside Yarn, I have configured yarn to generate > custom logs. > It works fine and yarn aggregate very well the logs inside HDFS, >

Re: Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
Forgot to mention that my insert is a multi-table insert: sqlContext2.sql("""from avro_events lateral view explode(usChnlList) usParamLine as usParamLine lateral view explode(dsChnlList) dsParamLine as dsParamLine insert into table UpStreamParam

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI that it has only 12 partitions because of 12 HDFS blocks, and the Hive ORC file stripe size is 33554432. On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang wrote: > The partition number should be the same as the HDFS

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
Hmm, that’s odd. You can always use repartition(n) to increase the partition number, but then there will be shuffle. How large is your ORC file? Have you used NameNode UI to check how many HDFS blocks each ORC file has? Lan > On Oct 8, 2015, at 2:08 PM, Umesh Kacha

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
Hi Lan, thanks for the reply. I have tried the following but it did not increase the partitions: DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/").repartition(100); Yes, I have checked in the NameNode UI; the ORC directory contains 12 files/blocks of 128 MB each and ORC files

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread Tian Zhang
I hit this issue with a Spark 1.3.0 stateful application (using the updateStateByKey function) on Mesos. It fails after running fine for about 24 hours. The error stack trace is below. I checked ulimit -n and we have very large numbers set on the machines. What else can be wrong? 15/09/27 18:45:11

Re: Streaming DirectKafka assertion errors

2015-10-08 Thread Cody Koeninger
It sounds like you moved the job from one environment to another? This may sound silly, but make sure (eg using lsof) the brokers the job is connecting to are actually the ones you expect. As far as the checkpoint goes, the log output should indicate whether the job is restoring from checkpoint.

Re: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Ted Yu
To fix the problem, consider increasing number of partitions for your job. Showing code snippet would help us understand your use case better. Cheers On Thu, Oct 8, 2015 at 1:39 PM, Ted Yu wrote: > See the comment of FramedSerializer() in serializers.py : > >

Unexplained sleep time

2015-10-08 Thread yael aharon
Hello, I am working on improving the performance of our Spark on Yarn applications. Scanning through the logs I found the following lines: [2015-10-07T16:25:17.245-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:yarn] Started progress reporter thread - sleep

Re: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Ted Yu
See the comment of FramedSerializer() in serializers.py : Serializer that writes objects as a stream of (length, data) pairs, where C{length} is a 32-bit integer and data is C{length} bytes. Hence the limit on the size of object. On Thu, Oct 8, 2015 at 12:56 PM, XIANDI

unsubscribe

2015-10-08 Thread Jürgen Fey

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread DB Tsai
Try running the following to see the actual ulimit. We found that Mesos overrides the ulimit, which causes the issue.

    import sys.process._
    val p = 1 to 100
    val rdd = sc.parallelize(p, 100)
    val a = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect

Sincerely, DB Tsai

FW: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Xiandi Zhang
    def parse_record(x):
        formatted = list(x[0])
        if (type(x[1]) != list) & (type(x[1]) != tuple):
            formatted.append(x[1])
        else:

Re: unsubscribe

2015-10-08 Thread Ted Yu
Take a look at the first section of: http://spark.apache.org/community On Thu, Oct 8, 2015 at 2:10 PM, Jürgen Fey wrote: > >

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Ted Yu
bq. contains 12 files/blocks Looks like you hit the limit of parallelism these files can provide. If you have larger dataset, you would have more partitions. On Thu, Oct 8, 2015 at 12:21 PM, Umesh Kacha wrote: > Hi Lan thanks for the reply. I have tried to do the

[Spark 1.5] Kinesis receivers not starting

2015-10-08 Thread Bharath Mukkati
Hi Spark Users, I am testing my application on Spark 1.5 and kinesis-asl-1.5. The streaming application starts but I see a ton of stages scheduled for ReceiverTracker (submitJob at ReceiverTracker.scala:557). In the driver logs I see this sequence repeat: 15/10/09 00:10:54 INFO

error in sparkSQL 1.5 using count(1) in nested queries

2015-10-08 Thread Jeff Thompson
After upgrading from 1.4.1 to 1.5.1 I found some of my spark SQL queries no longer worked. Seems to be related to using count(1) or count(*) in a nested query. I can reproduce the issue in a pyspark shell with the sample code below. The ‘people’ table is from spark-1.5.1-bin-hadoop2.4/

Re: Spark Streaming: Doing operation in Receiver vs RDD

2015-10-08 Thread Tathagata Das
Since it is about encryption and decryption, it's also good to know how the raw data is actually saved on disk. If the write ahead log is enabled, then the raw data will be saved to the WAL in HDFS. You probably do not want to save decrypted data in that. So it's better not to decrypt in the receiver,

RE: Insert via HiveContext is slow

2015-10-08 Thread Cheng, Hao
I think that's a known performance issue (compared to Hive) of Spark SQL in multi-inserts. A workaround is to create a temp cached table for the projection first, and then do the multiple inserts based on the cached table. We are actually working on a POC of some similar cases; hopefully it comes
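
A sketch of that workaround (table and column names are placeholders, not the poster's schema): materialize and cache the shared projection once, then run each insert against the cached table.

    val projected = hiveContext.sql("SELECT key, value FROM avro_events")
    projected.registerTempTable("events_projected")
    hiveContext.cacheTable("events_projected")

    hiveContext.sql("INSERT INTO TABLE upstream_param SELECT key, value FROM events_projected WHERE key = 'us'")
    hiveContext.sql("INSERT INTO TABLE downstream_param SELECT key, value FROM events_projected WHERE key = 'ds'")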

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Maheshakya Wijewardena
Spark version: 1.4.1 The schema is "barcode STRING, items INT" On Thu, Oct 8, 2015 at 10:48 PM, Michael Armbrust wrote: > Hmm, that looks like it should work to me. What version of Spark? What > is the schema of goods? > > On Thu, Oct 8, 2015 at 6:13 AM, Maheshakya

Re: Streaming DirectKafka assertion errors

2015-10-08 Thread Roman Garcia
Thanks Cody for your help. Actually I found out it was an issue on our network. After doing a ping from the Spark node to the Kafka node I found there were duplicate packets. After rebooting the Kafka node everything went back to normal! Thanks for your help! Roman On Thu, October 8, 2015 at 17:13, Cody

Re: [Spark 1.5] Kinesis receivers not starting

2015-10-08 Thread Tathagata Das
How many executors and cores do you acquire? td On Thu, Oct 8, 2015 at 6:11 PM, Bharath Mukkati wrote: > Hi Spark Users, > > I am testing my application on Spark 1.5 and kinesis-asl-1.5. The > streaming application starts but I see a ton of stages scheduled for >

Re: Re: Error in load hbase on spark

2015-10-08 Thread Ted Yu
The second code snippet is similar to: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala See the comment in HBaseTest.scala: // please ensure HBASE_CONF_DIR is on classpath of spark driver // e.g: set it through spark.driver.extraClassPath property // in

Architecture for a Spark batch job.

2015-10-08 Thread Renato Perini
I have started a project using Spark 1.5.1 consisting of several jobs I launch (manually, for now) using shell scripts against a small Spark standalone cluster. Those jobs generally read a Cassandra table (using an RDD of type JavaRDD or using plain DataFrames), compute results on that data and

Error in load hbase on spark

2015-10-08 Thread Roy Wang
I want to load an HBase table into Spark. JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class); When calling hBaseRDD.count(), I got an error: Caused by: java.lang.IllegalStateException: The input format

RE: How can I read a file from HDFS in SparkR from RStudio

2015-10-08 Thread Sun, Rui
Amit, sqlContext <- sparkRSQL.init(sc) peopleDF <- read.df(sqlContext, "hdfs://master:9000/sears/example.csv") Have you restarted the R session in RStudio between the two lines? From: Amit Behera [mailto:amit.bd...@gmail.com] Sent: Thursday, October 8, 2015 5:59 PM To: user@spark.apache.org

Re: Error in load hbase on spark

2015-10-08 Thread Ted Yu
One possibility is that the HBase config, including hbase.zookeeper.quorum, was not passed to your job. hbase-site.xml should be on the classpath. Can you show a snippet of your code? Looks like you were running against HBase 1.x Cheers On Thu, Oct 8, 2015 at 7:29 PM, Roy Wang

weird issue with sqlContext.createDataFrame - pyspark 1.3.1

2015-10-08 Thread ping yan
I really cannot figure out what this is about.. (tried to import pandas, in case that is a dependency, but it didn't help.)

    >>> from pyspark.sql import SQLContext
    >>> sqlContext = SQLContext(sc)
    >>> sqlContext.createDataFrame(l).collect()
    Traceback (most recent call last):
      File "", line 1, in

Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
Hi, I'm inserting into a partitioned ORC table using an insert sql statement passed via HiveContext. The performance I'm getting is pretty bad and I was wondering if there are ways to speed things up. Would saving the DF like this
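
One alternative along the lines hinted at above, as a hedged sketch (the partition column and path are placeholders): write the DataFrame directly as partitioned ORC instead of going through an INSERT statement.

    df.write
      .format("orc")
      .partitionBy("event_date")  // hypothetical partition column
      .mode("append")
      .save("/apps/hive/warehouse/mydb.db/upstreamparam")  // placeholder location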

How to increase Spark partitions for the DataFrame?

2015-10-08 Thread unk1102
Hi, I have the following code where I read ORC files from HDFS; it loads a directory which contains 12 ORC files. Since the HDFS directory contains 12 files, it will create 12 partitions by default. This directory is huge, and when the ORC files get decompressed it becomes around 10 GB. How do I

failed spark job reports on YARN as successful

2015-10-08 Thread Lan Jiang
Hi there, I have a Spark batch job running on CDH 5.4 + Spark 1.3.0. The job is submitted in “yarn-client” mode. The job itself failed because YARN killed several executor containers that exceeded the memory limit imposed by YARN. However, when I went to the YARN resource manager

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
The partition number should be the same as the HDFS block number instead of the file number. Did you confirm from the Spark UI that only 12 partitions were created? What is your ORC orc.stripe.size? Lan > On Oct 8, 2015, at 1:13 PM, unk1102 wrote: > > Hi I have the

RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Setting repartition and default parallelism to 1, in cluster mode, is still broken. So the problem is not the parallelism but the cluster mode itself. Something is wrong with HiveContext + cluster mode. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Thursday, October

Why does dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hang for a long time?

2015-10-08 Thread unk1102
Hi, as recommended I am caching my Spark job DataFrame with dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER), but what I see in the Spark job UI is that this persist stage runs for a very long time, showing 10 GB of shuffle read and 5 GB of shuffle write. It takes too long to finish, and because of that sometimes my

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Can you open a JIRA? On Thu, Oct 8, 2015 at 11:24 AM, wrote: > Repartition and default parallelism to 1, in cluster mode, is still > *broken*. > > > > So the problem is not the parallelism, but the cluster mode itself. > Something wrong with HiveContext + cluster

Re: Spark REST Job server feedback?

2015-10-08 Thread Tim Smith
I am curious too - any comparison between the two? Looks like one is Datastax-sponsored and the other is Cloudera's. Other than that, any major/core differences in design/approach? Thanks, Tim On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal wrote: > Anyone has

Streaming DirectKafka assertion errors

2015-10-08 Thread Roman Garcia
I'm running Spark Streaming using the Kafka direct stream, expecting exactly-once semantics using checkpoints (which are stored on HDFS). My job is really simple: it opens 6 Kafka streams (6 topics with 4 partitions each) and stores every row to Elasticsearch using the ES-Spark integration. This job was

ValueError: can not serialize object larger than 2G

2015-10-08 Thread XIANDI
    File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
        process()
    File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
    File "/home/hadoop/spark/python/pyspark/serializers.py", line 126, in