Re: Huge shuffle data size

2015-10-23 Thread pratik khadloya
Sorry, I sent the wrong join code snippet; the actual snippet is aggImpsDf.join( aggRevenueDf, aggImpsDf("id_1") <=> aggRevenueDf("id_1") && aggImpsDf("id_2") <=> aggRevenueDf("id_2") && aggImpsDf("day_hour") <=> aggRevenueDf("day_hour") && aggImpsDf("day_hour_2") <=>
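
For illustration, a minimal self-contained sketch of that null-safe join (the DataFrame names and first three key columns come from the snippet above; the truncated day_hour_2 condition is omitted, and <=> is Spark SQL's null-safe equality, which treats NULL <=> NULL as true, unlike ===):

    // Sketch only: aggImpsDf and aggRevenueDf stand in for the poster's DataFrames.
    val joined = aggImpsDf.join(
      aggRevenueDf,
      aggImpsDf("id_1")     <=> aggRevenueDf("id_1") &&
      aggImpsDf("id_2")     <=> aggRevenueDf("id_2") &&
      aggImpsDf("day_hour") <=> aggRevenueDf("day_hour"))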

Spark error:- Not a valid DFS File name

2015-10-23 Thread kali.tumm...@gmail.com
Hi All, I got this weird error when I tried to run Spark in YARN-cluster mode. I have 33 files and I am looping Spark in bash over them one by one; most of them worked OK except a few files. Is the error below an HDFS or a Spark error? Exception in thread "Driver" java.lang.IllegalArgumentException: Pathname

Re: Spark error:- Not a valid DFS File name

2015-10-23 Thread pratik khadloya
Check what you have at SimpleMktDataFlow.scala:106 ~Pratik On Fri, Oct 23, 2015 at 11:47 AM kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > Full Error:- > at > > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195) > at > >

Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
Oh, no wonder... it undoes the glob (I was reading from /some/path/*), creates a hadoopRdd for every path, and then creates a union of them using UnionRDD. That's not what I want... no need to do a union. AvroInputFormat already has the ability to handle globs (or multiple paths comma separated) very
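
As an illustration of that point, a hedged sketch of reading the whole glob as a single Hadoop RDD (assuming avro-mapred/avro-mapreduce on the classpath; this is not the spark-avro code path the thread is about, it only shows that the input format expands globs itself):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    // One input RDD over the whole glob, instead of one RDD per path + UnionRDD.
    val records = sc.newAPIHadoopFile(
      "/some/path/*",
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])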

Re: Spark error:- Not a valid DFS File name

2015-10-23 Thread pratik khadloya
I faced a similar issue. The actual problem was not in the file name. We run Spark on Yarn. The actual problem was seen in the logs by running the command: $ yarn logs -applicationId Scroll from the beginning to know the actual error. ~Pratik On Fri, Oct 23, 2015 at 11:40 AM

Strange problem of SparkLauncher

2015-10-23 Thread ??????
I need to run Spark jobs as a service in my project, so there is a "ServiceManager" in it and it uses SparkLauncher (org.apache.spark.launcher.SparkLauncher) to submit Spark jobs. First, I tried to write a demo, putting only the SparkLauncher code in main and running it with java -jar; it's

Re: java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Yifan LI
Thanks for your advice, Jem. :) I will increase the partitioning and see if it helps. Best, Yifan LI > On 23 Oct 2015, at 12:48, Jem Tucker wrote: > > Hi Yifan, > > I think this is a result of Kryo trying to serialize something too large. > Have you tried to

Re: [Spark Streaming] How do we reset the updateStateByKey values.

2015-10-23 Thread Uthayan Suthakar
Hi Sander, Thank you for your very informative email. From your email, I've learned quite a bit. >>>Is the condition determined somehow from the data coming through streamLogs, and is newData streamLogs again (rather than a whole data source?) No, they are two different streams. I have two

NoSuchMethodException : com.google.common.io.ByteStreams.limit

2015-10-23 Thread jinhong lu
Hi, I run Spark to write data to HBase, but found a NoSuchMethodError: 15/10/23 18:45:21 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, dn18-formal.i.nease.net): java.lang.NoSuchMethodError: com.google.common.io.ByteStreams.limit(Ljava/io/InputStream;J)Ljava/io/InputStream; I found

How to set memory for SparkR with master="local[*]"

2015-10-23 Thread Matej Holec
Hello! How to adjust the memory settings properly for SparkR with master="local[*]" in R? When running from R, SparkR doesn't accept memory settings :( I use the following commands: R> library(SparkR) R> sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory =

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-23 Thread gaurav sharma
Hi, I created 2 workers on the same machine, each with 4 cores and 6GB RAM. I submitted the first job, and it allocated 2 cores on each of the worker processes and utilized the full 4 GB RAM for each executor process. When I submit my second job it always stays in WAITING state. Cheers!! On Tue, Oct 20,

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread morfious902002
I have a spark job that creates 6 million rows in RDDs. I convert the RDD into Data-frame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS. I am using spark 1.5.1 with YARN. Here is the snippet:- RDDList.parallelStream().forEach(mapJavaRDD -> { if

Re: How to close connection in mapPartitions?

2015-10-23 Thread Sujit Pal
Hi Bin, Very likely the RedisClientPool is being closed too quickly before map has a chance to get to it. One way to verify would be to comment out the .close line and see what happens. FWIW I saw a similar problem writing to Solr where I put a commit where you have a close, and noticed that the

Re: Running 2 spark application in parallel

2015-10-23 Thread Debasish Das
You can run 2 threads in the driver and Spark will FIFO-schedule the 2 jobs on the same SparkContext you created (executors and cores)... the same idea is used for the Spark SQL Thrift Server flow... For streaming I think it lets you run only one stream at a time even if you run them on multiple threads on
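
A minimal sketch of the two-threads-on-one-context idea (rddA, rddB and the output path are hypothetical; jobs are FIFO-scheduled by default, or use fair pools via spark.scheduler.mode=FAIR):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Two actions submitted concurrently against the same SparkContext.
    val job1 = Future { rddA.count() }
    val job2 = Future { rddB.saveAsTextFile("/tmp/out") }
    Await.result(job1, Duration.Inf)
    Await.result(job2, Duration.Inf)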

Re: How to set memory for SparkR with master="local[*]"

2015-10-23 Thread Dirceu Semighini Filho
Hi Matej, I'm also using this and I'm seeing the same behavior here; my driver has only 530 MB, which is the default value. Maybe this is a bug. 2015-10-23 9:43 GMT-02:00 Matej Holec : > Hello! > > How to adjust the memory settings properly for SparkR with > master="local[*]" >

Re: Maven build failed (Spark master)

2015-10-23 Thread Sean Owen
This doesn't show the actual error output from Maven. I have a strong guess that you haven't set MAVEN_OPTS to increase the memory Maven can use. On Fri, Oct 23, 2015 at 6:14 AM, Kayode Odeyemi wrote: > Hi, > > I can't seem to get a successful maven build. Please see command

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread Jonathan Coveney
do you have JAVA_HOME set to a java 7 jdk? 2015-10-23 7:12 GMT-04:00 emlyn : > xjlin0 wrote > > I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with > > or without Hadoop or home compiled with ant or maven). There was no > error > > message in v1.4.x,

Re: I don't understand what this sentence means."7.1 GB of 7 GB physical memory used"

2015-10-23 Thread Sean Owen
Spark asked YARN to let an executor use 7GB of memory, but it used more, so it was killed. In each case you see that the executor memory plus overhead equals the YARN allocation requested. What's the issue with that? On Fri, Oct 23, 2015 at 6:46 AM, JoneZhang wrote: > Here is
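
For reference, the arithmetic behind that number, sketched under the assumption of Spark 1.x defaults (spark.yarn.executor.memoryOverhead = max(384 MB, 10% of executor memory), with YARN rounding requests up to its allocation increment):

    // Why spark.executor.memory = 6G turns into a 7 GB YARN allocation:
    val executorMemoryMb = 6 * 1024                                  // 6144 MB
    val overheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)  // 614 MB
    val requestMb = executorMemoryMb + overheadMb                    // 6758 MB, rounded up to 7 GB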

Re: Unable to build Spark 1.5, is build broken or can anyone successfully build?

2015-10-23 Thread Robineast
Both Spark 1.5 and 1.5.1 are released so it certainly shouldn't be a problem - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action -- View this message in context:

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Akhil Das
If you can reproduce it, then I think you can open a JIRA for this. Thanks Best Regards On Fri, Oct 23, 2015 at 1:37 PM, Eugen Cepoi wrote: > When fixing the port to the same values as in the stack trace it works > too. The network config of the slaves seems correct. > >

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread Emlyn Corrin
JAVA_HOME is unset. I've also tried setting it with: export JAVA_HOME=$(/usr/libexec/java_home) which sets it to "/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home" and I still get the same problem. On 23 October 2015 at 14:37, Jonathan Coveney wrote: > do you

Re: Spark issue running jar on Linux vs Windows

2015-10-23 Thread Michael Lewis
Thanks for the advice. In my case it turned out to be two issues. - use Java rather than Scala to launch the process, putting the core Scala libs on the class path. - I needed a merge strategy of Concat for reference.conf files in my build.sbt Regards, Mike > On 23 Oct 2015, at 01:00, Ted Yu
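
For reference, the reference.conf concat strategy typically looks like this in build.sbt (a sketch assuming the sbt-assembly plugin; older plugin versions use mergeStrategy instead of assemblyMergeStrategy):

    // Concatenate all reference.conf files into one instead of keeping only
    // the first, so no Typesafe Config defaults are lost in the fat jar.
    assemblyMergeStrategy in assembly := {
      case "reference.conf" => MergeStrategy.concat
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }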

I don't understand what this sentence means."7.1 GB of 7 GB physical memory used"

2015-10-23 Thread JoneZhang
Here is the Spark configuration and error log:

    spark.dynamicAllocation.enabled true
    spark.shuffle.service.enabled true
    spark.dynamicAllocation.minExecutors 10
    spark.executor.cores 1
    spark.executor.memory 6G

Maven build failed (Spark master)

2015-10-23 Thread Kayode Odeyemi
Hi, I can't seem to get a successful Maven build. Please see command output below:

    bash-3.2$ ./make-distribution.sh --name spark-latest --tgz --mvn mvn -Dhadoop.version=2.7.0 -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
    +++ dirname ./make-distribution.sh
    ++ cd .
    ++ pwd
    +

Running many small Spark jobs repeatedly

2015-10-23 Thread Stephan Kepser
Hi there, we have a set of relatively lightweight jobs that we would like to run repeatedly on our Spark cluster. The situation is as follows: we have a reliable source of data, a Cassandra database. One table contains time series data, which we would like to analyse. To do so we read a window

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread emlyn
xjlin0 wrote > I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with > or without Hadoop or home compiled with ant or maven). There was no error > message in v1.4.x, system prompt nothing. On v1.5.x, once I enter > $SPARK_HOME/bin/pyspark or spark-shell, I got > > Error:

java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Yifan LI
Hi, I have a big sorted RDD sRdd (~962 million elements), and need to scan its elements in order (using sRdd.toLocalIterator). But the process failed after around 893 million elements had been scanned, returning the following exception. Anyone have an idea? Thanks! Exception in thread

Re: Spark 1.5 on CDH 5.4.0

2015-10-23 Thread Deenar Toraskar
Sandy, the assembly jar does contain org.apache.spark.deploy.yarn.ExecutorLauncher. I am trying to find out how I can increase the logging level, so I know the exact classpath used by Yarn ContainerLaunch. Deenar On 23 October 2015 at 03:30, Sandy Ryza wrote: > Hi

How to get inverse Matrix / RDD or how to solve linear system of equations

2015-10-23 Thread Zhiliang Zhu
Hi Sujit, and all, I am currently stuck on a big difficulty and am eager to get some help from you. There is a big linear system of equations Ax = b, where A has N rows and N columns, N is very large, and b = [0, 0, ..., 0, 1]^T. I will then solve it to get x = [x1, x2, ..., xn]^T. The

Re: java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Jem Tucker
Hi Yifan, I think this is a result of Kryo trying to serialize something too large. Have you tried to increase your partitioning? Cheers, Jem On Fri, Oct 23, 2015 at 11:24 AM Yifan LI wrote: > Hi, > > I have a big sorted RDD sRdd (~962 million elements), and need to scan

Re: java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Todd Nist
Hi Yifan, You could also try increasing spark.kryoserializer.buffer.max.mb (64 MB by default): useful if your buffer needs to grow beyond 64 MB. Per the doc: Maximum allowable size of Kryo serialization buffer. This must be larger than any object
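
A sketch of setting it at context creation (Spark 1.4+ also accepts the suffixed form spark.kryoserializer.buffer.max, which supersedes the older .mb property; the 512m value is just an example):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "512m") // default 64m; must exceed the largest object
    val sc = new org.apache.spark.SparkContext(conf)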

Re: NoSuchMethodException : com.google.common.io.ByteStreams.limit

2015-10-23 Thread Steve Loughran
Just try dropping in that jar. Hadoop core ships with an out-of-date Guava JAR to avoid breaking old code downstream, but 2.7.x is designed to work with later versions too (i.e. it has moved off any of the now-removed methods). See https://issues.apache.org/jira/browse/HADOOP-10101 for the

Re: Best way to use Spark UDFs via Hive (Spark Thrift Server)

2015-10-23 Thread Deenar Toraskar
You can do the following. Start the spark-shell. Register the UDFs in the shell using sqlContext, then start the Thrift Server using startWithContext from the spark shell:
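
A minimal sketch of that flow from the spark-shell (the UDF name and body are hypothetical; in a Hive-enabled shell, sqlContext is actually a HiveContext, which startWithContext expects):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // Register a UDF on the shell's context, then expose that same context
    // (UDFs included) to JDBC/ODBC clients via the Thrift server.
    sqlContext.udf.register("myUpper", (s: String) => s.toUpperCase)
    HiveThriftServer2.startWithContext(sqlContext.asInstanceOf[HiveContext])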

Re: How to get inverse Matrix / RDD or how to solve linear system of equations

2015-10-23 Thread Sujit Pal
Hi Zhiliang, For a system of equations AX = y, Linear Regression will give you a best-fit estimate for A (coefficient vector) for a matrix of feature variables X and corresponding target variable y for a subset of your data. OTOH, what you are looking for here is to solve for x a system of
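
To make the distinction concrete: a direct solve only works when A fits on one machine. A driver-side sketch with Breeze (toy 2x2 values, purely illustrative; for the very large N in this thread a distributed method such as the SVD route discussed here is needed):

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Solve Ax = b exactly; Breeze's \ performs an LU-based linear solve.
    val A = DenseMatrix((2.0, 1.0),
                        (1.0, 3.0))
    val b = DenseVector(0.0, 1.0)
    val x = A \ b   // x satisfies A * x == b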

Stream are not serializable

2015-10-23 Thread crakjie
Hello. I have activated file checkpointing for a DStream to enable updateStateByKey. My unit test worked with no problem, but when I integrated this in my full stream I got this exception: java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams

Re: Spark 1.5 on CDH 5.4.0

2015-10-23 Thread Deenar Toraskar
I got this working. For others trying this: it turns out in Spark 1.3/CDH 5.4 spark.yarn.jar=local:/opt/cloudera/parcels/ I had changed this to reflect the 1.5.1 version of the Spark assembly jar, spark.yarn.jar=/opt/spark-1.5.1-bin/..., and this didn't work; I had to drop the "local:" prefix

Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
https://github.com/databricks/spark-avro/pull/95 On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote: > oh no wonder... it undoes the glob (i was reading from /some/path/*), > creates a hadoopRdd for every path, and then creates a union of them using > UnionRDD. > > thats

Re: Stream are not serializable

2015-10-23 Thread Ted Yu
Mind sharing your code, if possible ? Thanks On Fri, Oct 23, 2015 at 9:49 AM, crakjie wrote: > Hello. > > I have activated the file checkpointing for a DStream to unleach the > updateStateByKey. > My unit test worked with no problem but when I have integrated this in my > full

question about HadoopFsRelation

2015-10-23 Thread Koert Kuipers
I noticed the comments for HadoopFsRelation.buildScan say: "@param inputFiles For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition." Do I

How to implement zipWithIndex as a UDF?

2015-10-23 Thread Benyi Wang
If I have two columns, StructType(Seq( StructField("id", LongType), StructField("phones", ArrayType(StringType I want to add an index for "phones" before I explode it. Can this be implemented as a GenericUDF? I tried DataFrame.explode. It worked for simple types like string, but I could not
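
One hedged way to get this without a Hive GenericUDF, assuming Spark 1.5's Scala UDF support (tuples map to structs, so the index survives the explode; df, id and phones mirror the schema above):

    import org.apache.spark.sql.functions.{col, explode, udf}

    // Pair each phone with its position before exploding.
    val zipIndex = udf { xs: Seq[String] => xs.zipWithIndex }
    val exploded = df
      .select(col("id"), explode(zipIndex(col("phones"))).as("p"))
      .select(col("id"), col("p._1").as("phone"), col("p._2").as("idx"))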

unsubscribe

2015-10-23 Thread ZAHOORAHMED.KAZI
This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review,

Re: How to implement zipWithIndex as a UDF?

2015-10-23 Thread Michael Armbrust
The user facing type mapping is documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types On Fri, Oct 23, 2015 at 12:10 PM, Benyi Wang wrote: > If I have two columns > > StructType(Seq( > StructField("id", LongType), >

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-23 Thread Augustus Hong
How did you specify the number of cores each executor can use? Be sure to use this when submitting jobs with spark-submit: --total-executor-cores 100. Other options won't work from my experience. On Fri, Oct 23, 2015 at 8:36 AM, gaurav sharma wrote: > Hi, > > I created
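
For what it's worth, the programmatic equivalent is the spark.cores.max property, which caps an application's total cores on a standalone master so a second job can still be scheduled (a sketch; the numbers are assumptions):

    // Cap this app at 4 of the cluster's cores, leaving cores free for a
    // second streaming job on the same master.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.cores.max", "4")   // same effect as --total-executor-cores 4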

Re: How to get inverse Matrix / RDD or how to solve linear system of equations

2015-10-23 Thread Zhiliang Zhu
Hi Sujit, First, I must express my deep appreciation and respect for your kind help and excellent knowledge. It would be best if you and I were in the same place, so that I could thank you in person. I will try your way with Spark MLlib SVD. For Linear

Re: Huge shuffle data size

2015-10-23 Thread Kartik Mathur
Don't use groupBy; use reduceByKey instead. groupBy should always be avoided as it leads to a lot of shuffle reads/writes. On Fri, Oct 23, 2015 at 11:39 AM, pratik khadloya wrote: > Sorry, I sent the wrong join code snippet; the actual snippet is > > aggImpsDf.join( >
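
The difference in one sketch (pairs is a hypothetical RDD[(String, Int)]; reduceByKey combines map-side before shuffling, while groupByKey ships every value across the network):

    val summed  = pairs.reduceByKey(_ + _)             // map-side combine, small shuffle
    val grouped = pairs.groupByKey().mapValues(_.sum)  // shuffles every single value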

Re: unsubscribe

2015-10-23 Thread Ted Yu
Take a look at first section of https://spark.apache.org/community On Fri, Oct 23, 2015 at 1:46 PM, wrote: > This e-mail and any files transmitted with it are for the sole use of the > intended recipient(s) and may contain confidential and privileged >

streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-10-23 Thread Andy Davidson
I need to save the Twitter statuses I receive so that I can do additional batch-based processing on them in the future. Is it safe to assume HDFS is the best way to go? Any idea what is the best way to save Twitter statuses to HDFS? JavaStreamingContext ssc = new JavaStreamingContext(jsc,
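
A minimal Scala sketch of one common approach (the path is an assumption, and Status is serialized via toString purely for illustration; saveAsTextFiles writes one HDFS directory per batch interval):

    import org.apache.spark.streaming.twitter.TwitterUtils

    val tweets = TwitterUtils.createStream(ssc, None)  // ssc: StreamingContext
    tweets
      .map(_.toString)                                 // or a proper JSON encoding of Status
      .saveAsTextFiles("hdfs:///data/tweets/batch")    // hypothetical base path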

[SPARK-9776]Another instance of Derby may have already booted the database #8947

2015-10-23 Thread Ge, Yao (Y.)
I have not been able to run spark-shell in yarn-cluster mode since 1.5.0 due to the same issue described by [SPARK-9776]. Did this pull request fix the issue? https://github.com/apache/spark/pull/8947 I still have the same problem with 1.5.1 (I am running on HDP 2.2.6 with Hadoop 2.6) Thanks.

Re: How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Calvin Jia
Hi Shane, Tachyon provides an api to get the block locations of the file which Spark uses when scheduling tasks. Hope this helps, Calvin On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane wrote: > Hi all, > > > > I am looking into how Spark handles data locality wrt

Re: Maven build failed (Spark master)

2015-10-23 Thread Kayode Odeyemi
I saw this when I tested manually (without ./make-distribution) Detected Maven Version: 3.2.2 is not in the allowed range 3.3.3. So I simply upgraded maven to 3.3.3. Resolved. Thanks On Fri, Oct 23, 2015 at 3:17 PM, Sean Owen wrote: > This doesn't show the actual error

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread Anubhav Agarwal
I have a spark job that creates 6 million rows in RDDs. I convert the RDD into Data-frame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS. Here is the snippet:- RDDList.parallelStream().forEach(mapJavaRDD -> { if (mapJavaRDD != null) {

Re: [SPARK STREAMING] polling based operation instead of event based operation

2015-10-23 Thread Nipun Arora
Thanks for the suggestion. 1. Heartbeat: As a matter of fact, the heartbeat solution is what I thought of as well. However that needs to be outside spark-streaming. Furthermore, it cannot be generalized to all spark applications. For, e.g. I am doing several filtering operations before I reach

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread emlyn
emlyn wrote > > xjlin0 wrote >> I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with >> or without Hadoop or home compiled with ant or maven). There was no >> error message in v1.4.x, system prompt nothing. On v1.5.x, once I enter >> $SPARK_HOME/bin/pyspark or spark-shell, I

How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Kinsella, Shane
Hi all, I am looking into how Spark handles data locality wrt Tachyon. My main concern is how this is coordinated. Will it send a task based on a file loaded from Tachyon to a node that it knows has that file locally, and how does it know which nodes have what? Kind regards, Shane This email

Re: How to close connection in mapPartitions?

2015-10-23 Thread Aniket Bhatnagar
Are you sure RedisClientPool is being initialized properly in the constructor of RedisCache? Can you please copy paste the code that you use to initialize RedisClientPool inside the constructor of RedisCache? Thanks, Aniket On Fri, Oct 23, 2015 at 11:47 AM Bin Wang wrote: >

Re: Save RandomForest Model from ML package

2015-10-23 Thread amarouni
It's an open issue: https://issues.apache.org/jira/browse/SPARK-4587 That being said, you can work around the issue by serializing the model (simple Java serialization) and then restoring it before calling the prediction job. Best Regards, On 22/10/2015 14:33, Sebastian Kuepers wrote: >

How to close connection in mapPartitions?

2015-10-23 Thread Bin Wang
I use mapPartitions to open connections to Redis. I write it like this:

    val seqs = lines.mapPartitions { lines =>
      val cache = new RedisCache(redisUrl, redisPort)
      val result = lines.map(line => Parser.parseBody(line, cache))
      cache.redisPool.close
      result
    }

But it
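
One hedged fix (a sketch reusing the names above): drain the iterator before closing, since map is lazy and the pool would otherwise be closed before any line is parsed:

    val seqs = lines.mapPartitions { iter =>
      val cache = new RedisCache(redisUrl, redisPort)
      // Materialize the partition eagerly so the pool is still open during parsing.
      val result = iter.map(line => Parser.parseBody(line, cache)).toList
      cache.redisPool.close
      result.iterator
    }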

Re: How to close connection in mapPartitions?

2015-10-23 Thread Bin Wang
BTW, "lines" is a DStream. Bin Wang 于2015年10月23日周五 下午2:16写道: > I use mapPartitions to open connections to Redis, I write it like this: > > val seqs = lines.mapPartitions { lines => > val cache = new RedisCache(redisUrl, redisPort) > val result = lines.map(line

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Akhil Das
Mostly a network issue, you need to check your network configuration from the aws console and make sure the ports are accessible within the cluster. Thanks Best Regards On Thu, Oct 22, 2015 at 8:53 PM, Eugen Cepoi wrote: > Huh indeed this worked, thanks. Do you know why

Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-23 Thread Xiao Li
Hi, Sebastian, To use private APIs, you have to be very familiar with the code path; otherwise, it is very easy to hit an exception or a bug. My suggestion is to use IntelliJ to step-by-step step in the function hiveContext.sql until you hit the parseSql API. Then, you will know if you have to

Re: Huge shuffle data size

2015-10-23 Thread pratik khadloya
Actually the groupBy is not taking a lot of time. The join that I do later takes most (95%) of the time. Also, the grouping I am doing is based on the DataFrame API, which does not contain any function for reduceBy... I guess the DF automatically uses reduceBy when we do a groupBy.

Spark can't create ORC files properly using 1.5.1 hadoop 2.6

2015-10-23 Thread unk1102
Hi, I am having a weird issue. I have a Spark job which has a bunch of hiveContext.sql() calls and creates ORC files as part of Hive tables with partitions, and it runs fine on 1.4.1 and Hadoop 2.4. Now I tried to move to Spark 1.5.1/Hadoop 2.6 and the Spark job does not work as expected: it does not create ORC

get host from rdd map

2015-10-23 Thread weoccc
In an RDD map function, is there a way I can know the list of host names where the map runs? Any code sample would be appreciated. Thanks, Weide

Re: get host from rdd map

2015-10-23 Thread Ted Yu
Can you outline your use case a bit more ? Do you want to know all the hosts which would run the map ? Cheers On Fri, Oct 23, 2015 at 5:16 PM, weoccc wrote: > in rdd map function, is there a way i can know the list of host names > where the map runs ? any code sample would

Re: get host from rdd map

2015-10-23 Thread weoccc
Yea, my use case is that I want to have some external communication from where the RDD map is being run. The external communication might be handled separately, transparent to Spark. What would be the hacky way and the non-hacky way to do that? :) Weide On Fri, Oct 23, 2015 at 5:32 PM, Ted Yu

Spark Streaming: how to StreamingContext.queueStream

2015-10-23 Thread Anfernee Xu
Hi, here's my situation: I have some kind of offline dataset, but I want to form a virtual data stream feeding Spark Streaming. My code looks like this:

    // sort offline data by time
    1) JavaRDD sortedByTime = offlineDataRDD.sortBy( );
    // compute a list of JavaRDD, each element
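
A hedged sketch of the queueStream route (shown in Scala; the element type and the slicing are assumptions — with oneAtATime = true, each queued RDD becomes exactly one batch):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val queue = mutable.Queue[RDD[String]]()
    val virtualStream = ssc.queueStream(queue, oneAtATime = true)
    // After sorting the offline data by time, enqueue one time slice per batch:
    timeSlices.foreach(slice => queue += slice)  // timeSlices: Seq[RDD[String]], hypothetical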

Re: sqlContext load by offset

2015-10-23 Thread Kayode Odeyemi
When I use that I get a "Caused by: org.postgresql.util.PSQLException: ERROR: column "none" does not exist" On Thu, Oct 22, 2015 at 9:31 PM, Kayode Odeyemi wrote: > Hi, > > I've trying to load a postgres table using the following expression: > > val cachedIndex =

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Eugen Cepoi
When fixing the port to the same values as in the stack trace it works too. The network config of the slaves seems correct. Thanks, Eugen 2015-10-23 8:30 GMT+02:00 Akhil Das : > Mostly a network issue, you need to check your network configuration from > the aws

Re: [SPARK STREAMING] polling based operation instead of event based operation

2015-10-23 Thread Lars Albertsson
There is a heartbeat stream pattern that you can use: Create a service (perhaps a thread in your driver) that pushes a heartbeat event to a different stream every N seconds. Consume that stream as well in your streaming application, and perform an action on every heartbeat. This has worked well
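
A rough sketch of that heartbeat pattern (the queue-based source and the 10s period are assumptions; any stream the application can consume works):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val heartbeats = mutable.Queue[RDD[Long]]()
    val heartbeatStream = ssc.queueStream(heartbeats)
    new Thread {
      override def run(): Unit = while (true) {
        heartbeats += ssc.sparkContext.parallelize(Seq(System.currentTimeMillis))
        Thread.sleep(10000)
      }
    }.start()
    // Perform the periodic action on every heartbeat batch.
    heartbeatStream.foreachRDD(rdd => if (!rdd.isEmpty()) { /* periodic action */ })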

Re: Saving offset while reading from kafka

2015-10-23 Thread Erwan ALLAIN
Have a look at this: https://github.com/koeninger/kafka-exactly-once especially: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala

Huge shuffle data size

2015-10-23 Thread pratik khadloya
Hello, data about my Spark job is below. My source data is only 916MB (stage 0) and 231MB (stage 1), but when I join the two data sets (stage 2) it takes a very long time, and as I see it the shuffled data is 614GB. Is this something expected? Both the data sets produce 200 partitions. Stage

spark.python.worker.memory Discontinuity

2015-10-23 Thread Connor Zanin
Hi all, I am running a simple word count job on a cluster of 4 nodes (24 cores per node). I am varying two parameters in the configuration: spark.python.worker.memory and the number of partitions in the RDD. My job is written in Python. I am observing a discontinuity in the run time of the job

Re: Stream are not serializable

2015-10-23 Thread pratik khadloya
You might be referring to some class-level variables from your code. I got to see the actual field which caused the error when I marked the class as serializable and ran it on the cluster. class MyClass extends java.io.Serializable The following resources will also help:
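
The usual shape of the problem, sketched (hypothetical class; a field reference inside a closure captures the whole enclosing instance, which then must be serializable):

    class Job {
      val threshold = 10
      def bad(rdd: org.apache.spark.rdd.RDD[Int]) =
        rdd.filter(_ > threshold)   // `threshold` is really `this.threshold`: captures `this`
      def good(rdd: org.apache.spark.rdd.RDD[Int]) = {
        val t = threshold           // copy to a local val: only the Int is shipped
        rdd.filter(_ > t)
      }
    }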

Re: Spark error:- Not a valid DFS File name

2015-10-23 Thread kali.tumm...@gmail.com
Full Error:-

    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:104)
    at