Re: Partitions with zero records & variable task times

2015-09-08 Thread Akhil Das
Try using a custom partitioner for the keys so that they will get evenly distributed across tasks Thanks Best Regards On Fri, Sep 4, 2015 at 7:19 PM, mark wrote: > I am trying to tune a Spark job and have noticed some strange behavior - > tasks in a stage vary in
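A minimal sketch of the suggestion, assuming a pair RDD and an existing SparkContext `sc`; the partition count and spreading logic are placeholders to be replaced with domain-specific logic:

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: same contract as HashPartitioner; getPartition is
    // where custom logic for spreading skewed keys would go.
    class EvenSpreadPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = {
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod // keep the result non-negative
      }
    }

    val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val repartitioned = pairRdd.partitionBy(new EvenSpreadPartitioner(64))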

Can not allocate executor when running spark on mesos

2015-09-08 Thread canan chen
I0908 15:08:16.515960 301916160 master.cpp:1767] Received registration request for framework 'Spark shell' at scheduler-1ea1c85b-68bd-40b4-8c7c-ddccfd56f82b@192.168.3.3:57133 I0908 15:08:16.520545 301916160 master.cpp:1834] Registering framework 20150908-143320-16777343-5050-41965- (Spark shell

about mr-style merge sort

2015-09-08 Thread 周千昊
Hi community, I have an application which I am trying to migrate from MR to Spark. It will do some calculations from Hive and output to an hfile which will be bulk loaded into an HBase table, details as follows: Rdd input = getSourceInputFromHive() Rdd> mapSideResult =

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-08 Thread Gheorghe Postelnicu
Good point. It is a pre-compiled Spark version. Based on the text on the downloads page, the answer to your question is no, so I will download the sources and recompile. Thanks! On Tue, Sep 8, 2015 at 5:17 AM, Koert Kuipers wrote: > is /opt/spark-1.4.1-bin-hadoop2.6 a spark

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-08 Thread Gheorghe Postelnicu
Compiling from source with Scala 2.11 support fixed this issue. Thanks again for the help! On Tue, Sep 8, 2015 at 7:33 AM, Gheorghe Postelnicu < gheorghe.posteln...@gmail.com> wrote: > Good point. It is a pre-compiled Spark version. Based on the text on the > downloads page, the answer to your

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-08 Thread Anders Arpteg
Ok, thanks Reynold. When I tested dynamic allocation with Spark 1.4, it complained saying that it was not tungsten compliant. Lets hope it works with 1.5 then! On Tue, Sep 8, 2015 at 5:49 AM Reynold Xin wrote: > > On Wed, Sep 2, 2015 at 12:03 AM, Anders Arpteg

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
Haha ok, its one of those days, Array isn't valid. RTFM and it says Catalyst array maps to a Scala Seq, that makes sense. So it works! Two follow up questions; 1 - Is this the best approach? 2 - what if I want my expression to return multiple rows? - my binary classification model gives me a
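A sketch of the Seq-based approach under discussion, assuming a DataFrame `df` whose columns are all numeric; `myModel` is a hypothetical stand-in for the real scorer:

    import org.apache.spark.sql.functions.{array, col, udf}

    // hypothetical model object; replace score() with the real implementation
    object myModel { def score(features: Seq[Double]): Double = features.sum }

    val scoreUdf = udf((features: Seq[Double]) => myModel.score(features))
    // pack all columns into one array column and score it row by row
    val scored = df.withColumn("score", scoreUdf(array(df.columns.map(col): _*)))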

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
Not sure how that would work. Really I want to tack on an extra column onto the DF with a UDF that can take a Row object. On Tue, Sep 8, 2015 at 1:54 AM, Jörn Franke wrote: > Can you use a map or list with different properties as one parameter? > Alternatively a string

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
Sorry for the spam - I had some success; case class ScoringDF(function: Row => Double) extends Expression { val dataType = DataTypes.DoubleType override type EvaluatedType = Double override def eval(input: Row): EvaluatedType = { function(input) } override def nullable: Boolean =

Split content into multiple Parquet files

2015-09-08 Thread Adrien Mogenet
Hi there, We've spent several hours trying to split our input data into several parquet files (or several folders, i.e. /datasink/output-parquets//foobar.parquet), based on a low-cardinality key. This works very well when using saveAsHadoopFile, but we can't achieve a similar thing with Parquet

Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-08 Thread Night Wolf
So basically I need something like df.withColumn("score", new Column(new Expression { ... def eval(input: Row = null): EvaluatedType = myModel.score(input) ... })) But I can't do this, so how can I make a UDF or something like it, that can take in a Row and pass back a double value or some

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
as a starting point, attach your stacktrace... ps: look for duplicates in your classpath, maybe you include another jar with same class On 8 September 2015 at 06:38, Nicholas R. Peterson wrote: > I'm trying to run a Spark 1.4.1 job on my CDH5.4 cluster, through Yarn. >

Re: Split content into multiple Parquet files

2015-09-08 Thread Cheng Lian
In Spark 1.4 and 1.5, you can do something like this: df.write.partitionBy("key").parquet("/datasink/output-parquets") BTW, I'm curious about how did you do it without partitionBy using saveAsHadoopFile? Cheng On 9/8/15 2:34 PM, Adrien Mogenet wrote: Hi there, We've spent several hours to
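For completeness, a sketch of the suggested call, with the path and column name taken from the thread:

    // writes one sub-directory per distinct value of "key",
    // e.g. /datasink/output-parquets/key=foobar/part-*.parquet
    df.write.partitionBy("key").parquet("/datasink/output-parquets")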

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Nicholas R. Peterson
Thanks, Igor; I've got it running again right now, and can attach the stack trace when it finishes. In the meantime, I've noticed something interesting: in the Spark UI, the application jar that I submit is not being included on the classpath. It has been successfully uploaded to the nodes -- in

Re: buildSupportsSnappy exception when reading the snappy file in Spark

2015-09-08 Thread Akhil Das
Looks like you are having different versions of snappy library. Here's a similar discussion if you haven't seen it already http://stackoverflow.com/questions/22150417/hadoop-mapreduce-java-lang-unsatisfiedlinkerror-org-apache-hadoop-util-nativec Thanks Best Regards On Mon, Sep 7, 2015 at 7:41

Re: How to read files from S3 from Spark local when there is a http proxy

2015-09-08 Thread tariq
Hi svelusamy, Were you able to make it work? I am facing the exact same problem. Getting connection timed when trying to access S3. Thank you. -- View this message in context:

Re: Can not allocate executor when running spark on mesos

2015-09-08 Thread Akhil Das
iguration change. Here's the mesos master log > > I0908 15:08:16.515960 301916160 master.cpp:1767] Received registration > request for framework 'Spark shell' at > scheduler-1ea1c85b-68bd-40b4-8c7c-ddccfd56f82b@192.168.3.3:57133 > I0908 15:08:16.520545 301916160 master.cpp:1834]

Re: Exception when restoring spark streaming with batch RDD from checkpoint.

2015-09-08 Thread Akhil Das
Try to add a filter to remove/replace the null elements within/before the map operation. Thanks Best Regards On Mon, Sep 7, 2015 at 3:34 PM, ZhengHanbin wrote: > Hi, > > I am using spark streaming to join every RDD of a DStream to a stand alone > RDD to generate a new

Re: Parquet Array Support Broken?

2015-09-08 Thread Cheng Lian
Yeah, this is a typical Parquet interoperability issue due to unfortunate historical reasons. Hive (actually parquet-hive) gives the following schema for array: message m0 { optional group f (LIST) { repeated group bag { optional int32 array_element; } } } while Spark SQL gives

Re: Sending yarn application logs to web socket

2015-09-08 Thread Jeetendra Gangele
1. In order to change log4j.properties at the name node, you can change /home/hadoop/log4j.properties. 2. In order to change log4j.properties for the container logs, you need to change it in the yarn containers jar, since they hard-coded loading the file directly from project resources. 2.1 ssh to the

Applying transformations on a JavaRDD using reflection

2015-09-08 Thread Nirmal Fernando
Hi All, I'd like to apply a chain of Spark transformations (map/filter) on a given JavaRDD. I'll have the set of Spark transformations as Function, and even though I can determine the classes of T and A at the runtime, due to the type erasure, I cannot call JavaRDD's transformations as they

1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
Hi All, I'm trying to build a distribution off of the latest in master and I keep getting errors on MQTT and the build fails. I'm running the build on an m1.large which has 7.5 GB of RAM and no other major processes running. MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M

Re: Spark ANN

2015-09-08 Thread Feynman Liang
Just wondering, why do we need tensors? Is the implementation of convnets using im2col (see here) insufficient? On Tue, Sep 8, 2015 at 11:55 AM, Ulanov, Alexander wrote: > Ruslan, thanks for including me in the

read compressed hdfs files using SparkContext.textFile?

2015-09-08 Thread shenyan zhen
Hi, For HDFS files written with the code below: rdd.saveAsTextFile(getHdfsPath(...), classOf[org.apache.hadoop.io.compress.GzipCodec]) I can see the hdfs files being generated: 0 /lz/streaming/am/144173460/_SUCCESS 1.6 M /lz/streaming/am/144173460/part-0.gz 1.6 M
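Reading these back needs no extra work, since textFile infers the codec from the file extension; a sketch assuming the paths above:

    // GzipCodec is picked from the .gz suffix, so textFile decompresses transparently
    val lines = sc.textFile("/lz/streaming/am/144173460/part-*.gz")
    lines.take(5).foreach(println)

Note that gzip files are not splittable, so each .gz file becomes a single partition.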

Re: Python Spark Streaming example with textFileStream does not work. Why?

2015-09-08 Thread Tathagata Das
Can you give absolute paths just to be sure? On Mon, Sep 7, 2015 at 12:59 AM, Kamil Khadiev wrote: > I think that problem also depends on file system, > I use mac and My program found file, but only when I created new, but not > rename or move > > And in logs > 15/09/07

performance when checking if data frame is empty or not

2015-09-08 Thread Axel Dahl
I have a join that fails when one of the data frames is empty. To avoid this I am hoping to check if the dataframe is empty or not before the join. The question is: what's the most performant way to do that? Should I do df.count() or df.first() or something else? Thanks in advance, -Axel
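A sketch of the usual low-cost check: fetch at most one row instead of counting everything.

    // take(1) stops after the first row is found; count() would scan the whole dataset
    val isEmpty = df.take(1).isEmpty
    if (!isEmpty) { /* safe to join */ }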

Re: NPE while reading ORC file using Spark 1.4 API

2015-09-08 Thread Davies Liu
I think this is fixed in 1.5 (release soon), by https://github.com/apache/spark/pull/8407 On Tue, Sep 8, 2015 at 11:39 AM, unk1102 wrote: > Hi I read many ORC files in Spark and process it those files are basically > Hive partitions. Most of the times processing goes well

Re: Different Kafka createDirectStream implementations

2015-09-08 Thread Dan Dutrow
I see... the first method takes the offsets as its third parameter while the second method just takes topic names, and that's the primary reason why the implementations are different. In that case, what I am noticing is that setting the messageHandler is unavailable in the second method. This

NPE while reading ORC file using Spark 1.4 API

2015-09-08 Thread unk1102
Hi, I read many ORC files in Spark and process them; those files are basically Hive partitions. Most of the time processing goes well, but for a few files I get the following exception and don't know why. These files work fine in Hive using Hive queries. Please guide. Thanks in advance. DataFrame df

Re: NPE while reading ORC file using Spark 1.4 API

2015-09-08 Thread Umesh Kacha
Hi Zhan, thanks for the reply. Yes, the schema should be the same; actually I am reading Hive table partitions in ORC format into Spark, so I believe it should be the same. I am new to Hive so I don't know if the schema can be different in a Hive partitioned table. On Wed, Sep 9, 2015 at 12:16 AM, Zhan Zhang

Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Tathagata Das
You can use local directories in that case but it is not recommended and not a well-test code path (so I have no idea what can happen). On Tue, Sep 8, 2015 at 6:59 AM, Cody Koeninger wrote: > Yes, local directories will be sufficient > > On Sat, Sep 5, 2015 at 10:44 AM, N B

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Thank you very much for the great answer! -- Yandex.Mail — reliable mail http://mail.yandex.ru/neo2/collect/?exp=1=1 08.09.2015, 23:53, "Cody Koeninger" : > Yeah, that's the general idea. > > When you say hard code topic name, do you mean Set(topicA, topicB, topicB) ? >

Re: ClassCastException in driver program

2015-09-08 Thread Jeff Jones
Thanks for the response. Turns out that this post addressed the issue. http://stackoverflow.com/questions/28186607/java-lang-classcastexception-using-lambda-expressions-in-spark-job-on-remote-ser We have some UDFs defined and the jar containing the class for these UDFs wasn’t in the dependent

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Oh my, I implemented one directStream instead of a union of three, but it is still growing exponentially with the window method. -- Yandex.Mail — reliable mail http://mail.yandex.ru/neo2/collect/?exp=1=1 08.09.2015, 23:53, "Cody Koeninger" : > Yeah, that's the general idea. > >

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
Yeah, that's the general idea. When you say hard code topic name, do you mean Set(topicA, topicB, topicB) ? You should be able to use a variable for that - read it from a config file, whatever. If you're talking about the match statement, yeah you'd need to hardcode your cases. On Tue, Sep

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-08 Thread Richard Marscher
Hi, what is the reasoning behind the use of `coalesce(1,false)`? This is saying to aggregate all data into a single partition, which must fit in memory on one node in the Spark cluster. If the cluster has more than one node it must shuffle to move the data. It doesn't seem like the following map

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Sandy Ryza
Those settings seem reasonable to me. Are you observing performance that's worse than you would expect? -Sandy On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov wrote: > Hi Sandy > > Thank you for your reply > Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB) >

Re: Memory-efficient successive calls to repartition()

2015-09-08 Thread Aurélien Bellet
What is strange is that if I remove the if condition (i.e., checkpoint at each iteration), then it basically works: non-HDFS disk usage remains very small and stable throughout the execution. If instead I checkpoint only every now and then (cf. code in my previous email), then the disk usage
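For reference, a sketch of the pattern under discussion — checkpointing every k-th iteration and forcing materialization; the interval, update function, and checkpoint directory are placeholders:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical directory
    var rdd = sc.parallelize(1 to 1000000)
    for (i <- 1 to 100) {
      val prev = rdd
      rdd = prev.map(_ + 1).persist()   // stand-in for the real per-iteration update
      if (i % 10 == 0) rdd.checkpoint() // truncate the lineage every 10 iterations
      rdd.count()      // an action forces both the persist and the checkpoint to happen
      prev.unpersist() // drop the previous iteration's cached blocks
    }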

Re: New to Spark - Paritioning Question

2015-09-08 Thread Richard Marscher
That seems like it could work, although I don't think `partitionByKey` is a thing, at least for RDD. You might be able to merge step #2 and step #3 into one step by using the `reduceByKey` function signature that takes in a Partitioner implementation, as sketched below. def reduceByKey(partitioner: Partitioner
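A sketch of that signature in use, with a hypothetical pair RDD and partition count:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k1", 3)))
    // one shuffle: reduces and lays the data out by key in the same pass
    val reduced = pairs.reduceByKey(new HashPartitioner(8), _ + _)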

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Nick Peterson
Yes, the jar contains the class: $ jar -tf lumiata-evaluation-assembly-1.0.jar | grep 2028/Document/Document com/i2028/Document/Document$1.class com/i2028/Document/Document.class What else can I do? Is there any way to get more information about the classes available to the particular

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Cody Koeninger
Have you tried deleting or moving the contents of the checkpoint directory and restarting the job? On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg wrote: > Sorry, more relevant code below: > > SparkConf sparkConf = createSparkConf(appName, kahunaEnv); >

Re: 1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
Ah, right. Should've caught that. The docs seem to recommend 2gb. Should that be increased as well? --Ben On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen wrote: > It shows you there that Maven is out of memory. Give it more heap. I use > 3gb. > > On Tue, Sep 8, 2015 at 1:53

Re: Java vs. Scala for Spark

2015-09-08 Thread Jonathan Coveney
It worked for Twitter! Seriously though: scala is much much more pleasant. And scala has a great story for using Java libs. And since spark is kind of framework-y (use its scripts to submit, start up repl, etc) the projects tend to be lead projects, so even in a big company that uses Java the

Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-08 Thread Sasha Kacanski
Hi Ashish, Thanks for the update. I tried all of it, but what I don't get it is that I run cluster with one node so presumably I should have PYspark binaries there as I am developing on same host. Could you tell me where you placed parcels or whatever cloudera is using. My understanding of yarn

Re: buildSupportsSnappy exception when reading the snappy file in Spark

2015-09-08 Thread dong.yajun
hi Akhil, I just used the property LD_LIBRARY_PATH in conf/spark-env.sh (rather than SPARK_LIBRARY_PATH), pointing it to the native library path, and it works. thanks. On Tue, Sep 8, 2015 at 6:14 PM, Akhil Das wrote: > Looks like you are having different versions of snappy

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Nicholas R. Peterson
Here is the stack trace: (Sorry for the duplicate, Igor -- I forgot to include the list.) 15/09/08 05:56:43 WARN scheduler.TaskSetManager: Lost task 183.0 in stage 41.0 (TID 193386, ds-compute2.lumiata.com): java.io.IOException: com.esotericsoftware.kryo.KryoException: Error constructing

Re: How to read files from S3 from Spark local when there is a http proxy

2015-09-08 Thread Akhil Das
In the shell, before running the job you can actually do a *export http_proxy="http://host:port"* and see if it works. Thanks Best Regards On Tue, Sep 8, 2015 at 6:21 PM, tariq wrote: > Hi svelusamy, > > Were you able to make it work? I am facing the exact same problem.

Spark intermittently fails to recover from a worker failure (in standalone mode)

2015-09-08 Thread Cheuk Lam
We have run into a problem where some Spark job is aborted after a worker is killed in a 2-worker standalone cluster. The problem is intermittent, but we can consistently reproduce it. The problem only appears to happen when we kill a worker. It doesn't happen when we kill an executor directly.

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
java.lang.ClassNotFoundException: com.i2028.Document.Document 1. so have you checked that jar that you create(fat jar) contains this class? 2. might be there is some stale cache issue...not sure though On 8 September 2015 at 16:12, Nicholas R. Peterson wrote: > Here is

Re: 1.5 Build Errors

2015-09-08 Thread Sean Owen
It shows you there that Maven is out of memory. Give it more heap. I use 3gb. On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen wrote: > Hi All, > > I'm trying to build a distribution off of the latest in master and I keep > getting errors on MQTT and the build fails. I'm

Java vs. Scala for Spark

2015-09-08 Thread Bryan Jeffrey
All, We're looking at language choice in developing a simple streaming processing application in spark. We've got a small set of example code built in Scala. Articles like the following: http://www.bigdatatidbits.cc/2015/02/navigating-from-scala-to-spark-for.html would seem to indicate that

Re: Is HDFS required for Spark streaming?

2015-09-08 Thread Cody Koeninger
Yes, local directories will be sufficient On Sat, Sep 5, 2015 at 10:44 AM, N B wrote: > Hi TD, > > Thanks! > > So our application does turn on checkpoints but we do not recover upon > application restart (we just blow the checkpoint directory away first and > re-create the

Re: Java vs. Scala for Spark

2015-09-08 Thread Ted Yu
Performance-wise, Scala is by far the best choice when you use Spark. The cost of learning Scala is not negligible but not insurmountable either. My personal opinion. On Tue, Sep 8, 2015 at 6:50 AM, Bryan Jeffrey wrote: > All, > > We're looking at language choice in

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Tathagata Das
Why are you checkpointing the direct kafka stream? It serves no purpose. TD On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg wrote: > I just disabled checkpointing in our consumers and I can see that the > batch duration millis set to 20 seconds is now being

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
>> Why are you checkpointing the direct kafka stream? It serves no purpose. Could you elaborate on what you mean? Our goal is fault tolerance. If a consumer is killed or stopped midstream, we want to resume where we left off next time the consumer is restarted. How would that be "not serving

Re: New to Spark - Paritioning Question

2015-09-08 Thread Mike Wright
Thanks for the response! Well, in retrospect each partition doesn't need to be restricted to a single key. But, I cannot have values associated with a key span partitions since they all need to be processed together for a key to facilitate cumulative calcs. So provided an individual key has all

Re: Can not allocate executor when running spark on mesos

2015-09-08 Thread canan chen
I0908 15:08:16.515960 301916160 master.cpp:1767] Received registration >> request for framework 'Spark shell' at >> scheduler-1ea1c85b-68bd-40b4-8c7c-ddccfd56f82b@192.168.3.3:57133 >> I0908 15:08:16.520545 301916160 master.cpp:1834] Registering framework >> 20150908-14332

Re: Java vs. Scala for Spark

2015-09-08 Thread Bryan Jeffrey
Thank you for the quick responses. It's useful to have some insight from folks already extensively using Spark. Regards, Bryan Jeffrey On Tue, Sep 8, 2015 at 10:28 AM, Sean Owen wrote: > Why would Scala vs Java performance be different Ted? Relatively > speaking there is

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
hmm... out of ideas. Can you check in the Spark UI environment tab that this jar does not somehow appear 2 times or more? Or, more generally, any 2 jars that could contain this class by any chance. Regarding your question about the classloader - no idea, probably there is; I remember stackoverflow has some

Re: Java vs. Scala for Spark

2015-09-08 Thread Igor Berman
We are using Java 7; it's much more verbose than the Java 8 or Scala examples. In addition, there are sometimes libraries that have no Java API, so you need to write the bindings yourself (e.g. GraphX). On the other hand, Scala is not a trivial language like Java, so it depends on your team. On 8 September 2015 at

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
another idea - you can add this fat jar explicitly to the classpath of executors... it's not a solution, but maybe it will work... I mean place it somewhere locally on the executors and add it to the cp with spark.executor.extraClassPath On 8 September 2015 at 18:30, Nick Peterson
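A sketch of that setting with a made-up path; the jar must already sit at this location on every executor:

    // equivalent to setting spark.executor.extraClassPath in spark-defaults.conf
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraClassPath", "/opt/jars/app-assembly-1.0.jar")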

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Cody Koeninger
Well, I'm not sure why you're checkpointing messages. I'd also put in some logging to see what values are actually being read out of your params object for the various settings. On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg wrote: > I've stopped the jobs, the

Compress JSON dataframes

2015-09-08 Thread Saif.A.Ellafi
Hi, I am trying to figure out a way to compress the output of df.write.json() but have not been successful, even after changing spark.io.compression. Any thoughts? Saif
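One workaround that works on Spark 1.x is to go through toJSON, which yields an RDD[String] that accepts a compression codec; the output path is hypothetical:

    import org.apache.hadoop.io.compress.GzipCodec

    // df.toJSON gives RDD[String]; saveAsTextFile takes a Hadoop codec
    df.toJSON.saveAsTextFile("/data/out/json-gz", classOf[GzipCodec])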

Re: Java vs. Scala for Spark

2015-09-08 Thread Dean Wampler
It's true that Java 8 lambdas help. If you've read Learning Spark, where they use Java 7, Python, and Scala for the examples, it really shows how awful Java without lambdas is for Spark development. Still, there are several "power tools" in Scala I would sorely miss if I were using Java 8: 1. The REPL

Re: Java vs. Scala for Spark

2015-09-08 Thread Ted Yu
Sean: w.r.t. performance, I meant Scala/Java vs Python. Cheers On Tue, Sep 8, 2015 at 7:28 AM, Sean Owen wrote: > Why would Scala vs Java performance be different Ted? Relatively > speaking there is almost no runtime difference; it's the same APIs or > calls via a thin

Re: 1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
I'm still getting errors with 3g. I've increased to 4g and I'll report back. To be clear: export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m" [ERROR] GC overhead limit exceeded -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Nick Peterson
Yeah... none of the jars listed on the classpath contain this class. The only jar that does is the fat jar that I'm submitting with spark-submit, which as mentioned isn't showing up on the classpath anywhere. -- Nick On Tue, Sep 8, 2015 at 8:26 AM Igor Berman wrote: >

Re: Partitioning a RDD for training multiple classifiers

2015-09-08 Thread Ben Tucker
Hi Maximo — This is a relatively naive answer, but I would consider structuring the RDD into a DataFrame, then saving the 'splits' using something like DataFrame.write.parquet(hdfs_path, partitionBy='client'). You could then read a DataFrame from each resulting parquet directory and do your

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Ok. Spark 1.4.1 on yarn. Here is my application: I have 4 different Kafka topics (different object streams) type Edge = (String,String) val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( nonEmpty ).map( toEdge ) val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter(

RE: Spark ANN

2015-09-08 Thread Ulanov, Alexander
That is an option too. Implementing convolutions with FFTs should be considered as well http://arxiv.org/pdf/1312.5851.pdf. From: Feynman Liang [mailto:fli...@databricks.com] Sent: Tuesday, September 08, 2015 12:07 PM To: Ulanov, Alexander Cc: Ruslan Dautkhanov; Nick Pentreath; user;

Re: Partitions with zero records & variable task times

2015-09-08 Thread mark
As I understand things (maybe naively), my input data are stored in equal sized blocks in HDFS, and each block represents a partition within Spark when read from HDFS, therefore each block should hold roughly the same number of records. So something is missing in my understanding - what can

Re: Support of other languages?

2015-09-08 Thread Nagaraj Chandrashekar
Hi Rahul, I may not have the answer for what you are looking for but my thoughts are given below. I have worked with HP Vertica and R via UDFs (User Defined Functions). I don't have any experience with Spark R till now. I would expect it might follow a similar route. UDF functions

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Tathagata Das
Calling directKafkaStream.checkpoint() will make the system write the raw kafka data into HDFS files (that is, RDD checkpointing). This is completely unnecessary with Direct Kafka because it already tracks the offset of data in each batch (which checkpoint is enabled using
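In other words, the direct stream needs only metadata checkpointing on the context, not RDD checkpointing on the stream. A sketch with hypothetical broker, topic, and checkpoint directory (spark-streaming-kafka 1.3-1.6 API):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///checkpoints/consumer1" // hypothetical

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(20))
      ssc.checkpoint(checkpointDir) // offsets are recovered from here; no stream.checkpoint() needed
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "broker:9092"), Set("topicA"))
      stream.foreachRDD(rdd => println(rdd.count()))
      ssc
    }

    // on restart this restores the old context -- including its original batch duration,
    // which is why a changed batch duration appears "sticky" until the checkpoint is cleared
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start(); ssc.awaitTermination()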

Event logging not working when worker machine terminated

2015-09-08 Thread David Rosenstrauch
Our Spark cluster is configured to write application history event logging to a directory on HDFS. This all works fine. (I've tested it with Spark shell.) However, on a large, long-running job that we ran tonight, one of our machines at the cloud provider had issues and had to be terminated

Re: Exception when restoring spark streaming with batch RDD from checkpoint.

2015-09-08 Thread Tathagata Das
Probably, the problem here is that the recovered StreamingContext is trying to refer to the pre-failure static RDD, which does not exist after the failure. The solution: When the driver process restarts from checkpoint, you need to recreate the static RDD again explicitly, and make that the recreated

Re: Applying transformations on a JavaRDD using reflection

2015-09-08 Thread Nirmal Fernando
Any thoughts? On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando wrote: > Hi All, > > I'd like to apply a chain of Spark transformations (map/filter) on a given > JavaRDD. I'll have the set of Spark transformations as Function, and > even though I can determine the classes of

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
What's wrong with creating a checkpointed context?? We WANT checkpointing, first of all. We therefore WANT the checkpointed context. Second of all, it's not true that we're loading the checkpointed context independent of whether params.isCheckpointed() is true. I'm quoting the code again: //

Contribution in Apche Spark

2015-09-08 Thread Chintan Bhatt
I want to contribute to Apache Spark, especially to MLlib. Please suggest any open issue/use case/problem. -- CHINTAN BHATT Assistant Professor, U & P U Patel Department of Computer Engineering, Chandubhai S. Patel Institute of

Re: Event logging not working when worker machine terminated

2015-09-08 Thread Jeff Zhang
What cluster mode do you use ? Standalone/Yarn/Mesos ? On Wed, Sep 9, 2015 at 11:15 AM, David Rosenstrauch wrote: > Our Spark cluster is configured to write application history event logging > to a directory on HDFS. This all works fine. (I've tested it with Spark >

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
That is good to know. However, that doesn't change the problem I'm seeing. Which is that, even with that piece of code commented out (stream.checkpoint()), the batch duration millis aren't getting changed unless I take checkpointing completely out. In other words, this commented out: //if

Re: Getting Started with Spark

2015-09-08 Thread Tathagata Das
This is a known issue introduced in Spark 1.4.1 and 1.5.0 and will be fixed in Spark 1.5.1. In the meantime, you could prototype in Spark 1.4.0 and wait for Spark 1.5.1/1.4.2 to come out. You could also download the source code and compile the Spark master branch.

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Tathagata Das
Well, you are returning JavaStreamingContext.getOrCreate(params. getCheckpointDir(), factory); That is loading the checkpointed context, independent of whether params .isCheckpointed() is true. On Tue, Sep 8, 2015 at 8:28 PM, Dmitry Goldenberg wrote: > That is good

Re: Java vs. Scala for Spark

2015-09-08 Thread Jerry Lam
Hi Bryan, I would choose a language based on the requirements. It does not make sense if you have a lot of dependencies that are java-based components and interoperability between java and scala is not always obvious. I agree with the above comments that Java is much more verbose than Scala in

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
I just disabled checkpointing in our consumers and I can see that the batch duration millis set to 20 seconds is now being honored. Why would that be the case? And how can we "untie" batch duration millis from checkpointing? Thanks. On Tue, Sep 8, 2015 at 11:48 AM, Cody Koeninger

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
The thing is that these topics contain absolutely different AVRO objects (Array[Byte]) that I need to deserialize into different Java (Scala) objects, filter and then map to tuple (String, String). So I have 3 streams with different avro objects in them. I need to cast them (using some business

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
I'm not 100% sure what's going on there, but why are you doing a union in the first place? If you want multiple topics in a stream, just pass them all in the set of topics to one call to createDirectStream On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin wrote: > Ok. > Spark

Spark with proxy

2015-09-08 Thread Mohammad Tariq
Hi friends, Is it possible to interact with Amazon S3 using Spark via a proxy? This is what I have been doing : SparkConf conf = new SparkConf().setAppName("MyApp").setMaster("local"); JavaSparkContext sparkContext = new JavaSparkContext(conf); Configuration hadoopConf =
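One hedged sketch, assuming the s3a connector (hadoop-aws 2.6+) and an existing SparkContext `sc`; host, port, and bucket are placeholders:

    // s3a honors proxy settings from the Hadoop configuration
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3a.proxy.host", "proxy.example.com") // placeholder host
    hadoopConf.set("fs.s3a.proxy.port", "8080")              // placeholder port
    val rdd = sc.textFile("s3a://my-bucket/path/")           // hypothetical bucket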

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
That doesn't really matter. With the direct stream you'll get all objects for a given topicpartition in the same spark partition. You know what topic it's from via hasOffsetRanges. Then you can deserialize appropriately based on topic. On Tue, Sep 8, 2015 at 11:16 AM, Понькин Алексей
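A sketch of that per-topic routing inside one direct stream, assuming an existing StreamingContext `ssc`; `deserialize` is a hypothetical per-topic Avro decoder:

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // placeholder for the real per-topic Avro deserialization
    def deserialize(topic: String, bytes: Array[Byte]): (String, String) = ("src", "dst")

    val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("A", "B", "C"))

    val edges = stream.transform { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        val topic = ranges(i).topic // spark partition i maps 1:1 to kafka topic-partition i
        iter.map { case (_, bytes) => deserialize(topic, bytes) }
      }
    }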

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
Just verified the logic for passing the batch duration millis in, looks OK. I see the value of 20 seconds being reflected in the logs - but not in the spark ui. Also, just commented out this piece and the consumer is still stuck at using 10 seconds for batch duration millis. //if

Different Kafka createDirectStream implementations

2015-09-08 Thread Dan Dutrow
The two methods of createDirectStream appear to have different implementations, the second checks the offset.reset flags and does some error handling while the first does not. Besides the use of a messageHandler, are they intended to be used in different situations? def createDirectStream[ K:
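For reference, a sketch of the two call forms side by side (spark-streaming-kafka 1.3+ API), assuming an existing StreamingContext `ssc`; broker, topic, and offsets are placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")

    // form 1: explicit starting offsets plus a messageHandler; auto.offset.reset is ignored
    val fromOffsets = Map(TopicAndPartition("topicA", 0) -> 0L)
    val s1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()))

    // form 2: topics only; starting offsets come from auto.offset.reset, no messageHandler
    val s2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topicA"))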

Partitioning a RDD for training multiple classifiers

2015-09-08 Thread Maximo Gurmendez
Hi, I have a RDD that needs to be split (say, by client) in order to train n models (i.e. one for each client). Since most of the classifiers that come with ml-lib can only accept an RDD as input (and cannot build multiple models in one pass - as I understand it), the only way to train n

Re: 1.5 Build Errors

2015-09-08 Thread Ted Yu
Do you run Zinc while compiling ? Cheers On Tue, Sep 8, 2015 at 7:56 AM, Benjamin Zaitlen wrote: > I'm still getting errors with 3g. I've increase to 4g and I'll report back > > To be clear: > > export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M >

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
I've stopped the jobs, the workers, and the master. Deleted the contents of the checkpointing dir. Then restarted master, workers, and consumers. I'm seeing the job in question still firing every 10 seconds. I'm seeing the 10 seconds in the Spark Jobs GUI page as well as our logs. Seems quite

No auto decompress in Spark Java textFile function?

2015-09-08 Thread Chris Teoh
Hi Folks, I tried using Spark v1.2 on bz2 files in Java but the behaviour is different from the same textFile API call in Python and Scala. That being said, how do I read .tar.bz2 files in Spark's Java API? Thanks in advance Chris

[streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Hi, I have an application with 2 streams, which are joined together. Stream1 is a simple DStream (relatively small batch chunks). Stream2 is a windowed DStream (with a duration of, for example, 60 seconds). Stream1 and Stream2 are Kafka direct streams. The problem is that, according to logs, window

Getting Started with Spark

2015-09-08 Thread Bryan Jeffrey
Hello. We're getting started with Spark Streaming. We're working to build some unit/acceptance testing around functions that consume DStreams. The current method for creating DStreams is to populate the data by creating an InputDStream: val input = Array(TestDataFactory.CreateEvent(123
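One common way to drive such tests without a real source is queueStream, which turns in-memory RDDs into batches; a minimal sketch with made-up event data, assuming an existing SparkContext `sc`:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val queue = mutable.Queue[RDD[Int]]()
    val input = ssc.queueStream(queue) // each queued RDD becomes one batch
    input.print()
    queue += sc.makeRDD(Seq(123, 456)) // hypothetical test events
    ssc.start()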

Re: 1.5 Build Errors

2015-09-08 Thread Sean Owen
It might need more memory in certain situations / running certain tests. If 3gb works for your relatively full build, yes you can open a PR to change any occurrences of lower recommendations to 3gb. On Tue, Sep 8, 2015 at 3:02 PM, Benjamin Zaitlen wrote: > Ah, right.

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
Can you provide more info (what version of spark, code example)? On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin wrote: > Hi, > > I have an application with 2 streams, which are joined together. > Stream1 - is simple DStream(relativly small size batch chunks) > Stream2 - is a

Re: Java vs. Scala for Spark

2015-09-08 Thread Sean Owen
Why would Scala vs Java performance be different Ted? Relatively speaking there is almost no runtime difference; it's the same APIs or calls via a thin wrapper. Scala/Java vs Python is a different story. Java libraries can be used in Scala. Vice-versa too, though calling Scala-generated classes

Re: Different Kafka createDirectStream implementations

2015-09-08 Thread Cody Koeninger
If you're providing starting offsets explicitly, then auto offset reset isn't relevant. On Tue, Sep 8, 2015 at 11:44 AM, Dan Dutrow wrote: > The two methods of createDirectStream appear to have different > implementations, the second checks the offset.reset flags and does
