Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
Hi, Thank you for this confirmation. Coalescing is what we do now. It creates, however, very big partitions. Guillaume Hey, I am not 100% sure but from my understanding accumulators are per partition (so per task, as it's the same) and are sent back to the driver with the task result and

Re: kafka spark streaming working example

2015-06-18 Thread Akhil Das
.setMaster(local): set it to local[2] or local[*] instead. Thanks Best Regards On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski bar...@scalaric.com wrote: hi, I'm trying to run a simple kafka spark streaming example over spark-shell: sc.stop import org.apache.spark.SparkConf import
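
A minimal sketch of this fix, assuming the standard Spark 1.x streaming setup; the app name and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local" gives the app a single thread, which a receiver occupies,
// leaving nothing to process batches; "local[2]" or "local[*]" avoids that.
val conf = new SparkConf()
  .setAppName("kafka-streaming-example")
  .setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
```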

Re: RE: Spark or Storm

2015-06-18 Thread bit1...@163.com
I am wondering how the direct stream api ensures end-to-end exactly-once semantics. I think there are two things involved: 1. From the spark streaming end, the driver will replay the Offset range when it's down and restarted, which means that the new tasks will process some already processed data. 2.

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
I was thinking exactly the same. I'm going to try it, It doesn't really matter if I lose an executor, since its sketch will be lost, but then reexecuted somewhere else. And anyway, it's an approximate data structure, and what matters are ratios, not exact values. I mostly need to take care

Best way to randomly distribute elements

2015-06-18 Thread abellet
Hello, In the context of a machine learning algorithm, I need to be able to randomly distribute the elements of a large RDD across partitions (i.e., essentially assign each element to a random partition). How could I achieve this? I have tried to call repartition() with the current number of

Re: Best way to randomly distribute elements

2015-06-18 Thread ayan guha
how about generating the key using some 1-way hashing like md5? On Thu, Jun 18, 2015 at 9:59 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote: I think you can randomly reshuffle your elements just by emitting a random key (mapping a PairRdd's key triggers a reshuffle IIRC) yourrdd.map{

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
Hey, I am not 100% sure but from my understanding accumulators are per partition (so per task, as it's the same) and are sent back to the driver with the task result and merged. When a task needs to be run n times (multiple rdds depend on this one, some partition loss later in the chain etc) then

Re: deployment options for Spark and YARN w/ many app jar library dependencies

2015-06-18 Thread Sweeney, Matt
Thank you, Sandy! I'll investigate use of the extraClassPath variable. Both options are helpful. Thanks, Matt On Jun 17, 2015, at 8:01 PM, Sandy Ryza sandy.r...@cloudera.commailto:sandy.r...@cloudera.com wrote: Hi Matt, If you place your jars on HDFS in a public location, YARN will cache

Re: Spark and Google Cloud Storage

2015-06-18 Thread Nick Pentreath
I believe it is available here: https://cloud.google.com/hadoop/google-cloud-storage-connector 2015-06-18 15:31 GMT+02:00 Klaus Schaefers klaus.schaef...@ligatus.com: Hi, is there a kind of adapter to use GoogleCloudStorage with Spark? Cheers, Klaus -- -- Klaus Schaefers Senior

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
BTW I suggest this instead of using thread locals, as I am not sure in which situations spark will reuse them or not. For example if an error happens inside a thread, will spark then create a new one, or is the error caught inside the thread, preventing it from stopping? So in short, does spark guarantee

Re: Best way to randomly distribute elements

2015-06-18 Thread Guillaume Pitel
I think you can randomly reshuffle your elements just by emitting a random key (mapping a PairRdd's key triggers a reshuffle IIRC): yourrdd.map{ x => (rand(), x) }. There is obviously a risk that rand() will give the same sequence of numbers in each partition, so you may need to use
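
A sketch of this approach in Scala; randomReshuffle and numPartitions are illustrative names, not a Spark API, and the per-partition seed addresses the identical-sequence risk mentioned above:

```scala
import scala.util.Random
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Tag each element with a random key, hash-partition on that key, drop the key.
// Seeding the RNG with the partition index keeps partitions from all
// producing the same pseudo-random sequence.
def randomReshuffle[T: scala.reflect.ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    val rng = new Random(idx)
    iter.map(x => (rng.nextInt(numPartitions), x))
  }.partitionBy(new HashPartitioner(numPartitions))
   .values
```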

Re: connect mobile app with Spark backend

2015-06-18 Thread Akhil Das
Why not something like your mobile app pushes data to your webserver which pushes the data to Kafka or Cassandra or any other database and have a Spark streaming job running all the time operating on the incoming data and pushes the calculated values back. This way, you don't have to start a spark

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-18 Thread Ji ZHANG
Hi, We switched from ParallelGC to CMS, and the symptom is gone. On Thu, Jun 4, 2015 at 3:37 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this setting can be seen in web ui's environment tab. But, it still eats memory, i.e.

Issue with PySpark UDF on a column of Vectors

2015-06-18 Thread calstad
I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here: from pyspark import SparkContext from pyspark.sql import Row from pyspark.sql.types import DoubleType from pyspark.sql.functions import udf from pyspark.mllib.linalg import Vectors FeatureRow =

Re: Best way to randomly distribute elements

2015-06-18 Thread Himanshu Mehra
Hi abellet, You can try RDD.randomSplit(weights array), where the weights array is the array of weights you want to put in the consecutive splits. For example, RDD.randomSplit(Array(0.7, 0.3)) will create two RDDs containing 70% of the data in one and 30% in the other, randomly selecting the
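
For illustration, assuming an existing rdd; note that randomSplit returns an array of RDDs and assigns each element independently, so the 70/30 proportions are approximate:

```scala
// Weighted random split; the fixed seed makes the split reproducible.
val Array(bigger, smaller) = rdd.randomSplit(Array(0.7, 0.3), seed = 42L)
```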

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
Yeah, that's the problem. There is probably some perfect num of partitions that provides the best balance between partition size and memory and merge overhead. Though it's not an ideal solution :( There could be another way but very hacky... for example if you store one sketch in a singleton per

Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Guillaume Pitel
Hi, I'm trying to figure out the smartest way to implement a global count-min-sketch on accumulators. For now, we are doing that with RDDs. It works well, but with one sketch per partition, merging takes too long. As you probably know, a count-min sketch is a big mutable array of array of

Spark and Google Cloud Storage

2015-06-18 Thread Klaus Schaefers
Hi, is there a kind of adapter to use GoogleCloudStorage with Spark? Cheers, Klaus -- -- Klaus Schaefers Senior Optimization Manager Ligatus GmbH Hohenstaufenring 30-32 D-50674 Köln Tel.: +49 (0) 221 / 56939 -784 Fax: +49 (0) 221 / 56 939 - 599 E-Mail: klaus.schaef...@ligatus.com Web:

kafka spark streaming working example

2015-06-18 Thread Bartek Radziszewski
hi, I'm trying to run simple kafka spark streaming example over spark-shell: sc.stop import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import kafka.serializer.DefaultDecoder import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com: I was thinking exactly the same. I'm going to try it, It doesn't really matter if I lose an executor, since its sketch will be lost, but then reexecuted somewhere else. I mean that between the action that will update the

Got the exception when joining RDD with spark streamRDD

2015-06-18 Thread Groupme
Hi, I am writing a pyspark stream program. I have the training data set to compute the regression model. I want to use the stream data set to test the model. So, I join the RDD with the StreamRDD, but I got the exception. Following is my source code, and the exception I got. Any help is

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread twinkle sachdeva
Hi, UpdateStateByKey: if you can briefly describe the issue you are facing with this, that will be great. Regarding not keeping the whole dataset in memory, you can tweak the remember parameter such that it does checkpoint at the appropriate time. Thanks Twinkle On Thursday, June 18, 2015, Nipun Arora

Re: Executor memory allocations

2015-06-18 Thread Richard Marscher
It would be the 40%, although it's probably better to think of it as shuffle vs. data cache, with the remainder going to tasks. The comments for the shuffle memory fraction configuration clarify that it takes memory at the expense of the storage/data cache fraction:
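
A sketch of the two settings in question (the Spark 1.x static memory model; the 0.4 values are illustrative):

```scala
import org.apache.spark.SparkConf

// Of the executor heap, the storage fraction caches data and the shuffle
// fraction backs shuffle buffers; whatever neither claims is working
// memory for tasks.
val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.4")  // default 0.6
  .set("spark.shuffle.memoryFraction", "0.4")  // default 0.2
```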

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Ayman Farahat
Thanks Sabarish and Nick Would you happen to have some code snippets that you can share. Best Ayman On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Nick is right. I too have implemented this way and it works just fine. In my case, there can be even

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
Hi All, I appreciate the help :) Here is a sample code where I am trying to keep the data of the previous RDD and the current RDD in a foreachRDD in spark streaming. I do not know if the bottom code technically works as I cannot compile it, but I am trying to in a way keep the historical reference

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also not sure how threading helps here, because Spark assigns a partition to each core. On each core there may be multiple threads if you are using Intel hyperthreading, but I will let Spark handle the threading. On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com wrote: We

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS dgemm based calculation. On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat ayman.fara...@yahoo.com.invalid wrote: Thanks Sabarish and Nick Would you happen to have some code snippets that you can share. Best Ayman On Jun
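
A sketch of the local dense-matrix path in 1.4's mllib.linalg, which delegates to BLAS gemm through netlib-java (falling back to a pure-JVM implementation when no native BLAS is installed); the values are illustrative:

```scala
import org.apache.spark.mllib.linalg.DenseMatrix

// Values are column-major: a is 2x3, b is 3x2, the product c is 2x2.
val a = new DenseMatrix(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val b = new DenseMatrix(3, 2, Array(1.0, 0.0, 0.0, 1.0, 1.0, 1.0))
val c: DenseMatrix = a.multiply(b)
```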

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Debasish Das
Also in my experiments, it's much faster to do blocked BLAS through cartesian rather than doing sc.union. Here are the details on the experiments: https://issues.apache.org/jira/browse/SPARK-4823 On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com wrote: Also not sure how

Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
btw, the user list will be a better place for this thread. On Thu, Jun 18, 2015 at 8:19 AM, Yin Huai yh...@databricks.com wrote: Is it the full stack trace? On Thu, Jun 18, 2015 at 6:39 AM, Sea 261810...@qq.com wrote: Hi, all: I want to run spark sql on yarn(yarn-client), but ... I already

Re: RE: Spark or Storm

2015-06-18 Thread Cody Koeninger
That general description is accurate, but not really a specific issue of the direct stream. It applies to anything consuming from kafka (or, as Matei already said, any streaming system really). You can't have exactly-once semantics unless you know something more about how you're storing results.

Re: Loading lots of parquet files into dataframe from s3

2015-06-18 Thread lovelylavs
You can do something like this: ObjectListing objectListing; do { objectListing = s3Client.listObjects(listObjectsRequest); for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) { if

Re: Submitting Spark Applications using Spark Submit

2015-06-18 Thread lovelylavs
Hi, To include the jar files as part of the jar which you would like to use, you should create an uber jar. Please refer to the following: https://maven.apache.org/plugins/maven-shade-plugin/examples/includes-excludes.html -- View this message in context:

Re: Issue with PySpark UDF on a column of Vectors

2015-06-18 Thread Xiangrui Meng
This is a known issue. See https://issues.apache.org/jira/browse/SPARK-7902 -Xiangrui On Thu, Jun 18, 2015 at 6:41 AM, calstad colin.als...@gmail.com wrote: I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here: from pyspark import SparkContext from

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
@Twinkle - what did you mean by Regarding not keeping whole dataset in memory, you can tweak the parameter of remember, such that it does checkpoint at appropriate time? On Thu, Jun 18, 2015 at 11:40 AM, Nipun Arora nipunarora2...@gmail.com wrote: Hi All, I appreciate the help :) Here is a

different schemas per row with DataFrames

2015-06-18 Thread Alex Nastetsky
I am reading JSON data that has different schemas for every record. That is, for a given field that would have a null value, it's simply absent from that record (and therefore, its schema). I would like to use the DataFrame API to select specific fields from this data, and for fields that are

Hivecontext going out-of-sync issue

2015-06-18 Thread Ranadip Chatterjee
Hi All. I have a partitioned table in Hive. The use case is to drop one of the partitions before inserting new data every time the Spark process runs. I am using the Hivecontext to read and write (dynamic partitions) and also to alter the table to drop the partition before insert. Everything runs

[Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi, I have the following piece of code, where I am trying to transform a spark stream and add min and max to it of eachRDD. However, I get an error saying max call does not exist, at run-time (compiles properly). I am using spark-1.4 I have added the question to stackoverflow as well:

problem with pants building

2015-06-18 Thread peixin li
Hi, We use pants to build a python project into an executable python file (pex). But we cannot run the pex file until we add all necessary library paths to PYTHONPATH and use pip to install necessary packages for $SPARK_HOME/python/lib/pyspark.zip specially. Since we have already added the unzipped

Re: understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-18 Thread Tathagata Das
Also, could you give a screenshot of the streaming UI. Even better, could you run it on Spark 1.4 which has a new streaming UI and then use that for debugging/screenshot? TD On Thu, Jun 18, 2015 at 3:05 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Which version of spark? and what is your

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
Why do you need to uniquely identify the message? All you need is the time when the message was inserted by the receiver, and when it is processed, isn't it? On Thu, Jun 18, 2015 at 2:28 PM, anshu shukla anshushuk...@gmail.com wrote: Thanks a lot, but I have already tried the second way

NaiveBayes for MLPipeline is absent

2015-06-18 Thread Justin Yip
Hello, Currently, there is no NaiveBayes implementation for MLpipeline. I couldn't find the JIRA ticket related to it either (or maybe I missed it). Is there a plan to implement it? If no one has the bandwidth, I can work on it. Thanks. Justin -- View this message in context:

Re: createDirectStream and Stats

2015-06-18 Thread Tim Smith
Thanks for the super-fast response, TD :) I will now go bug my hadoop vendor to upgrade from 1.3 to 1.4. Cloudera, are you listening? :D On Thu, Jun 18, 2015 at 7:02 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Are you using Spark 1.3.x ? That explains. This issue has been fixed in

Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Sea
Hi, all: I want to run spark sql on yarn(yarn-client), but ... I already set spark.yarn.jar and spark.jars in conf/spark-defaults.conf. ./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 game.txt Exception in thread main java.lang.NoClassDefFoundError:

Re: SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Burak Yavuz
Hey Nathan, I like the first idea better. Let's see what others think. I'd be happy to review your PR afterwards! Best, Burak On Thu, Jun 18, 2015 at 9:53 PM, Nathan McCarthy nathan.mccar...@quantium.com.au wrote: Hey, Spark Submit adds maven central and spark bintray to the ChainResolver

[SparkSQL]. MissingRequirementError when creating dataframe from RDD (new error in 1.4)

2015-06-18 Thread Adam Lewandowski
Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR

Re: Does MLLib has attribute importance?

2015-06-18 Thread Debasish Das
Running L1 and picking non-zero coefficients gives a good estimate of interesting features as well... On Jun 17, 2015 4:51 PM, Xiangrui Meng men...@gmail.com wrote: We don't have it in MLlib. The closest would be the ChiSqSelector, which works for categorical data. -Xiangrui On Thu, Jun 11,

Re: createDirectStream and Stats

2015-06-18 Thread Tathagata Das
Are you using Spark 1.3.x ? That explains. This issue has been fixed in Spark 1.4.0. Bonus you get a fancy new streaming UI with more awesome stats. :) On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith secs...@gmail.com wrote: Hi, I just switched from createStream to the createDirectStream API for

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
Are you writing to an existing hive orc table? On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian lian.cs@gmail.com wrote: Thanks for reporting this. Would you mind helping create a JIRA for this? On 6/16/15 2:25 AM, patcharee wrote: I found if I move the partitioned columns in schemaString

RE: RE: Spark or Storm

2015-06-18 Thread prajod.vettiyattil
More details on the Direct API of Spark 1.3 is at the databricks blog: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html Note the use of checkpoints to persist the Kafka offsets in Spark Streaming itself, and not in zookeeper. Also this
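
A minimal sketch of the direct API, assuming Spark 1.3+, the spark-streaming-kafka artifact, and an existing StreamingContext ssc; the broker address and topic are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))

stream.foreachRDD { rdd =>
  // The direct stream computes an exact offset range per partition per batch;
  // persisting these atomically together with your output is what makes
  // end-to-end exactly-once possible.
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write results and offsets in one transaction ...
}
```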

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
If you are writing to an existing hive table, our insert into operator follows hive's requirement, which is *the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause*. You can find

Re: Error when connecting to Spark SQL via Hive JDBC driver

2015-06-18 Thread ogoh
hello, I am not sure what is wrong.. But, in my case, I followed the instruction from http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html. It worked fine with SQuirreL SQL Client (http://squirrel-sql.sourceforge.net/), and SQL Workbench J

Re: [SparkSQL]. MissingRequirementError when creating dataframe from RDD (new error in 1.4)

2015-06-18 Thread Michael Armbrust
Thanks for reporting. Filed as: https://issues.apache.org/jira/browse/SPARK-8470 On Thu, Jun 18, 2015 at 5:35 PM, Adam Lewandowski adam.lewandow...@gmail.com wrote: Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
I saw another report so I filed it already: Filed as: https://issues.apache.org/jira/browse/SPARK-8470 On Thu, Jun 18, 2015 at 4:07 PM, Chad Urso McDaniel cha...@gmail.com wrote: We're using the normal command line: --- bin/spark-submit --properties-file ./spark-submit.conf --class

Build spark application into uber jar

2015-06-18 Thread bit1...@163.com
Hi sparks, I have a spark streaming application that is a maven project, and I would like to build it into an uber jar and run it in the cluster. I have found two options to build the uber jar; either of them has its shortcomings, so I would ask how you guys do it. Thanks. 1. Use the maven shade

Interaction between StringIndexer feature transformer and CrossValidator

2015-06-18 Thread cyz
Hi, I encountered errors fitting a model using a CrossValidator. The training set contained a feature which was initially a String with many unique values. I used a StringIndexer to transform this feature column into label indices. Fitting a model with a regular pipeline worked fine, but I ran

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-18 Thread Haopu Wang
Akhil, From my test, I can see the files in the last batch will always be reprocessed upon restarting from checkpoint, even for graceful shutdown. I think usually the file is expected to be processed only once. Maybe this is a bug in fileStream? Or do you know any approach to work around it?

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Sorry Du, Repartition means coalesce(shuffle = true) as per [1]. They are the same operation. Coalescing with shuffle = false means you are specifying the max number of partitions after the coalesce (if there are fewer partitions you will end up with the lesser amount). [1]
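
For reference, the RDD API defines repartition directly in terms of coalesce (Spark 1.x source, lightly abridged):

```scala
// org.apache.spark.rdd.RDD (abridged): repartition is a shuffle-enabled coalesce.
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] =
  coalesce(numPartitions, shuffle = true)
```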

Coalescing with shuffle = false in imbalanced cluster

2015-06-18 Thread Corey Nolet
I'm confused about this. The comment on the function seems to indicate that there is absolutely no shuffle or network IO but it also states that it assigns an even number of parent partitions to each final partition group. I'm having trouble seeing how this can be guaranteed without some data

how to change /tmp folder for spark ut use sbt

2015-06-18 Thread yuemeng (A)
Hi all, if I want to change the /tmp folder to any other folder for the Spark unit tests run with sbt, how can I do that?

SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Nathan McCarthy
Hey, Spark Submit adds maven central and spark bintray to the ChainResolver before it adds any external resolvers. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821 When running on a cluster without internet access, this means the

createDirectStream and Stats

2015-06-18 Thread Tim Smith
Hi, I just switched from createStream to the createDirectStream API for kafka and while things otherwise seem happy, the first thing I noticed is that stream/receiver stats are gone from the Spark UI :( Those stats were very handy for keeping an eye on health of the app. What's the best way to

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi Tathagata, When you say please mark spark-core and spark-streaming as dependencies how do you mean? I have installed the pre-built spark-1.4 for Hadoop 2.6 from spark downloads. In my maven pom.xml, I am using version 1.4 as described. Please let me know how I can fix that? Thanks Nipun On

Specify number of partitions with which to run DataFrame.join?

2015-06-18 Thread Matt Cheah
Hi everyone, I'm looking into switching raw RDD operations to DataFrames operations. When I used JavaPairRDD.join(), I had the option to specify the number of partitions with which to do the join. However, I don't see an equivalent option in DataFrame.join(). Is there a way to specify the
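
One workaround sketch, assuming a sqlContext and hypothetical frames dfA and dfB: DataFrame joins take their partition count from the spark.sql.shuffle.partitions setting (default 200) rather than from a per-join argument:

```scala
// Raise the SQL shuffle partition count before running the join.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
val joined = dfA.join(dfB, dfA("id") === dfB("id"))
```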

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-18 Thread Tathagata Das
Glad to hear that. :) On Thu, Jun 18, 2015 at 6:25 AM, Ji ZHANG zhangj...@gmail.com wrote: Hi, We switched from ParallelGC to CMS, and the symptom is gone. On Thu, Jun 4, 2015 at 3:37 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I set spark.shuffle.io.preferDirectBufs to false in

Spark-sql versus Impala versus Hive

2015-06-18 Thread Sanjay Subramanian
I just published the results of my findings here: https://bigdatalatte.wordpress.com/2015/06/18/spark-sql-versus-impala-versus-hive/

Re: Serial batching with Spark Streaming

2015-06-18 Thread Binh Nguyen Van
I haven't tried with 1.4 but I tried with 1.3 a while ago and I could not get the serialized behavior by using the default scheduler when there is failure and retry, so I created a customized stream like this. class EachSeqRDD[T: ClassTag] ( parent: DStream[T], eachSeqFunc: (RDD[T], Time) => Unit

Latency between the RDD in Streaming

2015-06-18 Thread anshu shukla
Is there any fixed way to find among RDD in stream processing systems, in the distributed set-up. -- Thanks Regards, Anshu Shukla

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Ayman Farahat
Thanks all for the help. It turned out that using the numpy matrix multiplication made a huge difference in performance. I suspect that numpy already uses BLAS optimized code. Here is the Python code: #This is where I load and directly test the predictions myModel =

Re: Got the exception when joining RDD with spark streamRDD

2015-06-18 Thread Davies Liu
This seems to be a bug, could you file a JIRA for it? RDDs should be serializable for Streaming jobs. On Thu, Jun 18, 2015 at 4:25 AM, Groupme grou...@gmail.com wrote: Hi, I am writing a pyspark stream program. I have the training data set to compute the regression model. I want to use the stream

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-18 Thread Corey Nolet
This is not independent programmatic way of running of Spark job on Yarn cluster. The example I created simply demonstrates how to wire up the classpath so that spark submit can be called programmatically. For my use case, I wanted to hold open a connection so I could send tasks to the executors

Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
Hi All, I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a spark cluster in Ec2 stand alone mode with 8 workers running on 2 slaves with 160 GB Ram and 40 VCPU. My Code is as Follows: def convert_into_sparse_vector(A):

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Nick Pentreath
Yup, numpy calls into BLAS for matrix multiply. Sent from my iPad On 18 Jun 2015, at 8:54 PM, Ayman Farahat ayman.fara...@yahoo.com wrote: Thanks all for the help. It turned out that using the numpy matrix multiplication made a huge difference in performance. I suspect that numpy already

MLIB-KMEANS: Py4JNetworkError: An error occurred while trying to connect to the Java server , on a huge data set

2015-06-18 Thread rogersjeffreyl
Hi All, I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a spark cluster in Ec2 stand alone mode with 8 workers running on 2 slaves with 160 GB Ram and 40 VCPU. *My Code is as Follows:* def convert_into_sparse_vector(A):

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Tathagata Das
I think you may be including a different version of Spark Streaming in your assembly. Please mark spark-core and spark-streaming as provided dependencies. Any installation of Spark will automatically provide Spark in the classpath, so you do not have to bundle it. On Thu, Jun 18, 2015 at 8:44 AM,
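
In a Maven pom.xml that means <scope>provided</scope> on each Spark dependency; the sbt equivalent is a sketch like:

```scala
// build.sbt sketch: "provided" keeps Spark out of the assembly jar,
// since any Spark installation already supplies these classes at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.0" % "provided"
)
```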

Re: Serial batching with Spark Streaming

2015-06-18 Thread Michal Čizmazia
Tathagata, thanks for your response. You are right! Everything seems to work as expected. Please could you help me understand why the time for processing of all jobs for a batch is always less than 4 seconds? Please see my playground code below. The last modified time of the input (lines) RDD dump

Re: Does MLLib has attribute importance?

2015-06-18 Thread Xiangrui Meng
ChiSqSelector takes an RDD of labeled points, where the label is the target. See https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L120 On Wed, Jun 17, 2015 at 10:22 PM, Ruslan Dautkhanov dautkha...@gmail.com wrote: Thank you
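
A usage sketch, assuming data: RDD[LabeledPoint] with categorical (binned) features and an illustrative choice of 50 features:

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint

// Rank features by a chi-squared test against the label (the target),
// then keep only the top 50 in each vector.
val selector = new ChiSqSelector(50)
val model = selector.fit(data)
val reduced = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
```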

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
Its not clear what you are asking. Find what among RDD? On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla anshushuk...@gmail.com wrote: Is there any fixed way to find among RDD in stream processing systems , in the Distributed set-up . -- Thanks Regards, Anshu Shukla

Re: Serial batching with Spark Streaming

2015-06-18 Thread Michal Čizmazia
Tathagata, Please could you confirm that batches are not processed in parallel during retries in Spark 1.4? See Binh's email copied below. Any pointers for workarounds if necessary? Thanks! On 18 June 2015 at 14:29, Binh Nguyen Van binhn...@gmail.com wrote: I haven’t tried with 1.4 but I

Re: Latency between the RDD in Streaming

2015-06-18 Thread anshu shukla
Sorry, I missed the LATENCY word.. for a large streaming query: how to find the time taken by a particular RDD to travel from the initial D-STREAM to the final/last D-STREAM. Help please!! On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das t...@databricks.com wrote: Its not clear what you are

Re: Does MLLib has attribute importance?

2015-06-18 Thread Ruslan Dautkhanov
Got it. Thanks! -- Ruslan Dautkhanov On Thu, Jun 18, 2015 at 1:02 PM, Xiangrui Meng men...@gmail.com wrote: ChiSqSelector calls an RDD of labeled points, where the label is the target. See

Re: Latency between the RDD in Streaming

2015-06-18 Thread anshu shukla
Thanks a lot, but I have already tried the second way. The problem with that is how to identify a particular RDD from source to sink (as we can do by passing a msg id in storm). For that I just updated the RDD and added a msgID (as a static variable), but while dumping them to file some of the

The Initial job has not accepted any resources error; can't seem to set

2015-06-18 Thread dgoldenberg
Hi, I'm running Spark Standalone on a single node with 16 cores. Master and 4 workers are running. I'm trying to submit two applications via spark-submit and am getting the following error when submitting the second one: Initial job has not accepted any resources; check your cluster UI to ensure

Re: The Initial job has not accepted any resources error; can't seem to set

2015-06-18 Thread dgoldenberg
I just realized that --conf needs to be one key-value pair per line. And somehow I needed --conf spark.cores.max=2 \ However, when it was --conf spark.deploy.defaultCores=2 \ then one job would take up all 16 cores on the box. What's the actual model here? We've got 10 apps
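
A sketch of the distinction: spark.cores.max is a per-application cap, while spark.deploy.defaultCores is a master-side default applied only to apps that set no cap, which is why passing it with --conf on one app had no effect:

```scala
import org.apache.spark.SparkConf

// Cap this application at 2 cores so several such apps fit on a 16-core box.
val conf = new SparkConf()
  .setAppName("capped-app")
  .set("spark.cores.max", "2")
```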

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
Couple of ways. 1. Easy but approx way: Find scheduling delay and processing time using the StreamingListener interface, and then calculate end-to-end delay = 0.5 * batch interval + scheduling delay + processing time. The 0.5 * batch interval is the approx average batching delay across all the records
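
A sketch of option 1, using the real StreamingListener interface; the class name and the 2000 ms interval are illustrative:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class DelayListener(batchIntervalMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // end-to-end ~= half the batch interval (avg batching delay)
    //               + scheduling delay + processing time
    val delayMs = 0.5 * batchIntervalMs +
      info.schedulingDelay.getOrElse(0L) +
      info.processingDelay.getOrElse(0L)
    println(s"approx end-to-end delay: $delayMs ms")
  }
}
// register it: ssc.addStreamingListener(new DelayListener(2000))
```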

Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Michael Armbrust
I would also love to see a more recent version of Spark SQL. There have been a lot of performance improvements between 1.2 and 1.4 :) On Thu, Jun 18, 2015 at 3:18 PM, Steve Nunez snu...@hortonworks.com wrote: Interesting. What were the Hive settings? Specifically it would be useful to know

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
How are you adding com.rr.data.Visit to spark? With --jars? It is possible we are using the wrong classloader. Could you open a JIRA? On Thu, Jun 18, 2015 at 2:56 PM, Chad Urso McDaniel cha...@gmail.com wrote: We are seeing class exceptions when converting to a DataFrame. Anyone out there

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
I got the same problem with rdd.repartition() in my streaming app, which generated a few huge partitions and many tiny partitions. The resulting high data skew makes the processing time of a batch unpredictable and often exceeding the batch interval. I eventually solved the problem by using

confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Chad Urso McDaniel
We are seeing class exceptions when converting to a DataFrame. Anyone out there with some suggestions on what is going on? Our original intention was to use a HiveContext to write ORC, and we saw the error there and have narrowed it down. This is an example of our code: --- def

Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Steve Nunez
Interesting. What were the Hive settings? Specifically it would be useful to know if this was Hive on Tez. - Steve From: Sanjay Subramanian Reply-To: Sanjay Subramanian Date: Thursday, June 18, 2015 at 11:08 To: user@spark.apache.org Subject: Spark-sql versus Impala

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Chad Urso McDaniel
We're using the normal command line: --- bin/spark-submit --properties-file ./spark-submit.conf --class com.rr.data.visits.VisitSequencerRunner ./mvt-master-SNAPSHOT-jar-with-dependencies.jar --- Our jar contains both com.rr.data.visits.orc.OrcReadWrite (which you can see in the stack trace) and

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, Du Li l...@yahoo-inc.com.invalid wrote: I got the same problem with rdd.repartition() in my streaming app, which generated a few huge partitions and many tiny partitions. The resulting high data skew makes the processing

Re: Submitting Spark Applications using Spark Submit

2015-06-18 Thread maxdml
You can specify the jars of your application to be included with spark-submit with the --jars switch. Otherwise, are you sure that your newly compiled spark jar assembly is in assembly/target/scala-2.10/? -- View this message in context:

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Xiangrui Meng
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to store the cluster centers. That is ~600MB. If there are 10 partitions, you might need 6GB on the driver to collect updates from workers. I guess the driver died. Did you specify driver memory with spark-submit? -Xiangrui On

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
repartition() means coalesce(shuffle=false) On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, Du Li l...@yahoo-inc.com.invalid wrote: I got the same problem with rdd.repartition() in my

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
I am submitting the application from a python notebook. I am launching pyspark as follows: SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g SPARK_DAEMON_JAVA_OPTS=-XX:MaxPermSize=30g -Xms30g -Xmx30g IPYTHON=1

RE: Machine Learning on GraphX

2015-06-18 Thread Evo Eftimov
What is GraphX: - It can be viewed as a kind of Distributed, Parallel, Graph Database - It can be viewed as a Graph Data Structure (Data Structures 101 from your CS course) - It features some off-the-shelf algos for Graph Processing and Navigation (Algos and Data

Re: *Metrics API is odd in MLLib

2015-06-18 Thread Sam
Firstly apologies for the header of my email containing some junk, I believe it's due to a copy and paste error on a smart phone. Thanks for your response. I will indeed make the PR you suggest, though glancing at the code I realize it's not just a case of making these public since the types are

Re: understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-18 Thread Akhil Das
Which version of spark? And what is your data source? For some reason, your processing delay is exceeding the batch duration. And it's strange that you are not seeing any scheduling delay. Thanks Best Regards On Thu, Jun 18, 2015 at 7:29 AM, Mike Fang chyfan...@gmail.com wrote: Hi, I have a

Fwd: mllib from sparkR

2015-06-18 Thread Elena Scardovi
Hi, I was wondering if it is possible to use MLlib functions inside SparkR, as outlined at the Spark Summit East 2015 Warmup meetup: http://www.meetup.com/Spark-NYC/events/220850389/ Are there available examples? Thank you! Elena
