RE: Issue when rebroadcasting a variable outside of the definition scope

2015-08-07 Thread Ganelin, Ilya
Simone, here are some thoughts. First, please check out the "Understanding Closures" section of the Spark Programming Guide. Secondly, broadcast variables do not propagate updates to the underlying data. You must either create a new broadcast variable or, alternatively, if you simply wish to accumulate
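
A minimal sketch of the re-broadcast approach described above, assuming a hypothetical lookup table that changes between batches; the variable names and values are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rebroadcast-sketch"))

    // Initial broadcast of a (hypothetical) lookup table.
    var table = Map("a" -> 1, "b" -> 2)
    var tableBc = sc.broadcast(table)

    // Later, after the underlying data changes, the old broadcast is NOT updated.
    // Release the stale copy and create a brand new broadcast variable instead.
    tableBc.unpersist()
    table = table + ("c" -> 3)
    tableBc = sc.broadcast(table)

    val rdd = sc.parallelize(Seq("a", "b", "c"))
    val counts = rdd.map(k => tableBc.value.getOrElse(k, 0)).collect()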

RE: How to read gzip data in Spark - Simple question

2015-08-05 Thread Ganelin, Ilya
Have you tried reading the Spark documentation? http://spark.apache.org/docs/latest/programming-guide.html Thank you, Ilya Ganelin -Original Message- From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com] Sent: Thursday, August 06, 2015 12:41 AM Eastern Standard Time

RE: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Ganelin, Ilya
To be unpersisted, the RDD must be persisted first. If its storage level is set to None, then it's not persisted and as such does not need to be freed. Does that make sense? Thank you, Ilya Ganelin -Original Message- From: Stahlman, Jonathan

Real-time data visualization with Zeppelin

2015-07-08 Thread Ganelin, Ilya
Hi all – I’m just wondering if anyone has had success integrating Spark Streaming with Zeppelin and actually dynamically updating the data in near real-time. From my investigation, it seems that Zeppelin will only allow you to display a snapshot of data, not a continuously updating table. Has

RE: Making Unpersist Lazy

2015-07-02 Thread Ganelin, Ilya
You may pass an optional parameter (blocking = false) to make it lazy. Thank you, Ilya Ganelin -Original Message- From: Jem Tucker [jem.tuc...@gmail.com] Sent: Thursday, July 02, 2015 04:06 AM Eastern Standard Time To: Akhil Das Cc: user Subject: Re: Making
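
For reference, a small sketch of the non-blocking unpersist described above; the RDD here is illustrative:

    val cached = sc.parallelize(1 to 1000).cache()
    cached.count()                     // materialize the cache

    // blocking = false returns immediately; cached blocks are removed asynchronously.
    cached.unpersist(blocking = false)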

RE: Performing sc.parallelize (..) in workers not in the driver program

2015-06-25 Thread Ganelin, Ilya
The parallelize operation accepts as input a data structure in memory. When you call it, you are necessarily operating in the memory space of the driver, since that is where user code executes. Until you have an RDD, you can't really operate in a distributed way. If your files are stored in a

RE: Matrix Multiplication and mllib.recommendation

2015-06-17 Thread Ganelin, Ilya
I actually talk about this exact thing in a blog post here: http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/. Keep in mind, you're actually doing a ton of math. Even with proper caching and use of broadcast variables this will

RE: Does long-lived SparkContext hold on to executor resources?

2015-05-11 Thread Ganelin, Ilya
Also check out the spark.cleaner.ttl property. Otherwise, you will accumulate shuffle metadata in the memory of the driver. -Original Message- From: Silvio Fiorito [silvio.fior...@granturing.com] Sent: Monday, May 11,

RE: ReduceByKey and sorting within partitions

2015-04-27 Thread Ganelin, Ilya
Marco - why do you want data sorted both within and across partitions? If you need to take an ordered sequence across all your data you need to either aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an ordered index to your data that matches the order it was stored on

RE: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-27 Thread Ganelin, Ilya
What command are you using to untar? Are you running out of disk space? -Original Message- From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com] Sent: Monday, April 27, 2015 11:44 AM Eastern Standard Time To: user Subject: Spark 1.3.1 Hadoop

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
...@boxever.com] Sent: Friday, April 24, 2015 11:04 AM Eastern Standard Time To: Ganelin, Ilya Cc: Spico Florin; user Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? The problem I'm facing is that I need to process lines from input

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
Michael - you need to sort your RDD. Check out the shuffle documentation in the Spark Programming Guide; it talks about this specifically. You can resolve this in a couple of ways - either by collecting your RDD and sorting it, using sortBy, or not worrying about the internal ordering. You can
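
A small sketch of the sortBy option mentioned above, assuming the lines have already been paired with their original position via zipWithIndex; the input path is illustrative:

    // (line, index) pairs produced by zipWithIndex on the input RDD.
    val indexed = sc.textFile("hdfs:///path/to/input").zipWithIndex()

    // Re-establish a global order by sorting on the index before further processing.
    val ordered = indexed.sortBy { case (_, idx) => idx }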

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
- From: Michal Michalski [michal.michal...@boxever.com] Sent: Friday, April 24, 2015 11:18 AM Eastern Standard Time To: Ganelin, Ilya Cc: Spico Florin; user Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? I

RE: Map Question

2015-04-23 Thread Ganelin, Ilya
You need to expose that variable the same way you'd expose any other variable in Python that you want to see across modules. As long as you share a SparkContext, all will work as expected. http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable

RE: spark with kafka

2015-04-18 Thread Ganelin, Ilya
Write the Kafka stream to HDFS via Spark Streaming, then ingest the files from HDFS via Spark. -Original Message- From: Shushant Arora [shushantaror...@gmail.com] Sent: Saturday, April 18, 2015 06:44 AM Eastern Standard Time To:

Re: mapPartitions - How Does it Works

2015-03-18 Thread Ganelin, Ilya
mapPartitions works as follows: 1) For each partition of your RDD, it provides an iterator over the values within that partition. 2) You then define a function that operates on that iterator. Thus if you do the following: val parallel = sc.parallelize(1 to 10, 3) parallel.mapPartitions( x =
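
To complete the truncated example above, here is one runnable version of the same idea, a sketch only; the function simply sums the values seen by each partition's iterator:

    val parallel = sc.parallelize(1 to 10, 3)

    // The function receives one Iterator[Int] per partition and must return an Iterator.
    val perPartitionSums = parallel.mapPartitions { iter =>
      Iterator(iter.sum)
    }

    perPartitionSums.collect()   // e.g. Array(6, 15, 34) for partitions (1-3), (4-6), (7-10)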

RE: Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Ganelin, Ilya
You're not using the broadcasted variable within your map operations. You're attempting to modify myObject directly, which won't work because you are modifying the serialized copy on the executor. You want to do myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.
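
A sketch of the corrected pattern, assuming a hypothetical MyStore class with insert and lookup methods like the object being discussed; all names here are illustrative:

    // Hypothetical store; the driver-side copy is the one that gets broadcast.
    class MyStore extends Serializable {
      private val data = scala.collection.mutable.Map[String, Int]()
      def insert(k: String, v: Int): Unit = data(k) = v
      def lookup(k: String): Option[Int] = data.get(k)
    }

    val myObject = new MyStore
    myObject.insert("seed", 1)                       // populate on the driver
    val myObjectBroadcasted = sc.broadcast(myObject)

    // On the executors, read through the broadcast handle, not the captured original.
    val hits = sc.parallelize(Seq("seed", "other"))
      .map(k => myObjectBroadcasted.value.lookup(k))
      .collect()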

RE: Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Ganelin, Ilya
When writing to HDFS, Spark will not overwrite existing files or directories. You must either manually delete these or use Hadoop's FileSystem Java API to remove them. -Original Message- From: Pavel Velikhov
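
A sketch of the FileSystem-based cleanup mentioned above, assuming the output path is known up front; the path and the rdd variable are illustrative:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputPath = new Path("hdfs:///user/me/job-output")   // hypothetical output location
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Remove any previous output so saveAsTextFile does not fail on an existing directory.
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true)    // recursive delete
    }

    rdd.saveAsTextFile(outputPath.toString)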

RE: RDD Partition number

2015-02-19 Thread Ganelin, Ilya
As Ted Yu points out, default block size is 128MB as of Hadoop 2.1. -Original Message- From: Ilya Ganelin [ilgan...@gmail.com] Sent: Thursday, February 19, 2015 12:13 PM Eastern Standard Time To: Alessandro Lulli;

RE: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Ganelin, Ilya
Hi all - I've spent a while playing with this. Two significant sources of speed up that I've achieved are 1) Manually multiplying the feature vectors and caching either the user or product vector 2) By doing so, if one of the RDDs is a global it becomes possible to parallelize this step by

RE: spark challenge: zip with next???

2015-01-29 Thread Ganelin, Ilya
Make a copy of your RDD with an extra entry in the beginning to offset it. Then you can zip the two RDDs and run a map to generate an RDD of differences. -Original Message- From: derrickburns [derrickrbu...@gmail.com] Sent:
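
A sketch of the same idea using zipWithIndex plus a join on adjacent indices rather than a literal copy-and-zip, since RDD.zip requires identical partitioning and element counts; the sample values are illustrative:

    val values = sc.parallelize(Seq(3.0, 5.0, 9.0, 10.0))

    val indexed = values.zipWithIndex().map { case (v, i) => (i, v) }
    val shifted = indexed.map { case (i, v) => (i - 1, v) }   // element i+1 keyed by i

    // Pair each element with its successor and compute the difference.
    val diffs = indexed.join(shifted)
      .sortByKey()
      .map { case (_, (current, next)) => next - current }

    diffs.collect()   // e.g. Array(2.0, 4.0, 1.0)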

RE: quickly counting the number of rows in a partition?

2015-01-13 Thread Ganelin, Ilya
] Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time To: Kevin Burton Cc: Ganelin, Ilya; user@spark.apache.org Subject: Re: quickly counting the number of rows in a partition? Hi, On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya ilya.gane...@capitalone.com

Re: configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread Ganelin, Ilya
There are two related options. To solve your problem directly, try: val conf = new SparkConf().set("spark.yarn.driver.memoryOverhead", "1024") val sc = new SparkContext(conf) And the second, which increases the overall memory available on the driver: as part of your spark-submit script add:
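
The spark-submit flag named next is cut off above and left as-is; for reference, --conf spark.yarn.driver.memoryOverhead=1024 and --driver-memory are existing spark-submit options that correspond to the two approaches. A runnable sketch of the first, conf-based option, with the property and value quoted as strings (the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("overhead-example")                    // illustrative app name
      .set("spark.yarn.driver.memoryOverhead", "1024")   // off-heap overhead for the driver container, in MB
    val sc = new SparkContext(conf)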

RE: MatrixFactorizationModel serialization

2015-01-07 Thread Ganelin, Ilya
Try loading the features as val userFeatures = sc.objectFile(path1) and val productFeatures = sc.objectFile(path2), and then call the constructor of MatrixFactorizationModel with those. -Original Message- From: wanbo [gewa...@163.com]
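
A sketch of the reconstruction described above, assuming the feature RDDs were written with saveAsObjectFile as RDD[(Int, Array[Double])] and a Spark version in which the MatrixFactorizationModel constructor is callable from user code; paths, rank, and ids are illustrative:

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
    import org.apache.spark.rdd.RDD

    val rank = 10   // must match the rank used when the model was trained

    val userFeatures: RDD[(Int, Array[Double])] =
      sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
    val productFeatures: RDD[(Int, Array[Double])] =
      sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")

    val model = new MatrixFactorizationModel(rank, userFeatures, productFeatures)
    val score = model.predict(42, 17)   // hypothetical user and product ids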

HDFS_DELEGATION_TOKEN errors after switching Spark Contexts

2015-01-06 Thread Ganelin, Ilya
Hi all. In order to get Spark to properly release memory during batch processing, as a workaround to issue https://issues.apache.org/jira/browse/SPARK-4927, I tear down and re-initialize the SparkContext with context.stop() and context = new SparkContext(). The problem I run into is that

Re: Long-running job cleanup

2014-12-31 Thread Ganelin, Ilya
426.7 MB Free 402.5 MB Free 402.5 MB Free 426.7 MB From: Ganelin, Ilya ilya.gane...@capitalone.com Date: Tuesday, December 30, 2014 at 7:30 PM To: Ilya Ganelin ilgan...@gmail.com, Patrick Wendell pwend...@gmail.com

Re: Long-running job cleanup

2014-12-30 Thread Ganelin, Ilya
: Ilya Ganelin ilgan...@gmail.com Date: Sunday, December 28, 2014 at 4:02 PM To: Patrick Wendell pwend...@gmail.com, Ganelin, Ilya ilya.gane...@capitalone.com Cc: user@spark.apache.org

Long-running job cleanup

2014-12-22 Thread Ganelin, Ilya
Hi all, I have a long-running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long, eventually the overhead of Spark shuffles starts to accumulate, culminating in the driver starting to swap. I am aware of the spark.cleaner.ttl parameter that

Understanding disk usage with Accumulators

2014-12-16 Thread Ganelin, Ilya
Hi all – I’m running a long-running batch-processing job with Spark through YARN. I am doing the following batch process: val resultsArr = sc.accumulableCollection(mutable.ArrayBuffer[ListenableFuture[Result]]()) InMemoryArray.forEach{ 1) Using a thread pool, generate callable jobs that

Re: Understanding disk usage with Accumulators

2014-12-16 Thread Ganelin, Ilya
Also, this may be related to this issue https://issues.apache.org/jira/browse/SPARK-3885. Further, to clarify, data is being written to Hadoop on the data nodes. Would really appreciate any help. Thanks! From: Ganelin, Ilya ilya.gane...@capitalone.com

Re: MLLib in Production

2014-12-10 Thread Ganelin, Ilya
Hi all – I’ve been storing the model’s internally generated userFeatures and productFeatures vectors serialized on disk, and importing them in a separate job. From: Sonal Goyal sonalgoy...@gmail.com Date: Wednesday, December 10, 2014 at 5:31 AM To: Yanbo Liang

RE: Spark executor lost

2014-12-03 Thread Ganelin, Ilya
You want to look further up the stack (there are almost certainly other errors before this happens), and those other errors may give you a better idea of what is going on. Also, if you are running on YARN you can run yarn logs -applicationId yourAppId to get the logs from the data nodes.

SaveAsTextFile brings down data nodes with IO Exceptions

2014-12-02 Thread Ganelin, Ilya
Hi all, as the last stage of execution, I am writing out a dataset to disk. Before I do this, I force the DAG to resolve so this is the only job left in the pipeline. The dataset in question is not especially large (a few gigabytes). During this step, however, HDFS will inevitably crash. I will

Re: ALS failure with size Integer.MAX_VALUE

2014-11-29 Thread Ganelin, Ilya
Hi Bharath – I’m unsure if this is your problem, but the MatrixFactorizationModel in MLlib, which is the underlying component for ALS, expects your user/product fields to be integers. Specifically, the input to ALS is an RDD[Rating] and Rating is an (Int, Int, Double). I am wondering if perhaps

Cancelled Key Exceptions on Massive Join

2014-11-14 Thread Ganelin, Ilya
Hello all. I have been running a Spark job that eventually needs to do a large join (24 million x 150 million). A broadcast join is clearly infeasible in this instance, so I am instead attempting to do it with hash partitioning by defining a custom partitioner as: class
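
The class definition is cut off above; a sketch of what a custom hash partitioner for integer keys typically looks like (the modulus scheme and bucket count are assumptions, not necessarily what the original message used):

    import org.apache.spark.Partitioner

    // Hash-partitions integer keys into a fixed number of buckets.
    class IntKeyPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = key match {
        case k: Int => math.abs(k.hashCode) % partitions
        case _      => 0
      }
      override def equals(other: Any): Boolean = other match {
        case p: IntKeyPartitioner => p.numPartitions == numPartitions
        case _                    => false
      }
    }

    // Usage sketch: co-partition both sides before the large join.
    // bigA and bigB stand in for the two RDD[(Int, V)] datasets being joined.
    // val joined = bigA.partitionBy(new IntKeyPartitioner(2048))
    //                  .join(bigB.partitionBy(new IntKeyPartitioner(2048)))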

RE: Spark- How can I run MapReduce only on one partition in an RDD?

2014-11-13 Thread Ganelin, Ilya
Why do you only want the third partition? You can access individual partitions using the partitions() function. You can also filter your data using the filter() function to keep only the data you care about. Moreover, when you create your RDDs, unless you define a custom partitioner, you have
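
If a single partition really is needed, a sketch using mapPartitionsWithIndex to keep only one partition's contents; the RDD and partition number are illustrative:

    val rdd = sc.parallelize(1 to 100, 10)

    // Keep only the contents of partition 2 and drop everything else.
    val thirdPartition = rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 2) iter else Iterator.empty
    }

    thirdPartition.collect()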

RE: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread Ganelin, Ilya
To set the number of Spark cores used, you must set two parameters in the actual spark-submit script. You must set num-executors (the number of executors) and executor-cores (the number of cores per executor). Please see the Spark configuration and tuning pages for more details.

Re: Redploying a spark streaming application

2014-11-06 Thread Ganelin, Ilya
You’ve basically got it. The deployment step can simply be scp-ing the file to a known location on the server and then executing a run script on the server that actually runs spark-submit. From: Ashic Mahtab as...@live.com Date: Thursday, November 6, 2014 at 5:01 PM To:

Re: Key-Value decomposition

2014-11-03 Thread Ganelin, Ilya
Very straightforward: you want to use cartesian. If you have two RDDs - RDD_1("A") and RDD_2(1, 2, 3) - then RDD_1.cartesian(RDD_2) will generate the cross product between the two RDDs and you will have RDD_3(("A", 1), ("A", 2), ("A", 3)). On 11/3/14, 11:38 AM, david david...@free.fr wrote: Hi, I'm a

RE: Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread Ganelin, Ilya
Hi Jan. I've actually written a function recently to do precisely that, using the RDD.randomSplit function. You just need to calculate how big each element of your data is, then how many elements can fit in each RDD, to populate the weights passed to randomSplit. Unfortunately, in my case I wind up
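
A sketch of the weight calculation described above, assuming a rough per-element size estimate and a target size per split; both numbers and the input RDD are illustrative:

    val data = sc.parallelize(1 to 1000000)

    val bytesPerElement = 64L                     // rough estimate of one element's serialized size
    val targetBytesPerSplit = 8L * 1024 * 1024    // target ~8 MB per split
    val totalBytes = data.count() * bytesPerElement
    val numSplits = math.max(1, math.ceil(totalBytes.toDouble / targetBytesPerSplit).toInt)

    // Equal weights give roughly equal-sized (by our estimate) splits.
    val splits: Array[org.apache.spark.rdd.RDD[Int]] =
      data.randomSplit(Array.fill(numSplits)(1.0))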

RE: FileNotFoundException in appcache shuffle files

2014-10-29 Thread Ganelin, Ilya
Hi Ryan - I've been fighting the exact same issue for well over a month now. I initially saw the issue in 1.0.2 but it persists in 1.1. Jerry - I believe you are correct that this happens during a pause on long-running jobs on a large data set. Are there any parameters that you suggest tuning

GC Issues with randomSplit on large dataset

2014-10-29 Thread Ganelin, Ilya
Hey all – not writing to necessarily get a fix but more to get an understanding of what’s going on internally here. I wish to take a cross-product of two very large RDDs (using cartesian), the product of which is well in excess of what can be stored on disk. Clearly that is intractable, thus

Building Spark against Cloudera 5.2.0 - Failure

2014-10-28 Thread Ganelin, Ilya
Hello all, I am attempting to manually build the master branch of Spark against Cloudera’s 5.2.0 deployment. To do this I am running: mvn -Pyarn -Dhadoop.version=2.5.0-cdh5.2.0 -DskipTests clean package The build completes successfully and then I run: mvn -Pyarn -Phadoop.version=2.5.0-cdh5.2.0

Re: Spark-submit job Killed

2014-10-28 Thread Ganelin, Ilya
Hi Ami - I suspect that your code is completing because you have nothing to actually force resolution of your job. Spark executes lazily, so for example, if you have a bunch of maps in sequence but nothing else, Spark will not actually execute anything. Try adding an RDD.count() on the last RDD
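
A minimal illustration of the point above; without the count (or some other action), nothing actually executes. The input path is illustrative:

    val mapped = sc.textFile("hdfs:///path/to/input")   // illustrative path
      .map(_.toUpperCase)
      .filter(_.nonEmpty)

    // The transformations above are lazy; this action forces the job to run.
    val n = mapped.count()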

RE: Is it possible to call a transform + action inside an action?

2014-10-28 Thread Ganelin, Ilya
You cannot have nested RDD transformations in Scala Spark. The issue is that when the outer operation is distributed to the cluster and kicks off a new job (the inner query) the inner job no longer has the context for the outer job. The way around this is to either do a join on two RDDs or to

RE: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Ganelin, Ilya
Have you checked for any global variables in your scope? Remember that even if variables are not passed to the function they will be included as part of the context passed to the nodes. If you can't zen out what is breaking then try to simplify what you're doing. Set up a simple test call (like

RE: Use RDD like a Iterator

2014-10-28 Thread Ganelin, Ilya
Would RDD.map() do what you need? It will apply a function to every element of the RDD and return the resulting RDD. -Original Message- From: Zhan Zhang [zzh...@hortonworks.com] Sent: Tuesday, October 28, 2014 11:23 PM Eastern Standard Time To: Dai, Kevin Cc:

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ganelin, Ilya
Hi Xiangrui - I can certainly save the data before ALS - that would be a great first step. Why would reducing the number of partitions help? I would very much like to understand what's happening internally. Also, with regards to Burak's earlier comment, here is the JIRA referencing this problem.