Re: java.lang.NegativeArraySizeException when iterating a big RDD

2015-10-23 Thread Jem Tucker
Hi Yifan, I think this is a result of Kryo trying to serialize something too large. Have you tried increasing your partitioning? Cheers, Jem On Fri, Oct 23, 2015 at 11:24 AM Yifan LI wrote: > Hi, > > I have a big sorted RDD sRdd (~962 million elements), and need to scan
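A hedged sketch of the repartitioning suggestion above, together with a larger Kryo buffer ceiling, which is a common companion fix but not something suggested in this thread; the specific values are placeholders:

    import org.apache.spark.SparkConf

    // Raising the Kryo buffer ceiling helps when single records or task
    // results are large (512m is a placeholder value).
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "512m")

    // Increasing the partition count shrinks how much data any one task
    // must serialize at a time (2000 is a placeholder value).
    // val repartitioned = sRdd.repartition(2000)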

Re: Custom Partitioner

2015-09-02 Thread Jem Tucker
> On Tue, Sep 1, 2015 at 10:42 PM, Davies Liu <dav...@databricks.com> wrote: > >> You can take the sortByKey as example: >> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642 >> >> On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker <jem.tuc...@gmail.

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
Hi, You just need to extend Partitioner and override the numPartitions and getPartition methods, see below class MyPartitioner extends Partitioner { override def numPartitions: Int = // return the number of partitions; override def getPartition(key: Any): Int = // return the partition for a given key } On Tue,
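For reference, a compilable version of that skeleton; the hash-based getPartition body is an illustrative choice, not something prescribed in the thread:

    import org.apache.spark.Partitioner

    class MyPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0, "partitions must be positive")

      // Total number of partitions the partitioned RDD will have.
      override def numPartitions: Int = partitions

      // Map a key to an index in [0, numPartitions); the extra addition
      // keeps the result non-negative for negative hash codes.
      override def getPartition(key: Any): Int =
        ((key.hashCode % numPartitions) + numPartitions) % numPartitions
    }

It would then be applied with something like pairRdd.partitionBy(new MyPartitioner(8)).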

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
custom partitioner like > range partitioner. > > On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker <jem.tuc...@gmail.com> wrote: > >> Hi, >> >> You just need to extend Partitioner and override the numPartitions and >> getPartition methods, see below >> >>

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
com> wrote: > Hi > > I think range partitioner is not available in pyspark, so if we want > to create one, how should we do that? That is my question. > > On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker <jem.tuc...@gmail.com> wrote: > >> Ah sorry I misread your ques

Re: RDD from partitions

2015-08-28 Thread Jem Tucker
} else { iter.hasNext } } override def next(): Int = iter.next() } } }.collect().foreach(println) On Fri, Aug 28, 2015 at 12:33 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, I am trying to create an RDD from a selected number of its parent's partitions. My
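As an alternative to the iterator-based approach quoted above, a sketch using Spark's PartitionPruningRDD, which computes only the selected parent partitions; note it is a DeveloperApi class, and the parent RDD, the indices, and the live SparkContext sc are illustrative assumptions:

    import org.apache.spark.rdd.PartitionPruningRDD

    val parent = sc.parallelize(1 to 100, 10)  // parent RDD with 10 partitions
    val wanted = Set(0, 3, 7)                  // illustrative partition indices

    // Child RDD backed by just the selected parent partitions; the
    // others are never computed.
    val pruned = PartitionPruningRDD.create(parent, wanted.contains)

    pruned.collect().foreach(println)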

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
Hi Samya, When submitting an application with spark-submit, the cores per executor can be set with --executor-cores, meaning you can run that many tasks per executor concurrently. The page below has some more details on submitting applications:
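For concreteness, a sketch of the equivalent programmatic setting; the app name and core count are placeholders, and this assumes the conf is built before the SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to passing --executor-cores 4 to spark-submit: each
    // executor can then run up to 4 tasks concurrently.
    val conf = new SparkConf()
      .setAppName("example-app")         // placeholder name
      .set("spark.executor.cores", "4")  // concurrent tasks per executor
    val sc = new SparkContext(conf)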

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
, Sam From: Jem Tucker [mailto:jem.tuc...@gmail.com] Sent: Wednesday, August 26, 2015 2:26 PM To: Samya MAITI samya.ma...@amadeus.com; user@spark.apache.org Subject: Re: Relation between threads and executor core Hi Samya, When submitting an application with spark-submit

Re: Spark on YARN

2015-08-10 Thread Jem Tucker
is getting run since another user's max vcore limit is not reached. On Sat, Aug 8, 2015 at 10:07 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi Dustin, Yes there are enough resources available; the same application run with a different user works fine, so I think it is something to do with permissions

Re: Spark on YARN

2015-08-08 Thread Jem Tucker
at 1:48 AM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new user to own and run a testing environment; however, when using this user, applications I submit to YARN never begin to run, even though they are the exact same application

Re: Spark on YARN

2015-08-08 Thread Jem Tucker
of the RM web UI, do you see any available resources to spawn the application master container? On Sat, Aug 8, 2015 at 4:37 AM, Jem Tucker jem.tuc...@gmail.com wrote: Hi Sandy, The application doesn't fail, it gets accepted by yarn but the application master never starts and the application

Spark on YARN

2015-08-07 Thread Jem Tucker
Hi, I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new user to own and run a testing environment; however, when using this user, applications I submit to YARN never begin to run, even though they are the exact same applications that succeed with another user. Has anyone seen

Unread block data error

2015-07-17 Thread Jem Tucker
Hi, I have been running a batch of data through my application for the last couple of days and this morning discovered it had fallen over with the following error. java.lang.IllegalStateException: unread block data at

Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
Hello, I have been using IndexedRDD as a large lookup table (1 billion records) to join with small tables (1 million rows). The performance of IndexedRDD is great until it has to be persisted on disk. Are there any alternatives to IndexedRDD, or any changes to how I use it, to improve performance with
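For context, a minimal sketch of the usage pattern being described, assuming the amplab spark-indexedrdd package with Long keys and a live SparkContext sc; the names and sizes are illustrative:

    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

    // Large lookup table keyed by Long ids (size is illustrative).
    val pairs = sc.parallelize((1L to 1000000L).map(id => (id, s"value-$id")))

    // Hash-partition and index the entries, keeping them in memory.
    val indexed = IndexedRDD(pairs).cache()

    // Point lookup without scanning the whole RDD.
    val hit = indexed.get(42L)  // Option[String]

    // IndexedRDD is still an RDD[(Long, String)], so a plain join with a
    // smaller table works as usual.
    val small = sc.parallelize(Seq((42L, "x"), (99L, "y")))
    val joined = indexed.join(small)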

Re: Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
to install it separately. On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker jem.tuc...@gmail.com wrote: Hi Vetle, IndexedRDD is persisted in the same way ordinary RDDs are, as far as I am aware. Do you know whether Cassandra can be embedded in my application, or whether it has to be a standalone database which is installed

Re: Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
some time in any case. Regards, Vetle On Thu, Jul 16, 2015 at 10:02 AM Jem Tucker jem.tuc...@gmail.com wrote: Hello, I have been using IndexedRDD as a large lookup table (1 billion records) to join with small tables (1 million rows). The performance of IndexedRDD is great until it has

Re: creating a distributed index

2015-07-15 Thread Jem Tucker
With regard to indexed structures in Spark, are there any alternatives to IndexedRDD for more generic keys, including Strings? Thanks Jem On Wed, Jul 15, 2015 at 7:41 AM Burak Yavuz brk...@gmail.com wrote: Hi Swetha, IndexedRDD is available as a package on Spark Packages

Re: creating a distributed index

2015-07-15 Thread Jem Tucker
AM, Jem Tucker jem.tuc...@gmail.com wrote: With regard to indexed structures in Spark, are there any alternatives to IndexedRDD for more generic keys, including Strings? Thanks Jem

Spark Parallelism

2015-07-13 Thread Jem Tucker
Hi All, We have recently begun performance testing our Spark application and have found that changing the default parallelism has a much larger effect on performance than expected, meaning there seems to be an elusive sweet spot that depends on the input size. Does anyone have any idea of a
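To make the knob concrete, a sketch of setting it programmatically; the cluster sizing and the 2-3 partitions-per-core starting heuristic are assumptions, not findings from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative cluster: 10 executors x 4 cores = 40 task slots.
    // A common starting point is 2-3 partitions per slot, then tune.
    val conf = new SparkConf()
      .setAppName("parallelism-test")  // placeholder name
      .set("spark.default.parallelism", "120")
    val sc = new SparkContext(conf)

    // Individual shuffles can also be sized explicitly, e.g.
    // rdd.reduceByKey(_ + _, numPartitions = 120)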

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, We have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console().readPassword; however, when running with spark

Accessing the console from spark

2015-07-03 Thread Jem Tucker
Hi, We have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console().readPassword; however, when running with Spark, System.console() returns null? Any ideas on how to get the console from Spark? Thanks, Jem
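A minimal sketch of the failure mode and a guarded workaround, assuming the driver JVM has no TTY attached; the stdin fallback is illustrative and, unlike readPassword, echoes what is typed:

    // System.console() returns null when the JVM has no interactive
    // console, which is what happens here under spark-submit.
    val password: Array[Char] = Option(System.console()) match {
      case Some(console) =>
        console.readPassword("password: ")
      case None =>
        // Illustrative fallback: read from stdin (echoes the input).
        scala.io.StdIn.readLine("password: ").toCharArray
    }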

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
val pass = console.readPassword("password: ") thanks, Jem On Fri, Jul 3, 2015 at 11:04 AM Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Something is missing Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote: In the driver when

Re: Making Unpersist Lazy

2015-07-02 Thread Jem Tucker
, 2015 at 7:48 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, The current behavior of rdd.unpersist() appears not to be lazy, so it must be placed after an action. Is there any way to emulate lazy execution of this function so it is added to the task queue? Thanks, Jem

Re: import errors with Eclipse Scala

2015-07-01 Thread Jem Tucker
in Eclipse you can just add the Spark assembly jar to the build path: right-click the project > Build Path > Configure Build Path > Libraries > Add External JARs. On Wed, Jul 1, 2015 at 7:15 PM Stefan Panayotov spanayo...@msn.com wrote: Hi Ted, How can I import the relevant Spark projects into

Making Unpersist Lazy

2015-07-01 Thread Jem Tucker
Hi, The current behavior of rdd.unpersist() appears not to be lazy, so it must be placed after an action. Is there any way to emulate lazy execution of this function so it is added to the task queue? Thanks, Jem
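As a sketch of the pattern this implies (the RDD and transform names are placeholders): persist, run the action that needs the cache, then unpersist, optionally non-blocking:

    val cached = rawRdd.map(expensiveTransform).persist()  // placeholder names

    // The action that actually consumes the cached data.
    val result = cached.reduce(_ + _)

    // unpersist is eager, so it must follow the action; blocking = false
    // returns immediately instead of waiting for blocks to be dropped.
    cached.unpersist(blocking = false)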

FileInputDStream missing files

2015-01-14 Thread Jem Tucker
Hi all, A small number of the files being moved into my landing directory are not being seen by my fileStream receiver. After looking at the code it seems that, in the case of long batches (> 1 minute), if files are created before a batch finishes, but only become visible after that batch finished

Re: IndexedRDD

2015-01-13 Thread Jem Tucker
time in scaling on the big table doesn't seem that surprising to me. What were you expecting? I assume you're doing normalRDD.join(indexedRDD). If you were to replace the indexedRDD with a normal RDD, what times do you get? On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker jem.tuc...@gmail.com

IndexedRDD

2015-01-13 Thread Jem Tucker
Hi, I have been playing around with IndexedRDD ( https://issues.apache.org/jira/browse/SPARK-2365, https://github.com/amplab/spark-indexedrdd) and have been very impressed with its performance. Some performance testing has revealed worse than expected scaling of the join performance*, and I
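A hedged sketch of the kind of measurement being discussed, timing a plain RDD join against the indexed table; the harness and the use of count to force evaluation are illustrative assumptions:

    import org.apache.spark.rdd.RDD

    def timeJoinMillis(big: RDD[(Long, String)],
                       small: RDD[(Long, String)]): Long = {
      val start = System.nanoTime()
      big.join(small).count()  // force the join to actually run
      (System.nanoTime() - start) / 1000000
    }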