Hi Yifan,
I think this is a result of Kryo trying to serialize something too large.
Have you tried to increase your partitioning?
Cheers,
Jem
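For illustration, a minimal sketch of what increasing the partitioning could
look like (the target partition count is only a placeholder, not a
recommendation from this thread):

// More, smaller partitions mean smaller blocks for Kryo to serialize per task.
val repartitioned = sRdd.repartition(2000) // 2000 is an arbitrary example value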
On Fri, Oct 23, 2015 at 11:24 AM Yifan LI wrote:
> Hi,
>
> I have a big sorted RDD sRdd (~962 million elements), and need to scan
> On Tue, Sep 1, 2015 at 10:42 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> You can take the sortByKey as example:
>> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642
>>
>> On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker <jem.tuc...@gmail.
Hi,
You just need to extend Partitioner and override the numPartitions and
getPartition methods, see below
import org.apache.spark.Partitioner

class MyPartitioner extends Partitioner {
  override def numPartitions: Int = ??? // return the number of partitions
  override def getPartition(key: Any): Int = ??? // return the partition for a given key
}
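A hedged usage sketch (pairRdd stands in for any existing key/value RDD):

// partitionBy is available on pair (key/value) RDDs.
val partitioned = pairRdd.partitionBy(new MyPartitioner)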
On Tue,
tom partitioner like
> range partitioner.
>
> On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker <jem.tuc...@gmail.com> wrote:
>
>> Hi,
>>
>> You just need to extend Partitioner and override the numPartitions and
>> getPartition methods, see below
>>
>>
com> wrote:
> Hi
>
> I think a range partitioner is not available in PySpark, so if we want to
> create one, how should we do that? That is my question.
>
> On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker <jem.tuc...@gmail.com> wrote:
>
>> Ah sorry I misread your ques
} else {
iter.hasNext
}
}
override def next(): Int = iter.next()
}
}
}.collect().foreach(println)
On Fri, Aug 28, 2015 at 12:33 PM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi,
I am trying to create an RDD from a selected number of its parent's
partitions. My
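For what it's worth, a minimal sketch of one way to build such a subset with
mapPartitionsWithIndex (parentRdd and the index set are placeholders, not the
code from this thread):

// Keep only the data from the chosen parent partitions; everything else
// becomes an empty iterator.
val wantedPartitions = Set(0, 5, 7)
val subset = parentRdd.mapPartitionsWithIndex { (idx, iter) =>
  if (wantedPartitions.contains(idx)) iter else Iterator.empty
}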
Hi Samya,
When submitting an application with spark-submit, the cores per executor can
be set with --executor-cores, meaning you can run that many tasks per
executor concurrently. The page below has some more details on submitting
applications:
,
Sam
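As an illustration, a spark-submit call along those lines might look like the
sketch below (the class name, jar, and resource counts are placeholders, not
values from this thread):

spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --class com.example.MyApp \
  my-application.jar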
*From:* Jem Tucker [mailto:jem.tuc...@gmail.com]
*Sent:* Wednesday, August 26, 2015 2:26 PM
*To:* Samya MAITI samya.ma...@amadeus.com; user@spark.apache.org
*Subject:* Re: Relation between threads and executor core
Hi Samya,
When submitting an application with spark-submit
is getting run since another user's max
vcore limit is not reached.
On Sat, Aug 8, 2015 at 10:07 PM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi Dustin,
Yes, there are enough resources available; the same application run with a
different user works fine, so I think it is something to do with permissions
at 1:48 AM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi,
I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new
user to own and run a testing environment; however, when using this user,
applications I submit to YARN never begin to run, even if they are the
exact same application
of the RM web UI, do you see any available resources to spawn
the application master container?
On Sat, Aug 8, 2015 at 4:37 AM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi Sandy,
The application doesn't fail; it gets accepted by YARN, but the
application master never starts and the application
Hi,
I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new user
to own and run a testing environment; however, when using this user,
applications I submit to YARN never begin to run, even if they are the
exact same application that is successful with another user.
Has anyone seen
Hi,
I have been running a batch of data through my application for the last
couple of days and this morning discovered it had fallen over with the
following error.
java.lang.IllegalStateException: unread block data
at
Hello,
I have been using IndexedRDD as a large lookup (1 billion records) to join
with small tables (1 million rows). The performance of IndexedRDD is great
until it has to be persisted on disk. Are there any alternatives to
IndexedRDD or any changes to how I use it to improve performance with
to install it separately.
On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker jem.tuc...@gmail.com wrote:
Hi Vetle,
IndexedRDD is persisted in the same way RDDs are, as far as I am aware.
Do you know whether Cassandra can be built into my application or has to be a
standalone database which is installed
some
time in any case.
Regards,
Vetle
On Thu, Jul 16, 2015 at 10:02 AM Jem Tucker jem.tuc...@gmail.com wrote:
Hello,
I have been using IndexedRDD as a large lookup (1 billion records) to
join with small tables (1 million rows). The performance of IndexedRDD is
great until it has
With regards to indexed structures in Spark, are there any alternatives to
IndexedRDD for more generic keys, including Strings?
Thanks
Jem
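One generic pattern for this kind of lookup join, sketched here purely as an
illustration (bigRdd, smallTable, and SmallRecord are placeholders, not
something from this thread): broadcast the small table and look keys up on the
map side, which works for any key type, including String.

// Broadcast the small table (it fits in memory) to every executor.
val lookup = sc.broadcast(smallTable) // smallTable: Map[String, SmallRecord]

// Map-side join against the large RDD; the big data set is never shuffled.
val joined = bigRdd.flatMap { case (key, bigValue) =>
  lookup.value.get(key).map(smallValue => (key, (bigValue, smallValue)))
}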
On Wed, Jul 15, 2015 at 7:41 AM Burak Yavuz brk...@gmail.com wrote:
Hi Swetha,
IndexedRDD is available as a package on Spark Packages
AM, Jem Tucker jem.tuc...@gmail.com wrote:
With regards to indexed structures in Spark, are there any alternatives to
IndexedRDD for more generic keys, including Strings?
Thanks
Jem
Hi All,
We have recently begun performance testing our Spark application and have
found that changing the default parallelism has a much larger effect on the
performance than expected, meaning there seems to be an elusive sweet spot
that depends on the input size.
Does anyone have any idea of a
Regards
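For reference, a sketch of how the default parallelism knob discussed above is
typically set (the application name and the value are placeholders to sweep,
not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("parallelism-sweep")
  .set("spark.default.parallelism", "200") // varied per input size to find the sweet spot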
On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi,
We have an application that requires a username/password to be entered
from the command line. To screen a password in Java you need to use
System.console().readPassword; however, when running with Spark
Hi,
We have an application that requires a username/password to be entered from
the command line. To screen a password in Java you need to use
System.console().readPassword; however, when running with Spark,
System.console() returns null. Any ideas on how to get the console from Spark?
Thanks,
Jem
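One common workaround, sketched here under the assumption that stdin is still
attached to the driver when System.console() returns null (not something
confirmed in this thread):

import java.io.{BufferedReader, InputStreamReader}

val password: Array[Char] = Option(System.console()) match {
  case Some(c) =>
    c.readPassword("password: ") // masks the typed input
  case None =>
    // No console attached: fall back to stdin (input will be echoed).
    print("password: ")
    new BufferedReader(new InputStreamReader(System.in)).readLine().toCharArray
}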
val pass = console.readPassword("password: ")
thanks,
Jem
On Fri, Jul 3, 2015 at 11:04 AM Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you paste the code? Something is missing
Thanks
Best Regards
On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote:
In the driver when
, 2015 at 7:48 PM, Jem Tucker jem.tuc...@gmail.com wrote:
Hi,
rdd.unpersist() does not appear to be executed lazily and therefore must be
placed after an action. Is there any way to emulate
lazy execution of this function so it is added to the task queue?
Thanks,
Jem
In Eclipse you can just add the Spark assembly jar to the build path:
right click the project > Build Path > Configure Build Path > Libraries >
Add External JARs.
On Wed, Jul 1, 2015 at 7:15 PM Stefan Panayotov spanayo...@msn.com wrote:
Hi Ted,
How can I import the relevant Spark projects into
Hi,
rdd.unpersist() does not appear to be executed lazily and therefore must be
placed after an action. Is there any way to emulate
lazy execution of this function so it is added to the task queue?
Thanks,
Jem
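For example, the pattern described above looks like this (input and
expensiveTransform are placeholders):

val cached = input.map(expensiveTransform).cache()
val result = cached.count() // the action materializes the cached data
cached.unpersist()          // unpersist runs eagerly, so it goes after the action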
Hi all,
A small number of the files being moved into my landing directory are not
being seen by my fileStream receiver. After looking at the code it seems
that, in the case of long batches (> 1 minute), if files are created before
a batch finishes, but only become visible after that batch finished
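For context, a sketch of the kind of setup being described, assuming an
existing SparkContext sc (the landing directory and batch interval are
placeholders):

import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(2)) // batches longer than one minute
val files = ssc.textFileStream("/data/landing") // hypothetical landing directory
files.count().print()
ssc.start()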
time in scaling on the big table doesn't seem that surprising to
me. What were you expecting?
I assume you're doing normalRDD.join(indexedRDD). If you were to replace
the indexedRDD with a normal RDD, what times do you get?
On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker jem.tuc...@gmail.com
Hi,
I have been playing around with the IndexedRDD (
https://issues.apache.org/jira/browse/SPARK-2365,
https://github.com/amplab/spark-indexedrdd) and have been very impressed
with its performance. Some performance testing has revealed worse than
expected scaling of the join performance*, and I