Re: Spark on Mesos 0.20
Hi Gurvinder,

Is there a SPARK ticket tracking the issue you describe?

On Mon, Oct 6, 2014 at 2:44 AM, Gurvinder Singh wrote:

> On 10/06/2014 08:19 AM, Fairiz Azizi wrote:
>
> > The Spark online docs indicate that Spark is compatible with Mesos 0.18.1.
> > I've gotten it to work just fine on 0.18.1 and 0.18.2.
> > Has anyone tried Spark on a newer version of Mesos, i.e. Mesos v0.20.0?
> >
> > -Fi
>
> Yeah, we are using Spark 1.1.0 with Mesos 0.20.1. It runs fine in coarse-grained mode;
> in fine-grained mode there is an issue with block manager name conflicts. I have been
> waiting for it to be fixed, but it is still there.
>
> -Gurvinder
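Since both posters report that coarse-grained mode works on Mesos 0.20.x while fine-grained mode hits the block manager name conflict, here is a minimal sketch of forcing coarse-grained mode as a workaround; the Mesos master URL and resource values are placeholders, not taken from this thread.

    import org.apache.spark.{SparkConf, SparkContext}

    // Force coarse-grained Mesos mode as a workaround for the fine-grained
    // block manager conflict reported above.
    val conf = new SparkConf()
      .setAppName("spark-on-mesos")
      .setMaster("mesos://mesos-master:5050")   // placeholder Mesos master URL
      .set("spark.mesos.coarse", "true")        // coarse-grained mode
      .set("spark.cores.max", "8")              // illustrative core cap for coarse mode
    val sc = new SparkContext(conf)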
Re: Spark on Mesos 0.20
On 10/06/2014 08:19 AM, Fairiz Azizi wrote:

> The Spark online docs indicate that Spark is compatible with Mesos 0.18.1.
> I've gotten it to work just fine on 0.18.1 and 0.18.2.
> Has anyone tried Spark on a newer version of Mesos, i.e. Mesos v0.20.0?
>
> -Fi

Yeah, we are using Spark 1.1.0 with Mesos 0.20.1. It runs fine in coarse-grained mode; in fine-grained mode there is an issue with block manager name conflicts. I have been waiting for it to be fixed, but it is still there.

-Gurvinder
Spark on Mesos 0.20
The Spark online docs indicate that Spark is compatible with Mesos 0.18.1.

I've gotten it to work just fine on 0.18.1 and 0.18.2.

Has anyone tried Spark on a newer version of Mesos, i.e. Mesos v0.20.0?

-Fi
Spark SQL on a large Hive table produces strange output on version 1.0.2
Dear Developers,

I'm currently limited to using Spark 1.0.2. I use Spark SQL on a Hive table to load the AMPLab benchmark, which is approximately 25.6 GiB.

I run:

CREATE EXTERNAL TABLE uservisits (sourceIP STRING, destURL STRING, visitDate STRING, adRevenue DOUBLE, userAgent STRING, countryCode STRING, languageCode STRING, searchWord STRING, duration INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\001" STORED AS SEQUENCEFILE LOCATION "/public//data/uservisits"

OK!

I run:

SELECT COUNT(*) FROM uservisits

OK! The result is correct. But when I run:

SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)

there are some error messages (I have emphasized some of the important ones), mainly two problems:

akka => Timed out
GC => Out of memory

What should I do?

...
14/10/05 23:45:18 INFO MemoryStore: Block broadcast_2 of size 158188 dropped from memory (free 308752285)
14/10/05 23:45:40 ERROR BlockManagerMaster: Failed to remove shuffle 4
akka.pattern.AskTimeoutException: Timed out
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
    at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
    at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
    at java.lang.Thread.run(Thread.java:745)
14/10/05 23:45:47 ERROR BlockManagerMaster: Failed to remove shuffle 0
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace as above]
14/10/05 23:45:46 ERROR BlockManagerMaster: Failed to remove shuffle 2
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace as above]
14/10/05 23:45:45 ERROR BlockManagerMaster: Failed to remove shuffle 1
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace as above]
14/10/05 23:45:40 ERROR BlockManagerMaster: Failed to remove shuffle 3
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace, truncated in the original message]
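A hedged sketch of one common mitigation (not suggested in this thread): raise the number of SQL shuffle partitions so each reduce task holds less data in memory, and lengthen the Akka ask timeout. The memory, timeout, and partition values below are illustrative guesses, not recommendations from the list.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf()
      .setAppName("uservisits-aggregation")
      .set("spark.executor.memory", "4g")      // illustrative value
      .set("spark.akka.askTimeout", "120")     // seconds; default is 30
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // More shuffle partitions => less data per reduce task for the GROUP BY.
    hiveContext.hql("SET spark.sql.shuffle.partitions=400")
    val result = hiveContext.hql(
      "SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits " +
      "GROUP BY SUBSTR(sourceIP, 1, 8)")
    result.collect().foreach(println)   // small result: one row per 8-character prefix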
Re: SPARK-3660 : Initial RDD for updateStateByKey transformation
Hello,

I have submitted a pull request ("Adding support of initial value for state update", #2665); please review and let me know. Excited to submit my first pull request.

-Soumitra.

----- Original Message -----
From: "Soumitra Kumar"
To: dev@spark.apache.org
Sent: Tuesday, September 23, 2014 1:28:21 PM
Subject: SPARK-3660 : Initial RDD for updateStateByKey transformation

Hello fellow developers,

Thanks TD for the relevant pointers. I have created an issue: https://issues.apache.org/jira/browse/SPARK-3660

Copying the description from JIRA:

"How do I initialize the state for the updateStateByKey transformation? I have word counts from a previous spark-submit run, and want to load them in the next spark-submit job to continue counting from there.

One proposal is to add the following argument to the updateStateByKey methods:

initial: Option[RDD[(K, S)]] = None

This will maintain backward compatibility as well. I have working code as well.

This thread started on the spark-user list at: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html"

Please let me know whether I should add a parameter "initial: Option[RDD[(K, S)]] = None" to all updateStateByKey methods or create new ones.

Thanks,
-Soumitra.
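A sketch of how the proposed overload might be used. The `initial` parameter shown in the final comment is the proposal under review in SPARK-3660 / PR #2665, not an existing Spark API; the paths, input source, and update function are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val conf = new SparkConf().setAppName("stateful-word-count")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints")   // placeholder path

    // Word counts saved by a previous spark-submit run (placeholder path and format).
    val previousCounts: RDD[(String, Int)] =
      ssc.sparkContext.objectFile[(String, Int)]("hdfs:///counts/previous")

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))

    // Today this starts counting from zero:
    val counts = words.map(w => (w, 1)).updateStateByKey[Int](updateFunc)

    // With the proposed `initial: Option[RDD[(K, S)]] = None` parameter, the same
    // call could be seeded with the saved state, e.g.:
    //   words.map(w => (w, 1)).updateStateByKey[Int](updateFunc, initial = Some(previousCounts))
    counts.print()

    ssc.start()
    ssc.awaitTermination()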
Hyper Parameter Tuning Algorithms
Found this thread from April:

http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccabjxkq6b7sfaxie4+aqtcmd8jsqbznsxsfw6v5o0wwwouob...@mail.gmail.com%3E

Wondering what the status of this is. We are thinking about implementing these algorithms; it would be a waste if they are already available. Please advise.

Thanks,
Lochana
Re: Parquet schema migrations
Hi Cody,

Assuming you are talking about 'safe' changes to the schema (i.e. existing column names are never reused with incompatible types), this is something I'd love to support. Perhaps you can describe more what sorts of changes you are making, and if simple merging of the schemas would be sufficient. If so, we can open a JIRA, though I'm not sure when we'll have resources to dedicate to this.

In the near term, I'd suggest writing converters for each version of the schema that translate to some desired master schema. You can then union all of these together and avoid the cost of batch conversion. It seems like in most cases this should be pretty efficient, at least now that we have good pushdown past union operators :)

Michael

On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash wrote:

> Hi Cody,
>
> I wasn't aware there were different versions of the parquet format. What's the difference between "raw parquet" and the Hive-written parquet files?
>
> As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example does both -- when upgrading between Cassandra versions that change the on-disk sstable format, it will do a convert-on-read as you access the sstables, or you can run the upgradesstables command to convert them all at once post-upgrade.
>
> Andrew
>
> On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger wrote:
>
> > Wondering if anyone has thoughts on a path forward for parquet schema migrations, especially for people (like us) that are using raw parquet files rather than Hive.
> >
> > So far we've gotten away with reading old files, converting, and writing to new directories, but that obviously becomes problematic above a certain data size.
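A rough sketch of the "converter per schema version, then union" approach Michael describes, using the SchemaRDD API of that era; the paths, column names, and the placeholder value for the missing column are illustrative assumptions, not from the thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-schema-union"))
    val sqlContext = new SQLContext(sc)

    // Each directory holds raw parquet files written with a different historical schema.
    val v1 = sqlContext.parquetFile("hdfs:///events/v1")   // e.g. (id, name)
    val v2 = sqlContext.parquetFile("hdfs:///events/v2")   // e.g. (id, name, country)
    v1.registerTempTable("events_v1")
    v2.registerTempTable("events_v2")

    // Converters: translate each version to the desired master schema.
    val v1AsMaster = sqlContext.sql(
      "SELECT id, name, 'unknown' AS country FROM events_v1")
    val v2AsMaster = sqlContext.sql(
      "SELECT id, name, country FROM events_v2")

    // Query the union instead of rewriting the old files on disk.
    val events = v1AsMaster.unionAll(v2AsMaster)
    events.registerTempTable("events")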
Re: Jython importing pyspark?
PySpark doesn't attempt to support Jython at present. IMO while it might be a bit faster, it would lose a lot of the benefits of Python, which are the very strong data processing libraries (NumPy, SciPy, Pandas, etc). So I'm not sure it's worth supporting unless someone demonstrates a really major performance benefit.

There was actually a recent patch to add PyPy support (https://github.com/apache/spark/pull/2144), which is worth a try if you want Python applications to run faster. It might actually be faster overall than Jython.

Matei

On Oct 5, 2014, at 10:16 AM, Robert C Senkbeil wrote:

> Hi there,
>
> I wanted to ask whether or not anyone has successfully used Jython with the pyspark library. I wasn't sure if the C extension support was needed for pyspark itself or was just a bonus of using Cython.
>
> There was a claim (http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html#a7269) that using Jython would be better - if you didn't need C extension support - because the cost of serialization is lower. However, I have not been able to import pyspark into a Jython session. I'm using version 2.7b3 of Jython and version 1.1.0 of Spark for reference.
>
> Jython 2.7b3 (default:e81256215fb0, Aug 4 2014, 02:39:51)
> [Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext, SparkConf
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyspark/__init__.py", line 63, in <module>
>   File "pyspark/context.py", line 25, in <module>
>   File "pyspark/accumulators.py", line 94, in <module>
>   File "pyspark/serializers.py", line 341, in <module>
>   File "pyspark/serializers.py", line 328, in _hijack_namedtuple
> RuntimeError: maximum recursion depth exceeded (Java StackOverflowError)
>
> Is there something I am missing with this? Did Jython ever work for pyspark? The same error happens regardless of whether I use the Python files or compile them down to Java class files using Jython first.
>
> I know that previous documentation (0.9.1) indicated, "PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython."
>
> In later versions, it now reflects, "Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, so C libraries like NumPy can be used."
>
> I'm assuming this means that attempts to use other interpreters failed. If so, are there any plans to support something like Jython in the future?
>
> Signed,
> Chip Senkbeil
Re: Parquet schema migrations
Hi Cody,

I wasn't aware there were different versions of the parquet format. What's the difference between "raw parquet" and the Hive-written parquet files?

As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example does both -- when upgrading between Cassandra versions that change the on-disk sstable format, it will do a convert-on-read as you access the sstables, or you can run the upgradesstables command to convert them all at once post-upgrade.

Andrew

On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger wrote:

> Wondering if anyone has thoughts on a path forward for parquet schema migrations, especially for people (like us) that are using raw parquet files rather than Hive.
>
> So far we've gotten away with reading old files, converting, and writing to new directories, but that obviously becomes problematic above a certain data size.
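A short sketch of the "convert all at once" path for raw parquet files — the read-old, convert, write-new approach Cody describes and Andrew compares to upgradesstables. Paths, the target schema, and the default value for the new column are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-batch-migration"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

    // Target (master) schema for the migrated files.
    case class EventV2(id: Long, name: String, country: String)

    // Read files written with the old schema, convert each row, and write a
    // new directory; the old directory can be retired afterwards.
    val oldEvents = sqlContext.parquetFile("hdfs:///events/v1")   // old schema: (id, name)
    val migrated = oldEvents.map(r => EventV2(r.getLong(0), r.getString(1), "unknown"))
    migrated.saveAsParquetFile("hdfs:///events/v1-migrated")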
Re: Impact of input format on timing
Hi Tom,

HDFS and Spark don't actually have a minimum block size -- so in that first dataset, the files won't each be costing you 64 MB. However, the main reason for the difference in performance here is probably the number of RDD partitions. In the first case, Spark will create an RDD with one partition per file, while in the second case it will likely have only 1-2 partitions. The number of partitions affects the level of parallelism of operations like reduceByKey (by default, reduceByKey uses as many partitions as the parent RDD it runs on), and in this case, I think it's causing reduceByKey to spill to disk within tasks in the second case and not in the first case, because each task has more input data.

You can see the number of partitions in each stage of your computation on the application web UI at http://<driver>:4040 if you want to confirm this. Also, for both programs, you can manually set the number of partitions for the reduce by passing a second argument to reduceByKey (e.g. reduceByKey(myFunc, 100)). In general it's best to choose this so that the input data for each reduce task fits in memory to avoid spilling.

Matei

On Oct 5, 2014, at 1:58 PM, Tom Hubregtsen wrote:

> Hi,
>
> I ran the same version of a program with two different types of input containing equivalent information.
> Program 1: 10,000 files with on average 50 IDs, one per line
> Program 2: 1 file containing 10,000 lines, with on average 50 IDs per line
>
> My program takes the input, creates key/value pairs from it, and performs about 7 more steps. I have compared the sets of key/value pairs after the initialization phase, and they are equivalent. All other steps are equal. The only difference is using wholeTextFile versus textFile and the initialization input map function itself.
>
> Since I am reading in from HDFS, and the minimum part size there is 64 MB, every file in program 1 will take 64 MB, even though they are only KBs originally. Because of this, I expected program 2 to be faster, or equivalent, as the initialization phase may be negligible.
>
> In my first comparison, I provided the program with sufficient memory, and they both take around 2 minutes. No surprises here.
>
> In my second comparison, I limit the memory, in this case to 4 GB. Program 1 executes in a little over 4 minutes, but program 2 takes over 15 minutes (after which I terminated the program, as I could see it was getting there, but spilling massively in every stage). The difference between the two runs is the amount of spilling in phases later on in the program (*not* in the initialization phase). Program 1 spills 2 chunks per stage:
>
> 14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 327 MB to disk (1 time so far)
> 14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 242 spilling in-memory map of 328 MB to disk (1 time so far)
>
> Program 2 also spills these 2 chunks:
>
> 14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 320 MB to disk (1 time so far)
> 14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 323 MB to disk (1 time so far)
>
> But then spills 2 * ~15,000 chunks of <1 MB:
>
> 14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (15561 times so far)
> 14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (13866 times so far)
>
> I understand that RDDs with narrow dependencies to their parents are pipelined, and form a stage. These all point to the parent RDD. This parent RDD will have different partitioning and different data placement between the two programs. Because of this, I did expect a difference in this stage, but as we change from JavaRDD to JavaPairRDD and do a reduceByKey after the initialization, I expected this difference to be gone in the next stages. Can anyone explain this behavior, or point me in a direction?
>
> Thanks in advance!
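A minimal sketch of Matei's suggestion in context: pass an explicit partition count as the second argument to reduceByKey so each reduce task's input fits in memory. The path, the parsing, and the value 100 are illustrative, not taken from the thread.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("id-count"))

    // One big input file yields only a few map partitions, so explicitly ask
    // for more reduce partitions to keep each reduce task's data in memory.
    val counts = sc.textFile("hdfs:///input/ids.txt")
      .flatMap(_.split("\\s+"))
      .map(id => (id, 1))
      .reduceByKey(_ + _, 100)   // second argument = number of reduce partitions

    counts.saveAsTextFile("hdfs:///output/id-counts")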
Impact of input format on timing
Hi,

I ran the same version of a program with two different types of input containing equivalent information.
Program 1: 10,000 files with on average 50 IDs, one per line
Program 2: 1 file containing 10,000 lines, with on average 50 IDs per line

My program takes the input, creates key/value pairs from it, and performs about 7 more steps. I have compared the sets of key/value pairs after the initialization phase, and they are equivalent. All other steps are equal. The only difference is using wholeTextFile versus textFile and the initialization input map function itself.

Since I am reading in from HDFS, and the minimum part size there is 64 MB, every file in program 1 will take 64 MB, even though they are only KBs originally. Because of this, I expected program 2 to be faster, or equivalent, as the initialization phase may be negligible.

In my first comparison, I provided the program with sufficient memory, and they both take around 2 minutes. No surprises here.

In my second comparison, I limit the memory, in this case to 4 GB. Program 1 executes in a little over 4 minutes, but program 2 takes over 15 minutes (after which I terminated the program, as I could see it was getting there, but spilling massively in every stage). The difference between the two runs is the amount of spilling in phases later on in the program (*not* in the initialization phase). Program 1 spills 2 chunks per stage:

14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 327 MB to disk (1 time so far)
14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 242 spilling in-memory map of 328 MB to disk (1 time so far)

Program 2 also spills these 2 chunks:

14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 320 MB to disk (1 time so far)
14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 323 MB to disk (1 time so far)

But then spills 2 * ~15,000 chunks of <1 MB:

14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (15561 times so far)
14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (13866 times so far)

I understand that RDDs with narrow dependencies to their parents are pipelined, and form a stage. These all point to the parent RDD. This parent RDD will have different partitioning and different data placement between the two programs. Because of this, I did expect a difference in this stage, but as we change from JavaRDD to JavaPairRDD and do a reduceByKey after the initialization, I expected this difference to be gone in the next stages. Can anyone explain this behavior, or point me in a direction?

Thanks in advance!
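A hedged sketch (not the poster's code) of the two ingestion paths being compared, with an explicit repartition added for the single-file case; the paths, parsing, and partition count are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("input-format-comparison"))

    // Program 1: many small files; wholeTextFiles yields (path, contents) pairs,
    // roughly one partition per file, with one ID per line inside each file.
    val pairs1 = sc.wholeTextFiles("hdfs:///input/many-files")
      .flatMap { case (_, contents) => contents.split("\n") }
      .map(id => (id.trim, 1))

    // Program 2: one large file with ~50 IDs per line; textFile may give only a
    // couple of partitions, so repartition before the shuffle-heavy steps.
    val pairs2 = sc.textFile("hdfs:///input/one-big-file.txt")
      .repartition(100)            // illustrative partition count
      .flatMap(_.split("\\s+"))
      .map(id => (id, 1))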
Jython importing pyspark?
Hi there,

I wanted to ask whether or not anyone has successfully used Jython with the pyspark library. I wasn't sure if the C extension support was needed for pyspark itself or was just a bonus of using Cython.

There was a claim (http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html#a7269) that using Jython would be better - if you didn't need C extension support - because the cost of serialization is lower. However, I have not been able to import pyspark into a Jython session. I'm using version 2.7b3 of Jython and version 1.1.0 of Spark for reference.

Jython 2.7b3 (default:e81256215fb0, Aug 4 2014, 02:39:51)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark import SparkContext, SparkConf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyspark/__init__.py", line 63, in <module>
  File "pyspark/context.py", line 25, in <module>
  File "pyspark/accumulators.py", line 94, in <module>
  File "pyspark/serializers.py", line 341, in <module>
  File "pyspark/serializers.py", line 328, in _hijack_namedtuple
RuntimeError: maximum recursion depth exceeded (Java StackOverflowError)

Is there something I am missing with this? Did Jython ever work for pyspark? The same error happens regardless of whether I use the Python files or compile them down to Java class files using Jython first.

I know that previous documentation (0.9.1) indicated, "PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython."

In later versions, it now reflects, "Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, so C libraries like NumPy can be used."

I'm assuming this means that attempts to use other interpreters failed. If so, are there any plans to support something like Jython in the future?

Signed,
Chip Senkbeil