Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Josh Rosen
My current best guess is that Spark does *not* fully support Hadoop 3.x
because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive
shims for Hadoop 3.x) has not been resolved. There are also likely to be
transitive dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu  wrote:

> Yes, the spark download page does mention that 2.2.1 is for 'hadoop-2.7 and
> later', but my confusion is because spark was released on 1st Dec and the
> hadoop-3 stable version was released on 13th Dec. And to my similar question
> on stackoverflow.com, Mr. jacek-laskowski replied that spark-2.2.1 doesn't
> support hadoop-3, so I am just looking for more clarity on this doubt before
> moving on to upgrades.
>
> Thanks all for help.
>
> Akshay.
>
> On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao 
> wrote:
>
>> AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it
>> is not clear whether it is supported or not (or has some issues). I think
>> in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly
>> means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).
>>
>> Thanks
>> Jerry
>>
>> 2018-01-08 4:50 GMT+08:00 Raj Adyanthaya :
>>
>>> Hi Akshay
>>>
>>> On the Spark Download page when you select Spark 2.2.1 it gives you an
>>> option to select package type. In that, there is an option to select
>>> "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
>>> does support Hadoop 3.0.
>>>
>>> http://spark.apache.org/downloads.html
>>>
>>> Thanks,
>>> Raj A.
>>>
>>> On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
>>> wrote:
>>>
 hello Users,
 I need to know whether we can run latest spark on  latest hadoop
 version i.e., spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on
 13th dec.
 thanks.

>>>
>>>
>>
>


Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Josh Rosen
Spark SQL / Tungsten's explicitly-managed off-heap memory will be capped at
spark.memory.offHeap.size bytes. This is purposely specified as an absolute
size rather than as a percentage of the heap size in order to allow end
users to tune Spark so that its overall memory consumption stays within
container memory limits.

To use your example of a 3GB YARN container, you could configure Spark so
that its maximum heap size plus spark.memory.offHeap.size is smaller than
3GB (minus some overhead fudge-factor).
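
As a concrete sketch, with hypothetical sizes (the exact split depends on your
workload; the overhead setting below corresponds to the built-in ~10% allowance
discussed later in this thread):

    // Hypothetical sizing for a ~3GB YARN container: heap + off-heap + overhead must all fit.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "2g")                // JVM heap size (-Xmx)
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "512m")          // cap on Tungsten's explicit off-heap allocations
      .set("spark.yarn.executor.memoryOverhead", "512")  // MB; fudge-factor for other native memory

With that split, the heap (2GB), the explicitly-managed off-heap region (512MB)
and the overhead allowance (512MB) together add up to the 3GB container limit.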

On Thu, Sep 22, 2016 at 7:56 AM Sean Owen  wrote:

> It's looking at the whole process's memory usage, and doesn't care
> whether the memory is used by the heap or not within the JVM. Of
> course, allocating memory off-heap still counts against you at the OS
> level.
>
> On Thu, Sep 22, 2016 at 3:54 PM, Michael Segel
>  wrote:
> > Thanks for the response Sean.
> >
> > But how does YARN know about the off-heap memory usage?
> > That’s the piece that I’m missing.
> >
> > Thx again,
> >
> > -Mike
> >
> >> On Sep 21, 2016, at 10:09 PM, Sean Owen  wrote:
> >>
> >> No, Xmx only controls the maximum size of on-heap allocated memory.
> >> The JVM doesn't manage/limit off-heap (how could it? it doesn't know
> >> when it can be released).
> >>
> >> The answer is that YARN will kill the process because it's using more
> >> memory than it asked for. A JVM is always going to use a little
> >> off-heap memory by itself, so setting a max heap size of 2GB means the
> >> JVM process may use a bit more than 2GB of memory. With an off-heap
> >> intensive app like Spark it can be a lot more.
> >>
> >> There's a built-in 10% overhead, so that if you ask for a 3GB executor
> >> it will ask for 3.3GB from YARN. You can increase the overhead.
> >>
> >> On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke 
> wrote:
> >>> All off-heap memory is still managed by the JVM process. If you limit
> the
> >>> memory of this process then you limit the memory. I think the memory
> of the
> >>> JVM process could be limited via the xms/xmx parameter of the JVM.
> This can
> >>> be configured via spark options for yarn (be aware that they are
> different
> >>> in cluster and client mode), but i recommend to use the spark options
> for
> >>> the off heap maximum.
> >>>
> >>> https://spark.apache.org/docs/latest/running-on-yarn.html
> >>>
> >>>
> >>> On 21 Sep 2016, at 22:02, Michael Segel 
> wrote:
> >>>
> >>> I’ve asked this question a couple of times from a friend who
> didn’t know
> >>> the answer… so I thought I would try here.
> >>>
> >>>
> >>> Suppose we launch a job on a cluster (YARN) and we have set up the
> >>> containers to be 3GB in size.
> >>>
> >>>
> >>> What does that 3GB represent?
> >>>
> >>> I mean what happens if we end up using 2-3GB of off heap storage via
> >>> tungsten?
> >>> What will Spark do?
> >>> Will it try to honor the container’s limits and throw an exception
> or will
> >>> it allow my job to grab that amount of memory and exceed YARN’s
> >>> expectations since its off heap?
> >>>
> >>> Thx
> >>>
> >>> -Mike
> >>>
> >>>
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-10 Thread Josh Rosen
Based on Ben's helpful error description, I managed to reproduce this bug
and found the root cause:

There's a bug in MemoryStore's PartiallySerializedBlock class: it doesn't
close a serialization stream before attempting to deserialize its
serialized values, causing it to miss any data stored in the serializer's
internal buffers (which can happen with KryoSerializer, which was
automatically being used to serialize RDDs of byte arrays). I've reported
this as https://issues.apache.org/jira/browse/SPARK-17491 and have submitted
 https://github.com/apache/spark/pull/15043 to fix this (I'm still planning
to add more tests to that patch).

On Fri, Sep 9, 2016 at 10:37 AM Josh Rosen <joshro...@databricks.com> wrote:

> cache() / persist() is definitely *not* supposed to affect the result of
> a program, so the behavior that you're seeing is unexpected.
>
> I'll try to reproduce this myself by caching in PySpark under heavy memory
> pressure, but in the meantime the following questions will help me to debug:
>
>- Does this only happen in Spark 2.0? Have you successfully run the
>same workload with correct behavior on an earlier Spark version, such as
>1.6.x?
>- How accurately does your example code model the structure of your
>real code? Are you calling cache()/persist() on an RDD which has been
>transformed in Python or are you calling it on an untransformed input RDD
>(such as the RDD returned from sc.textFile() / sc.hadoopFile())?
>
>
> On Fri, Sep 9, 2016 at 5:01 AM Ben Leslie <be...@benno.id.au> wrote:
>
>> Hi,
>>
>> I'm trying to understand if there is any difference in correctness
>> between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
>> rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
>>
>> I can see that there may be differences in performance, but my
>> expectation was that using either would result in the same behaviour.
>> However that is not what I'm seeing in practice.
>>
>> Specifically I have some code like:
>>
>> text_lines = sc.textFile(input_files)
>> records = text_lines.map(json.loads)
>> records.persist(pyspark.StorageLevel.MEMORY_ONLY)
>> count = records.count()
>> records.unpersist()
>>
>> When I do not use persist at all the 'count' variable contains the
>> correct value.
>> When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
>> get the correct, expected value.
>> However, if I use persist with no argument (or
>> pyspark.StorageLevel.MEMORY_ONLY) then the value of 'count' is too
>> small.
>>
>> In all cases the script completes without errors (or warning as far as
>> I can tell).
>>
>> I'm using Spark 2.0.0 on an AWS EMR cluster.
>>
>> It appears that the executors may not have enough memory to store all
>> the RDD partitions in memory only, however I thought in this case it
>> would fall back to regenerating from the parent RDD, rather than
>> providing the wrong answer.
>>
>> Is this the expected behaviour? It seems a little difficult to work
>> with in practice.
>>
>> Cheers,
>>
>> Ben
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Josh Rosen
cache() / persist() is definitely *not* supposed to affect the result of a
program, so the behavior that you're seeing is unexpected.

I'll try to reproduce this myself by caching in PySpark under heavy memory
pressure, but in the meantime the following questions will help me to debug:

   - Does this only happen in Spark 2.0? Have you successfully run the same
   workload with correct behavior on an earlier Spark version, such as 1.6.x?
   - How accurately does your example code model the structure of your real
   code? Are you calling cache()/persist() on an RDD which has been
   transformed in Python or are you calling it on an untransformed input RDD
   (such as the RDD returned from sc.textFile() / sc.hadoopFile())?


On Fri, Sep 9, 2016 at 5:01 AM Ben Leslie  wrote:

> Hi,
>
> I'm trying to understand if there is any difference in correctness
> between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
> rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
>
> I can see that there may be differences in performance, but my
> expectation was that using either would result in the same behaviour.
> However that is not what I'm seeing in practice.
>
> Specifically I have some code like:
>
> text_lines = sc.textFile(input_files)
> records = text_lines.map(json.loads)
> records.persist(pyspark.StorageLevel.MEMORY_ONLY)
> count = records.count()
> records.unpersist()
>
> When I do not use persist at all the 'count' variable contains the
> correct value.
> When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
> get the correct, expected value.
> However, if I use persist with no argument (or
> pyspark.StorageLevel.MEMORY_ONLY) then the value of 'count' is too
> small.
>
> In all cases the script completes without errors (or warning as far as
> I can tell).
>
> I'm using Spark 2.0.0 on an AWS EMR cluster.
>
> It appears that the executors may not have enough memory to store all
> the RDD partitions in memory only, however I thought in this case it
> would fall back to regenerating from the parent RDD, rather than
> providing the wrong answer.
>
> Is this the expected behaviour? It seems a little difficult to work
> with in practice.
>
> Cheers,
>
> Ben
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: A number of issues when running spark-ec2

2016-04-16 Thread Josh Rosen
Using a different machine / toolchain, I've downloaded and re-uploaded all
of the 1.6.1 artifacts to that S3 bucket, so hopefully everything should be
working now. Let me know if you still encounter any problems with
unarchiving.

On Sat, Apr 16, 2016 at 3:10 PM Ted Yu  wrote:

> Pardon me - there is no tarball for hadoop 2.7
>
> I downloaded
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> and successfully expanded it.
>
> FYI
>
> On Sat, Apr 16, 2016 at 2:52 PM, Jon Gregg  wrote:
>
>> That link points to hadoop2.6.tgz.  I tried changing the URL to
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
>> and I get a NoSuchKey error.
>>
>> Should I just go with it even though it says hadoop2.6?
>>
>> On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu  wrote:
>>
>>> BTW this was the original thread:
>>>
>>> http://search-hadoop.com/m/q3RTt0Oxul0W6Ak
>>>
>>> The link for spark-1.6.1-bin-hadoop2.7 is
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
>>> 
>>>
>>> On Sat, Apr 16, 2016 at 2:14 PM, Ted Yu  wrote:
>>>
 From the output you posted:
 ---
 Unpacking Spark

 gzip: stdin: not in gzip format
 tar: Child returned status 1
 tar: Error is not recoverable: exiting now
 ---

 The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt.

 This problem has been reported in other threads.

 Try spark-1.6.1-bin-hadoop2.7 - the artifact should be good.

 On Sat, Apr 16, 2016 at 2:09 PM, YaoPau  wrote:

> I launched a cluster with: "./spark-ec2 --key-pair my_pem
> --identity-file
> ../../ssh/my_pem.pem launch jg_spark2" and I got the "Spark standalone
> cluster started at
> http://ec2-54-88-249-255.compute-1.amazonaws.com:8080
> and "Done!" success confirmations at the end.  I confirmed on EC2 that
> 1
> Master and 1 Slave were both launched and passed their status checks.
>
> But none of the Spark commands seem to work (spark-shell, pyspark,
> etc), and
> port 8080 isn't being used.  The full output from launching the
> cluster is
> below.  Any ideas what the issue is?
>
> >>
> launch
>
> jg_spark2/Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/plugin.py:40:
> PendingDeprecationWarning: the imp module is deprecated in favour of
> importlib; see the module's documentation for alternative uses
>   import imp
>
> /Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/provider.py:197:
> ResourceWarning: unclosed file <_io.TextIOWrapper
> name='/Users/jg/.aws/credentials' mode='r' encoding='UTF-8'>
>   self.shared_credentials.load_from_path(shared_path)
> Setting up security groups...
> Creating security group jg_spark2-master
> Creating security group jg_spark2-slaves
> Searching for existing cluster jg_spark2 in region us-east-1...
> Spark AMI: ami-5bb18832
> Launching instances...
> Launched 1 slave in us-east-1a, regid = r-e7d97944
> Launched master in us-east-1a, regid = r-d3d87870
> Waiting for AWS to propagate instance metadata...
> Waiting for cluster to enter 'ssh-ready' state
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-54-88-249-255.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: b'ssh: connect to host
> ec2-54-88-249-255.compute-1.amazonaws.com
> port 22: Connection refused'
>
>
> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
> ResourceWarning: unclosed  family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55580),
> raddr=('54.239.20.1', 443)>
>   self.queue.pop(0)
>
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-54-88-249-255.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: b'ssh: connect to host
> ec2-54-88-249-255.compute-1.amazonaws.com
> port 22: Connection refused'
>
>
> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
> ResourceWarning: unclosed  family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55760),
> raddr=('54.239.26.182', 443)>
>   self.queue.pop(0)
>
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-54-88-249-255.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: b'ssh: connect to host
> ec2-54-88-249-255.compute-1.amazonaws.com
> port 22: Connection refused'
>
>
> 

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Josh Rosen
AFAIK this is not being pushed down because it involves an implicit cast
and we currently don't push casts into data sources or scans; see
https://github.com/databricks/spark-redshift/issues/155 for a
possibly-related discussion.
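
One thing that might be worth experimenting with (an untested sketch, not a
confirmed workaround) is casting the literals to timestamp explicitly, so that
the comparison doesn't force an implicit cast of the column itself:

    // Sketch only: compare the timestamp column against timestamp literals rather than
    // strings, so no cast needs to be applied to the `registration` column.
    sqlContext.sql(
      "SELECT * FROM events " +
      "WHERE `registration` >= CAST('2015-05-28' AS TIMESTAMP) " +
      "AND `registration` <= CAST('2015-05-29' AS TIMESTAMP)")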

On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh 
wrote:

> Are you comparing strings in here or timestamp?
>
> Filter ((cast(registration#37 as string) >= 2015-05-28) &&
> (cast(registration#37 as string) <= 2015-05-29))
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 April 2016 at 18:04, Kiran Chitturi 
> wrote:
>
>> Hi,
>>
>> Timestamp range filter queries in SQL are not getting pushed down to the
>> PrunedFilteredScan instances. The filtering is happening at the Spark layer.
>>
>> The physical plan for timestamp range queries is not showing the pushed
>> filters where as range queries on other types is working fine as the
>> physical plan is showing the pushed filters.
>>
>> Please see below for code and examples.
>>
>> *Example:*
>>
>> *1.* Range filter queries on Timestamp types
>>
>>*code: *
>>
>>> sqlContext.sql("SELECT * from events WHERE `registration` >=
>>> '2015-05-28' AND `registration` <= '2015-05-29' ")
>>
>>*Full example*:
>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>> *plan*:
>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>>
>> *2. * Range filter queries on Long types
>>
>> *code*:
>>
>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>>> `length` <= '1000'")
>>
>> *Full example*:
>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>> *plan*:
>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>>
>> The SolrRelation class we use extends the PrunedFilteredScan.
>>
>> Since Solr supports date ranges, I would like for the timestamp filters
>> to be pushed down to the Solr query.
>>
>> Are there limitations on the type of filters that are passed down with
>> Timestamp types ?
>> Is there something that I should do in my code to fix this ?
>>
>> Thanks,
>> --
>> Kiran Chitturi
>>
>>
>


Re: Kryo serialization mismatch in spark sql windowing function

2016-04-06 Thread Josh Rosen
Spark is compiled against a custom fork of Hive 1.2.1 which added shading
of Protobuf and removed shading of Kryo. What I think is happening
here is that stock Hive 1.2.1 is taking precedence, so the Kryo instance
that it's returning is an instance of the shaded/relocated Hive version rather
than the unshaded, stock Kryo that Spark is expecting here.

I just so happen to have a patch which reintroduces the shading of Kryo
(motivated by other factors): https://github.com/apache/spark/pull/12215;
there's a chance that a backport of this patch might fix this problem.

However, I'm a bit curious about how your classpath is set up and why stock
1.2.1's shaded Kryo is being used here.

/cc +Marcelo Vanzin and +Steve Loughran, who may know more.

On Wed, Apr 6, 2016 at 6:08 PM Soam Acharya  wrote:

> Hi folks,
>
> I have a build of Spark 1.6.1 on which spark sql seems to be functional
> outside of windowing functions. For example, I can create a simple external
> table via Hive:
>
> CREATE EXTERNAL TABLE PSTable (pid int, tty string, time string, cmd
> string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '/user/test/ps';
>
> Ensure that the table is pointing to some valid data, set up spark sql to
> point to the Hive metastore (we're running Hive 1.2.1) and run a basic test:
>
> spark-sql> select * from PSTable;
> 7239pts/0   00:24:31java
> 9993pts/9   00:00:00ps
> 9994pts/9   00:00:00tail
> 9995pts/9   00:00:00sed
> 9996pts/9   00:00:00sed
>
> But when I try to run a windowing function which I know runs on Hive, I get:
>
> spark-sql> select a.pid ,a.time, a.cmd, min(a.time) over (partition by
> a.cmd order by a.time ) from PSTable a;
> org.apache.spark.SparkException: Task not serializable
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
> :
> :
> Caused by: java.lang.ClassCastException:
> org.apache.hive.com.esotericsoftware.kryo.Kryo cannot be cast to
> com.esotericsoftware.kryo.Kryo
> at
> org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.serializePlan(HiveShim.scala:178)
> at
> org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.writeExternal(HiveShim.scala:191)
> at
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>
> Any thoughts or ideas would be appreciated!
>
> Regards,
>
> Soam
>


Re: Spark master keeps running out of RAM

2016-03-31 Thread Josh Rosen
One possible cause of a standalone master OOMing is
https://issues.apache.org/jira/browse/SPARK-6270.  In 2.x, this will be
fixed by https://issues.apache.org/jira/browse/SPARK-12299. In 1.x, one
mitigation is to disable event logging. Another workaround would be to
produce a patch which disables eager UI reconstruction for 1.x.
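
For the event-logging mitigation, a minimal sketch (applied to the applications
you submit, since the standalone master rebuilds finished applications' UIs from
their event logs):

    // Sketch: turn off event logging so the master does not eagerly reconstruct
    // application UIs from event logs when applications finish.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "false")

(The same setting can also go into spark-defaults.conf instead of being set
programmatically.)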

On Thu, Mar 31, 2016 at 11:16 AM Dillian Murphey 
wrote:

> Why would the spark master run out of RAM if I have too  many slaves?  Is
> this a flaw in the coding?  I'm just a user of spark. The developer that
> set this up left the company, so I'm starting from the top here.
>
> So I noticed if I spawn lots of jobs, my spark master ends up crashing due
> to low memory.  It makes sense to me the master would just be a
> brain/controller to dish out jobs and the resources on the slaves is what
> would get used up, not the master.
>
> Thanks for any ideas/concepts/info.
>
> Appreciate it much
>
>


Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation:
https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna 
wrote:

>
>
> Hi,
>
> Scala version: 2.11.7 (had to upgrade the scala version to enable case
> classes to accept more than 22 parameters.)
>
> Spark version:1.6.1.
>
> PFB pom.xml
>
> Getting below error when trying to setup spark on intellij IDE,
>
> 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
> Exception in thread "main" java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class at
> org.apache.spark.util.TimeStampedWeakValueHashMap.<init>(TimeStampedWeakValueHashMap.scala:42)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:298) at
> com.examples.testSparkPost$.main(testSparkPost.scala:27) at
> com.examples.testSparkPost.main(testSparkPost.scala) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606) at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) Caused
> by: java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class at
> java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
> java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
> java.security.AccessController.doPrivileged(Native Method) at
> java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
> java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at
> java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 9 more
>
> pom.xml:
>
> http://maven.apache.org/POM/4.0.0; 
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/maven-v4_0_0.xsd;>
> 4.0.0
> StreamProcess
> StreamProcess
> 0.0.1-SNAPSHOT
> ${project.artifactId}
> This is a boilerplate maven project to start using Spark in 
> Scala
> 2010
>
> 
> 1.6
> 1.6
> UTF-8
> 2.10
> 
> 2.11.7
> 
>
> 
> 
> 
> cloudera-repo-releases
> https://repository.cloudera.com/artifactory/repo/
> 
> 
>
> 
> src/main/scala
> src/test/scala
> 
> 
> 
> maven-assembly-plugin
> 
> 
> package
> 
> single
> 
> 
> 
> 
> 
> 
> jar-with-dependencies
> 
> 
> 
> 
> 
> net.alchim31.maven
> scala-maven-plugin
> 3.2.2
> 
> 
> 
> compile
> testCompile
> 
> 
> 
> 
> -dependencyfile
> 
> ${project.build.directory}/.scala_dependencies
> 
> 
> 
> 
> 
>
> 
> 
> maven-assembly-plugin
> 2.4.1
> 
> 
> jar-with-dependencies
> 
> 
> 
> 
> make-assembly
> package
> 
> single
> 
> 
> 
> 
> 
> 
> 
> 
> org.scala-lang
> scala-library
> ${scala.version}
> 
> 
> org.mongodb.mongo-hadoop
> mongo-hadoop-core
> 1.4.2
> 
> 
> javax.servlet
> servlet-api
> 
> 
> 
> 
> org.mongodb
> mongodb-driver
> 3.2.2
> 
> 
> javax.servlet
> servlet-api
> 
> 
> 
> 
> org.mongodb
> mongodb-driver
> 3.2.2
> 
> 
> javax.servlet
>

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself,
so you'll have to change your deps to use the 2.11 versions of Spark.
e.g. spark-streaming_2.10 -> spark-streaming_2.11.
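
The same idea as an sbt sketch (your build is Maven, where the equivalent change
is switching each Spark artifactId from *_2.10 to *_2.11; the versions below are
the ones from your mail):

    // Sketch: make the Spark artifact suffix match the Scala version you compile with.
    scalaVersion := "2.11.7"

    // "%%" appends the Scala binary version, so these resolve to
    // spark-core_2.11 / spark-streaming_2.11:
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.6.1",
      "org.apache.spark" %% "spark-streaming" % "1.6.1"
    )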

On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen <joshro...@databricks.com> wrote:

> See the instructions in the Spark documentation:
> https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
>
> On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna <
> satyajit.apas...@gmail.com> wrote:
>
>>
>>
>> Hi,
>>
>> Scala version: 2.11.7 (had to upgrade the scala version to enable case
>> classes to accept more than 22 parameters.)
>>
>> Spark version:1.6.1.
>>
>> PFB pom.xml
>>
>> Getting below error when trying to setup spark on intellij IDE,
>>
>> 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> scala/collection/GenTraversableOnce$class at
>> org.apache.spark.util.TimeStampedWeakValueHashMap.<init>(TimeStampedWeakValueHashMap.scala:42)
>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:298) at
>> com.examples.testSparkPost$.main(testSparkPost.scala:27) at
>> com.examples.testSparkPost.main(testSparkPost.scala) at
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606) at
>> com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) Caused
>> by: java.lang.ClassNotFoundException:
>> scala.collection.GenTraversableOnce$class at
>> java.net.URLClassLoader$1.run(URLClassLoader.java:366) at
>> java.net.URLClassLoader$1.run(URLClassLoader.java:355) at
>> java.security.AccessController.doPrivileged(Native Method) at
>> java.net.URLClassLoader.findClass(URLClassLoader.java:354) at
>> java.lang.ClassLoader.loadClass(ClassLoader.java:425) at
>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at
>> java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 9 more
>>
>> pom.xml:
>>
>> http://maven.apache.org/POM/4.0.0; 
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>> http://maven.apache.org/maven-v4_0_0.xsd;>
>> 4.0.0
>> StreamProcess
>> StreamProcess
>> 0.0.1-SNAPSHOT
>> ${project.artifactId}
>> This is a boilerplate maven project to start using Spark in 
>> Scala
>> 2010
>>
>> 
>> 1.6
>> 1.6
>> UTF-8
>> 2.10
>> 
>> 2.11.7
>> 
>>
>> 
>> 
>> 
>> cloudera-repo-releases
>> https://repository.cloudera.com/artifactory/repo/
>> 
>> 
>>
>> 
>> src/main/scala
>> src/test/scala
>> 
>> 
>> 
>> maven-assembly-plugin
>> 
>> 
>> package
>> 
>> single
>> 
>> 
>> 
>> 
>> 
>> 
>> jar-with-dependencies
>> 
>> 
>> 
>> 
>> 
>> net.alchim31.maven
>> scala-maven-plugin
>> 3.2.2
>> 
>> 
>> 
>> compile
>> testCompile
>> 
>> 
>> 
>> 
>> -dependencyfile
>> 
>> ${project.build.directory}/.scala_dependencies
>> 
>> 
>> 
>> 
>> 
>>
>> 
>> 
>> maven-assembly-plugin
>> 2.4.1
>> 

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Josh Rosen
AFAIK we haven't actually broken 2.6 compatibility yet for PySpark itself,
since Jenkins is still testing that configuration.

I think the problem that you're seeing is that dev/run-tests /
dev/run-tests-jenkins only work against Python 2.7+ right now. However,
./python/run-tests should be able to launch and run PySpark tests with
Python 2.6. Try ./python/run-tests --help for more details.

On Fri, Mar 11, 2016 at 10:31 AM Holden Karau  wrote:

> So the run tests command allows you to specify the python version to test
> again - maybe specify python2.7
>
> On Friday, March 11, 2016, Gayathri Murali 
> wrote:
>
>> I do have 2.7 installed and unittest2 package available. I still see this
>> error :
>>
>> Please install unittest2 to test with Python 2.6 or earlier
>> Had test failures in pyspark.sql.tests with python2.6; see logs.
>>
>> Thanks
>> Gayathri
>>
>>
>>
>> On Fri, Mar 11, 2016 at 10:07 AM, Davies Liu 
>> wrote:
>>
>>> Spark 2.0 is dropping the support for Python 2.6, it only work with
>>> Python 2.7, and 3.4+
>>>
>>> On Thu, Mar 10, 2016 at 11:17 PM, Gayathri Murali
>>>  wrote:
>>> > Hi all,
>>> >
>>> > I am trying to run python unit tests.
>>> >
>>> > I currently have Python 2.6 and 2.7 installed. I installed unittest2
>>> against both of them.
>>> >
>>> > When I try to run /python/run-tests with Python 2.7 I get the
>>> following error :
>>> >
>>> > Please install unittest2 to test with Python 2.6 or earlier
>>> > Had test failures in pyspark.sql.tests with python2.6; see logs.
>>> >
>>> > When I try to run /python/run-tests with Python 2.6 I get the
>>> following error:
>>> >
>>> > Traceback (most recent call last):
>>> >   File "./python/run-tests.py", line 42, in <module>
>>> > from sparktestsupport.modules import all_modules  # noqa
>>> >   File
>>> "/Users/gayathri/spark/python/../dev/sparktestsupport/modules.py", line 18,
>>> in <module>
>>> > from functools import total_ordering
>>> > ImportError: cannot import name total_ordering
>>> >
>>> > total_ordering is a package that is available in 2.7.
>>> >
>>> > Can someone help?
>>> >
>>> > Thanks
>>> > Gayathri
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>


Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Josh Rosen
Does anyone implement Spark's serializer interface
(org.apache.spark.serializer.Serializer) in your own third-party code? If
so, please let me know because I'd like to change this interface from a
DeveloperAPI to private[spark] in Spark 2.0 in order to do some cleanup and
refactoring. I think that the only reason it was a DeveloperAPI was Shark,
but I'd like to confirm this by asking the community.

Thanks,
Josh


Re: Unresolved dep when building project with spark 1.6

2016-02-29 Thread Josh Rosen
Have you tried removing the leveldbjni files from your local ivy cache? My
hunch is that this is a problem with some local cache state rather than the
dependency simply being unavailable / not existing (note that the error
message was "origin location must be absolute:[...]", not that the files
couldn't be found).

On Mon, Feb 29, 2016 at 2:19 AM Hao Ren  wrote:

> Hi,
>
> I am upgrading my project to spark 1.6.
> It seems that the deps are broken.
>
> Deps used in sbt
>
> val scalaVersion = "2.10"
> val sparkVersion  = "1.6.0"
> val hadoopVersion = "2.7.1"
>
> // Libraries
> val scalaTest = "org.scalatest" %% "scalatest" % "2.2.4" % "test"
> val sparkSql  = "org.apache.spark" %% "spark-sql" % sparkVersion
> val sparkML   = "org.apache.spark" %% "spark-mllib" % sparkVersion
> val hadoopAWS = "org.apache.hadoop" % "hadoop-aws" % hadoopVersion
> val scopt = "com.github.scopt" %% "scopt" % "3.3.0"
> val jodacvt   = "org.joda" % "joda-convert" % "1.8.1"
>
> Sbt exception:
>
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: org.fusesource.leveldbjni#leveldbjni-all;1.8:
> org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.pom(pom.original)
> origin location must be absolute:
> file:/home/invkrh/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom
> [warn] ::
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] org.fusesource.leveldbjni:leveldbjni-all:1.8
> [warn]  +- org.apache.spark:spark-network-shuffle_2.10:1.6.0
> [warn]  +- org.apache.spark:spark-core_2.10:1.6.0
> [warn]  +- org.apache.spark:spark-catalyst_2.10:1.6.0
> [warn]  +- org.apache.spark:spark-sql_2.10:1.6.0
> (/home/invkrh/workspace/scala/confucius/botdet/build.sbt#L14)
> [warn]  +- org.apache.spark:spark-mllib_2.10:1.6.0
> (/home/invkrh/workspace/scala/confucius/botdet/build.sbt#L14)
> [warn]  +- fr.leboncoin:botdet_2.10:0.1
> sbt.ResolveException: unresolved dependency:
> org.fusesource.leveldbjni#leveldbjni-all;1.8:
> org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.pom(pom.original)
> origin location must be absolute:
> file:/home/invkrh/.m2/repository/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.pom
>
> Thank you.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>


Re: bug for large textfiles on windows

2016-01-25 Thread Josh Rosen
Hi Christopher,

What would be super helpful here is a standalone reproduction. Ideally this
would be a single Scala file or set of commands that I can run in
`spark-shell` in order to reproduce this. Ideally, this code would generate
a giant file, then try to read it in a way that demonstrates the bug. If
you have such a reproduction, could you attach it to that JIRA ticket?
Thanks!

On Mon, Jan 25, 2016 at 7:53 AM Christopher Bourez <
christopher.bou...@gmail.com> wrote:

> Dears,
>
> I would like to re-open a case for a potential bug (current status is
> resolved but it sounds not) :
>
> *https://issues.apache.org/jira/browse/SPARK-12261
> *
>
> I believe there is something wrong about the memory management under
> windows
>
> It makes no sense to only be able to work with files smaller than a few MB...
>
> Do not hesitate to ask me questions if you try to help and reproduce the
> bug,
>
> Best
>
> Christopher Bourez
> 06 17 17 50 60
>


Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Josh Rosen
Is speculation enabled? This TaskCommitDenied by driver error is thrown by
writers who lost the race to commit an output partition. I don't think this
has anything to do with key skew etc. Replacing the groupByKey with a count
will mask this exception because the coordination does not get triggered in
non save/write operations.
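
If you want to rule speculation out quickly, a minimal sketch (spark.speculation
defaults to false, so this just makes the setting explicit):

    // Sketch: check whether speculative execution is enabled, and disable it explicitly.
    println(sc.getConf.get("spark.speculation", "false"))

    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "false")
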
On Thu, Jan 21, 2016 at 2:46 PM Holden Karau  wrote:

> Before we dig too far into this, the thing which most quickly jumps out to
> me is groupByKey which could be causing some problems - whats the
> distribution of keys like? Try replacing the groupByKey with a count() and
> see if the pipeline works up until that stage. Also 1G of driver memory is
> also a bit small for something with 90 executors...
>
> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra 
> wrote:
>
>>
>>
>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>>
>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for
>> source because spark.app.id is not set.
>>
>> spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
>>
>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads
>> feature cannot be used because libhadoop cannot be loaded.
>>
>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4
>> in memory! (computed 60.2 MB so far)
>>
>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
>> instead.
>>
>> [Stage 1:>(2260 + 7)
>> / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
>> (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1440, attempt: 4530
>>
>> [Stage 1:>(2260 + 6)
>> / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
>> (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1488, attempt: 4531
>>
>> [Stage 1:>(2261 + 4)
>> / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
>> (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1982, attempt: 4532
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID
>> 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2214, attempt: 4482
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
>> 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2168, attempt: 4436
>>
>>
>> I am running with:
>>
>> spark-submit --class "myclass" \
>>
>>   --num-executors 90 \
>>
>>   --driver-memory 1g \
>>
>>   --executor-memory 60g \
>>
>>   --executor-cores 8 \
>>
>>   --master yarn-client \
>>
>>   --conf "spark.executor.extraJavaOptions=-verbose:gc
>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>
>>   my.jar
>>
>>
>> There are 2262 input files totaling just 98.6G. The DAG is basically
>> textFile().map().filter().groupByKey().saveAsTextFile().
>>
>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau 
>> wrote:
>>
>>> Can you post more of your log? How big are the partitions? What is the
>>> action you are performing?
>>>
>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra 
>>> wrote:
>>>
 Example warning:

 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
 (TID 4436, XXX): TaskCommitDenied (Driver denied task commit) for job:
 1, partition: 2168, attempt: 4436


 Is there a solution for this? Increase driver memory? I'm using just 1G
 driver memory but ideally I won't have to increase it.

 The RDD being processed has 2262 partitions.

 Arun

>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: ibsnappyjava.so: failed to map segment from shared object

2016-01-11 Thread Josh Rosen
This is due to the snappy-java library; I think that you'll have to
configure either java.io.tmpdir or org.xerial.snappy.tempdir; see
https://github.com/xerial/snappy-java/blob/1198363176ad671d933fdaf0938b8b9e609c0d8a/src/main/java/org/xerial/snappy/SnappyLoader.java#L335
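
A sketch of what that could look like (the /home/yatinla/tmp path below is
hypothetical; the important part is that it is a directory you own that is not
mounted noexec):

    // Sketch: point snappy-java's native-library extraction (and the JVM temp dir)
    // at a writable, executable directory instead of /tmp.
    import org.apache.spark.SparkConf

    val tmpOpts = "-Dorg.xerial.snappy.tempdir=/home/yatinla/tmp -Djava.io.tmpdir=/home/yatinla/tmp"

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", tmpOpts)
    // For the driver JVM these properties need to be in place before it starts, e.g. via
    // spark.driver.extraJavaOptions in spark-defaults.conf or --driver-java-options on the
    // pyspark / spark-submit command line.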



On Mon, Jan 11, 2016 at 7:12 PM, yatinla  wrote:

> I'm trying to get pyspark running on a shared web host.  I can get into the
> pyspark shell but whenever I run a simple command like
> sc.parallelize([1,2,3,4]).sum() I get an error that seems to stem from some
> kind of permission issue with libsnappyjava.so:
>
> Caused by: java.lang.UnsatisfiedLinkError:
> /tmp/snappy-1.1.2-b7abadd6-9b05-4dee-885a-c80434db68e2-libsnappyjava.so:
> /tmp/snappy-1.1.2-b7abadd6-9b05-4dee-885a-c80434db68e2-libsnappyjava.so:
> failed to map segment from shared object: Operation not permitted
>
> I'm no Linux expert but I suspect it has something to do with noexec maybe
> on the /tmp folder?  So I tried setting the TMP, TEMP, and TMPDIR
> environment variables to a tmp folder in my own home directory but I get
> the
> same error and it still says /tmp/snappy... not the folder in my my home
> directory.  So then I also tried, in pyspark using SparkConf, setting the
> spark.local.dir property to my personal tmp folder, and same for the
> spark.externalBlockStore.baseDir.  But no matter what, it seems like the
> error happens and always refers to /tmp not my personal folder.
>
> Any help would be greatly appreciated.  It all works great on my laptop,
> just not on the web host which is a shared linux hosting plan so it doesn't
> seem surprising that there would be permission issues with /tmp.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/ibsnappyjava-so-failed-to-map-segment-from-shared-object-tp25937.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: how garbage collection works on parallelize

2016-01-08 Thread Josh Rosen
It won't be GC'd as long as the RDD which results from `parallelize()` is
kept around; that RDD keeps strong references to the parallelized
collection's elements in order to enable fault-tolerance.
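
A small Scala sketch of the consequence (sizes and names are illustrative only):

    // The RDD produced by parallelize() keeps strong references to the original
    // collection's elements so that lost partitions can be recomputed.
    def doSomething(sc: org.apache.spark.SparkContext): org.apache.spark.rdd.RDD[Double] = {
      val reallyBigArray = Array.fill(1000 * 1000)(math.random)  // stand-in for the huge array
      sc.parallelize(reallyBigArray)  // the returned RDD references reallyBigArray's contents
    }

    // reallyBigArray only becomes eligible for GC once the returned RDD (and anything
    // that still depends on its lineage) is no longer reachable.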

On Fri, Jan 8, 2016 at 6:50 PM, jluan  wrote:

> Hi,
>
> I am curious about garbage collect on an object which gets parallelized.
> Say
> if we have a really large array (say 40GB in ram) that we want to
> parallelize across our machines.
>
> I have the following function:
>
> def doSomething(): RDD[Double] = {
> val reallyBigArray = Array[Double](some really big value)
> sc.parallelize(reallyBigArray)
> }
>
> Theoretically, will reallyBigArray be marked for GC? Or will reallyBigArray
> not be GC'd because parallelize somehow has a reference on reallyBigArray?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-garbage-collection-works-on-parallelize-tp25926.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I
imagine that they're also capable of installing a standalone Python
alongside that Spark version (without changing Python systemwide). For
instance, Anaconda/Miniconda make it really easy to install Python
2.7.x/3.x without impacting / changing the system Python and doesn't
require any special permissions to install (you don't need root / sudo
access). Does this address the Python versioning concerns for RHEL users?

On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:

> yeah, the practical concern is that we have no control over java or python
> version on large company clusters. our current reality for the vast
> majority of them is java 7 and python 2.6, no matter how outdated that is.
>
> i dont like it either, but i cannot change it.
>
> we currently don't use pyspark so i have no stake in this, but if we did i
> can assure you we would not upgrade to spark 2.x if python 2.6 was dropped.
> no point in developing something that doesnt run for majority of customers.
>
> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> As I pointed out in my earlier email, RHEL will support Python 2.6 until
>> 2020. So I'm assuming these large companies will have the option of riding
>> out Python 2.6 until then.
>>
>> Are we seriously saying that Spark should likewise support Python 2.6 for
>> the next several years? Even though the core Python devs stopped supporting
>> it in 2013?
>>
>> If that's not what we're suggesting, then when, roughly, can we drop
>> support? What are the criteria?
>>
>> I understand the practical concern here. If companies are stuck using
>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>> concern against the maintenance burden on this project, I would say that
>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>
>> I suppose if our main PySpark contributors are fine putting up with those
>> annoyances, then maybe we don't need to drop support just yet...
>>
>> Nick
>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente wrote:
>>
>>> Unfortunately, Koert is right.
>>>
>>> I've been in a couple of projects using Spark (banking industry) where
>>> CentOS + Python 2.6 is the toolbox available.
>>>
>>> That said, I believe it should not be a concern for Spark. Python 2.6 is
>>> old and busted, which is totally opposite to the Spark philosophy IMO.
>>>
>>>
>>> El 5 ene 2016, a las 20:07, Koert Kuipers  escribió:
>>>
>>> rhel/centos 6 ships with python 2.6, doesnt it?
>>>
>>> if so, i still know plenty of large companies where python 2.6 is the
>>> only option. asking them for python 2.7 is not going to work
>>>
>>> so i think its a bad idea
>>>
>>> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <
>>> juliet.hougl...@gmail.com> wrote:
>>>
 I don't see a reason Spark 2.0 would need to support Python 2.6. At
 this point, Python 3 should be the default that is encouraged.
 Most organizations acknowledge the 2.7 is common, but lagging behind
 the version they should theoretically use. Dropping python 2.6
 support sounds very reasonable to me.

 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> +1
>
> Red Hat supports Python 2.6 on RHEL 5 until 2020,
> but otherwise yes, Python 2.6 is ancient history and the core Python
> developers stopped supporting it in 2013. RHEL 5 is not a good enough
> reason to continue support for Python 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe
> we currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
> wrote:
>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin 
>> wrote:
>>
>>> Does anybody here care about us dropping support for Python 2.6 in
>>> Spark 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark 
>>> depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still 
>>> uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>>

>>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
I don't think that we're planning to drop Java 7 support for Spark 2.0.

Personally, I would recommend using Java 8 if you're running Spark 1.5.0+
and are using SQL/DataFrames so that you can benefit from improvements to
code cache flushing in the Java 8 JVMs. Spark SQL's generated classes can
fill up the JVM's code cache, which causes JIT to stop working for new
bytecode. Empirically, it looks like the Java 8 JVMs have an improved
ability to flush this code cache, thereby avoiding this problem.

TL;DR: I'd prefer to run Java 8 with Spark if given the choice.

On Tue, Jan 5, 2016 at 4:07 PM, Koert Kuipers <ko...@tresata.com> wrote:

> hey evil admin:)
> i think the bit about java was from me?
> if so, i meant to indicate that the reality for us is java is 1.7 on most
> (all?) clusters. i do not believe spark prefers java 1.8. my point was that
> even although java 1.7 is getting old as well it would be a major issue for
> me if spark dropped java 1.7 support.
>
> On Tue, Jan 5, 2016 at 6:53 PM, Carlile, Ken <carli...@janelia.hhmi.org>
> wrote:
>
>> As one of the evil administrators that runs a RHEL 6 cluster, we already
>> provide quite a few different version of python on our cluster pretty darn
>> easily. All you need is a separate install directory and to set the
>> PYTHON_HOME environment variable to point to the correct python, then have
>> the users make sure the correct python is in their PATH. I understand that
>> other administrators may not be so compliant.
>>
>> Saw a small bit about the java version in there; does Spark currently
>> prefer Java 1.8.x?
>>
>> —Ken
>>
>> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>
>> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
>>> while continuing to use a vanilla `python` executable on the executors
>>
>>
>> Whoops, just to be clear, this should actually read "while continuing to
>> use a vanilla `python` 2.7 executable".
>>
>> On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen <joshro...@databricks.com>
>> wrote:
>>
>>> Yep, the driver and executors need to have compatible Python versions. I
>>> think that there are some bytecode-level incompatibilities between 2.6 and
>>> 2.7 which would impact the deserialization of Python closures, so I think
>>> you need to be running the same 2.x version for all communicating Spark
>>> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
>>> driver while continuing to use a vanilla `python` executable on the
>>> executors (we have environment variables which allow you to control these
>>> separately).
>>>
>>> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> I think all the slaves need the same (or a compatible) version of
>>>> Python installed since they run Python code in PySpark jobs natively.
>>>>
>>>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> interesting i didnt know that!
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> even if python 2.7 was needed only on this one machine that launches
>>>>>> the app we can not ship it with our software because its gpl licensed
>>>>>>
>>>>>> Not to nitpick, but maybe this is important. The Python license is 
>>>>>> GPL-compatible
>>>>>> but not GPL <https://docs.python.org/3/license.html>:
>>>>>>
>>>>>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>>>>>> the GPL. All Python licenses, unlike the GPL, let you distribute a 
>>>>>> modified
>>>>>> version without making your changes open source. The GPL-compatible
>>>>>> licenses make it possible to combine Python with other software that is
>>>>>> released under the GPL; the others don’t.
>>>>>>
>>>>>> Nick
>>>>>> ​
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> i do not think so.
>>>>>>>
>>>>>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>>>>>> not have direct access to those.

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
>
> Note that you _can_ use a Python 2.7 `ipython` executable on the driver
> while continuing to use a vanilla `python` executable on the executors


Whoops, just to be clear, this should actually read "while continuing to
use a vanilla `python` 2.7 executable".

On Tue, Jan 5, 2016 at 3:07 PM, Josh Rosen <joshro...@databricks.com> wrote:

> Yep, the driver and executors need to have compatible Python versions. I
> think that there are some bytecode-level incompatibilities between 2.6 and
> 2.7 which would impact the deserialization of Python closures, so I think
> you need to be running the same 2.x version for all communicating Spark
> processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
> driver while continuing to use a vanilla `python` executable on the
> executors (we have environment variables which allow you to control these
> separately).
>
> On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I think all the slaves need the same (or a compatible) version of Python
>> installed since they run Python code in PySpark jobs natively.
>>
>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> interesting i didnt know that!
>>>
>>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> even if python 2.7 was needed only on this one machine that launches
>>>> the app we can not ship it with our software because its gpl licensed
>>>>
>>>> Not to nitpick, but maybe this is important. The Python license is 
>>>> GPL-compatible
>>>> but not GPL <https://docs.python.org/3/license.html>:
>>>>
>>>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>>>> the GPL. All Python licenses, unlike the GPL, let you distribute a modified
>>>> version without making your changes open source. The GPL-compatible
>>>> licenses make it possible to combine Python with other software that is
>>>> released under the GPL; the others don’t.
>>>>
>>>> Nick
>>>> ​
>>>>
>>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> i do not think so.
>>>>>
>>>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>>>> not have direct access to those.
>>>>>
>>>>> also, spark is easy for us to ship with our software since its apache
>>>>> 2 licensed, and it only needs to be present on the machine that launches
>>>>> the app (thanks to yarn).
>>>>> even if python 2.7 was needed only on this one machine that launches
>>>>> the app we can not ship it with our software because its gpl licensed, so
>>>>> the client would have to download it and install it themselves, and this
>>>>> would mean its an independent install which has to be audited and approved
>>>>> and now you are in for a lot of fun. basically it will never happen.
>>>>>
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>>>>>> imagine that they're also capable of installing a standalone Python
>>>>>> alongside that Spark version (without changing Python systemwide). For
>>>>>> instance, Anaconda/Miniconda make it really easy to install Python
>>>>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>>>>> require any special permissions to install (you don't need root / sudo
>>>>>> access). Does this address the Python versioning concerns for RHEL users?
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> yeah, the practical concern is that we have no control over java or
>>>>>>> python version on large company clusters. our current reality for the 
>>>>>>> vast
>>>>>>> majority of them is java 7 and python 2.6, no matter how outdated that 
>>>>>>> is.
>>>>>>>
>>>>>>> i dont like it either, but i cannot change it.
>>>>>>>
>>>>>>> we

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
Yep, the driver and executors need to have compatible Python versions. I
think that there are some bytecode-level incompatibilities between 2.6 and
2.7 which would impact the deserialization of Python closures, so I think
you need to be running the same 2.x version for all communicating Spark
processes. Note that you _can_ use a Python 2.7 `ipython` executable on the
driver while continuing to use a vanilla `python` executable on the
executors (we have environment variables which allow you to control these
separately).

On Tue, Jan 5, 2016 at 3:05 PM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> I think all the slaves need the same (or a compatible) version of Python
> installed since they run Python code in PySpark jobs natively.
>
> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:
>
>> interesting i didnt know that!
>>
>> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed
>>>
>>> Not to nitpick, but maybe this is important. The Python license is 
>>> GPL-compatible
>>> but not GPL <https://docs.python.org/3/license.html>:
>>>
>>> Note GPL-compatible doesn’t mean that we’re distributing Python under
>>> the GPL. All Python licenses, unlike the GPL, let you distribute a modified
>>> version without making your changes open source. The GPL-compatible
>>> licenses make it possible to combine Python with other software that is
>>> released under the GPL; the others don’t.
>>>
>>> Nick
>>> ​
>>>
>>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> i do not think so.
>>>>
>>>> does the python 2.7 need to be installed on all slaves? if so, we do
>>>> not have direct access to those.
>>>>
>>>> also, spark is easy for us to ship with our software since its apache 2
>>>> licensed, and it only needs to be present on the machine that launches the
>>>> app (thanks to yarn).
>>>> even if python 2.7 was needed only on this one machine that launches
>>>> the app we can not ship it with our software because its gpl licensed, so
>>>> the client would have to download it and install it themselves, and this
>>>> would mean its an independent install which has to be audited and approved
>>>> and now you are in for a lot of fun. basically it will never happen.
>>>>
>>>>
>>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com>
>>>> wrote:
>>>>
>>>>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>>>>> imagine that they're also capable of installing a standalone Python
>>>>> alongside that Spark version (without changing Python systemwide). For
>>>>> instance, Anaconda/Miniconda make it really easy to install Python
>>>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>>>> require any special permissions to install (you don't need root / sudo
>>>>> access). Does this address the Python versioning concerns for RHEL users?
>>>>>
>>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> yeah, the practical concern is that we have no control over java or
>>>>>> python version on large company clusters. our current reality for the 
>>>>>> vast
>>>>>> majority of them is java 7 and python 2.6, no matter how outdated that 
>>>>>> is.
>>>>>>
>>>>>> i dont like it either, but i cannot change it.
>>>>>>
>>>>>> we currently don't use pyspark so i have no stake in this, but if we
>>>>>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>>>>> dropped. no point in developing something that doesnt run for majority of
>>>>>> customers.
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>
>>>>>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>>>>>> until 2020. So I'm assuming these large companies will have the option 

Re: Applicaiton Detail UI change

2015-12-21 Thread Josh Rosen
In the script / environment which launches your Spark driver, try setting
the SPARK_PUBLIC_DNS environment variable to point to a publicly-accessible
hostname.

See
https://spark.apache.org/docs/latest/configuration.html#environment-variables
for more details. This environment variable also affects the Spark
Standalone master and worker web UIs as well and will cause them to
advertise a public hostname in URLs and UIs.

On Mon, Dec 21, 2015 at 10:13 AM, carlilek 
wrote:

> I administer an HPC cluster that runs Spark clusters as jobs. We run Spark
> over the backend network (typically used for MPI), which is not accessible
> outside the cluster. Until we upgraded to 1.5.1 (from 1.3.1), this did not
> present a problem. Now the Application Detail UI link is returning the IP
> address of the backend network of the driver machine rather than that
> machine's hostname. Consequentially, users cannot access that page.  I am
> unsure what might have changed or how I might change the behavior back.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Applicaiton-Detail-UI-change-tp25756.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: fishing for help!

2015-12-21 Thread Josh Rosen
@Eran, are Server 1 and Server 2 both part of the same cluster / do they
have similar positions in the network topology w.r.t the Spark executors?
If Server 1 had fast network access to the executors but Server 2 was
across a WAN then I'd expect the job to run slower from Server 2 due to
the extra network latency / reduced bandwidth. This is assuming that you're
running the driver in non-cluster deploy mode (so the driver process runs
on the machine which submitted the job).

On Mon, Dec 21, 2015 at 1:30 PM, Igor Berman  wrote:

> look for differences: packages versions, cpu/network/memory diff etc etc
>
>
> On 21 December 2015 at 14:53, Eran Witkon  wrote:
>
>> Hi,
>> I know it is a broad question, but can you think of reasons why a pyspark
>> job which runs from server 1 using user 1 will run faster than the same
>> job when running on server 2 with user 1?
>> Eran
>>
>
>


Re: Spark 1.3.1 - Does SparkConext in multi-threaded env requires SparkEnv.set(env) anymore

2015-12-10 Thread Josh Rosen
Nope, you shouldn't have to do that anymore. As of
https://github.com/apache/spark/pull/2624, which is in Spark 1.2.0+,
SparkEnv's thread-local stuff was removed and replaced by a simple global
variable (since it was used in an *effectively* global way before (see my
comments on that PR)). As a result, there shouldn't really be any need for
you to call SparkEnv.set(env) in your user threads anymore.
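
As a small illustration (my own sketch, not code from this thread; the class
name and numbers are made up), threads sharing one SparkContext can simply
submit jobs directly, with no SparkEnv.set(...) call at the top of each thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object MultiThreadedJobs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("multi-threaded-jobs"))
        val threads = (1 to 4).map { i =>
          new Thread(new Runnable {
            def run(): Unit = {
              // Before Spark 1.2 you might have called SparkEnv.set(env) here; it is no longer needed.
              val sum = sc.parallelize(1 to 1000).map(_ * i).reduce(_ + _)
              println(s"thread $i sum = $sum")
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
        sc.stop()
      }
    }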

On Thu, Dec 10, 2015 at 11:03 AM, Nirav Patel  wrote:

> As subject says, do we still need to use static env in every thread that
> access sparkContext? I read some ref here.
>
>
> http://qnalist.com/questions/4956211/is-spark-context-in-local-mode-thread-safe
>
>
>


Re: Spark UI - Streaming Tab

2015-12-04 Thread Josh Rosen
The Streaming tab is only supported in the live UI, not in the History
Server.

On Fri, Dec 4, 2015 at 9:31 AM, patcharee  wrote:

> I ran streaming jobs, but no streaming tab appeared for those jobs.
>
> Patcharee
>
>
>
> On 04. des. 2015 18:12, PhuDuc Nguyen wrote:
>
> I believe the "Streaming" tab is dynamic - it appears once you have a
> streaming job running, not when the cluster is simply up. It does not
> depend on 1.6 and has been in there since at least 1.0.
>
> HTH,
> Duc
>
> On Fri, Dec 4, 2015 at 7:28 AM, patcharee 
> wrote:
>
>> Hi,
>>
>> We tried to get the streaming tab interface on Spark UI -
>> 
>> https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html
>>
>> Tested on version 1.5.1, 1.6.0-snapshot, but no such interface for
>> streaming applications at all. Any suggestions? Do we need to configure the
>> history UI somehow to get such interface?
>>
>> Thanks,
>> Patcharee
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Problem with RDD of (Long, Byte[Array])

2015-12-03 Thread Josh Rosen
Are the keys that you're joining on the byte arrays themselves? If so,
that's not likely to work because of how Java computes arrays' hashCodes;
see https://issues.apache.org/jira/browse/SPARK-597. If this turns out to
be the problem, we should look into strengthening the checks for array-type
keys in order to detect and fail fast for this join() case.
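
For illustration, a minimal sketch (mine, not from the thread) of a common
workaround when the join keys really are byte arrays: convert each Array[Byte]
key to an immutable Seq[Byte], which has element-wise equals/hashCode, before
joining:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def joinByteKeyed(left: RDD[(Array[Byte], Long)],
                      right: RDD[(Array[Byte], Long)]): RDD[(Seq[Byte], (Long, Long))] = {
      // Seq[Byte] keys compare and hash element-wise, unlike raw Java arrays.
      val l: RDD[(Seq[Byte], Long)] = left.map { case (k, v) => (k.toSeq, v) }
      val r: RDD[(Seq[Byte], Long)] = right.map { case (k, v) => (k.toSeq, v) }
      l.join(r)
    }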

On Thu, Dec 3, 2015 at 8:58 AM, Hervé Yviquel  wrote:

> Hi all,
>
> I have problem when using Array[Byte] in RDD operation.
> When I join two different RDDs of type [(Long, Array[Byte])], I obtain
> wrong results... But if I translate the byte array in integer and join two
> different RDDs of type [(Long, Integer)], then the results is correct...
> Any idea ?
>
> --
> The code:
>
> val byteRDD0 = sc.binaryRecords(path_arg0, 4).zipWithIndex.map{x => (x._2,
> x._1)}
> val byteRDD1 = sc.binaryRecords(path_arg1, 4).zipWithIndex.map{x => (x._2,
> x._1)}
>
> byteRDD0.foreach{x => println("BYTE0 " + x._1 + "=> "
> +ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
> byteRDD1.foreach{x => println("BYTE1 " + x._1 + "=> "
> +ByteBuffer.wrap(x._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
>
> val intRDD1 = byteRDD1.mapValues{x=>
> ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}
> val intRDD2 = byteRDD2.mapValues{x=>
> ByteBuffer.wrap(x).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt()}
>
> val byteJOIN = byteRDD1.join(byteRDD2)
> byteJOIN.foreach{x => println("BYTEJOIN " + x._1 + "=> " +
> ByteBuffer.wrap(x._2._1).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt() +
> " -
> "+ByteBuffer.wrap(x._2._2).order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt())}
>
> val intJOIN = intRDD1.join(intRDD2)
> intJOIN.foreach{x => println("INTJOIN " + x._1 + "=> " + x._2._1 + " - "+
> x._2._2)}
>
>
> --
> stdout:
>
> BYTE0 0=> 1
> BYTE0 1=> 3
> BYTE0 2=> 5
> BYTE0 3=> 7
> BYTE0 4=> 9
> BYTE0 5=> 11
> BYTE0 6=> 13
> BYTE0 7=> 15
> BYTE0 8=> 17
> BYTE0 9=> 19
> BYTE0 10=> 21
> BYTE0 11=> 23
> BYTE0 12=> 25
> BYTE0 13=> 27
> BYTE0 14=> 29
> BYTE0 15=> 31
> BYTE0 16=> 33
> BYTE0 17=> 35
> BYTE0 18=> 37
> BYTE0 19=> 39
> BYTE0 20=> 41
> BYTE0 21=> 43
> BYTE0 22=> 45
> BYTE0 23=> 47
> BYTE0 24=> 49
> BYTE0 25=> 51
> BYTE0 26=> 53
> BYTE0 27=> 55
> BYTE0 28=> 57
> BYTE0 29=> 59
> BYTE1 0=> 0
> BYTE1 1=> 1
> BYTE1 2=> 2
> BYTE1 3=> 3
> BYTE1 4=> 4
> BYTE1 5=> 5
> BYTE1 6=> 6
> BYTE1 7=> 7
> BYTE1 8=> 8
> BYTE1 9=> 9
> BYTE1 10=> 10
> BYTE1 11=> 11
> BYTE1 12=> 12
> BYTE1 13=> 13
> BYTE1 14=> 14
> BYTE1 15=> 15
> BYTE1 16=> 16
> BYTE1 17=> 17
> BYTE1 18=> 18
> BYTE1 19=> 19
> BYTE1 20=> 20
> BYTE1 21=> 21
> BYTE1 22=> 22
> BYTE1 23=> 23
> BYTE1 24=> 24
> BYTE1 25=> 25
> BYTE1 26=> 26
> BYTE1 27=> 27
> BYTE1 28=> 28
> BYTE1 29=> 29
> BYTEJOIN 13=> 1 - 0
> BYTEJOIN 19=> 1 - 0
> BYTEJOIN 15=> 1 - 0
> BYTEJOIN 4=> 1 - 0
> BYTEJOIN 21=> 1 - 0
> BYTEJOIN 16=> 1 - 0
> BYTEJOIN 22=> 1 - 0
> BYTEJOIN 25=> 1 - 0
> BYTEJOIN 28=> 1 - 0
> BYTEJOIN 29=> 1 - 0
> BYTEJOIN 11=> 1 - 0
> BYTEJOIN 14=> 1 - 0
> BYTEJOIN 27=> 1 - 0
> BYTEJOIN 0=> 1 - 0
> BYTEJOIN 24=> 1 - 0
> BYTEJOIN 23=> 1 - 0
> BYTEJOIN 1=> 1 - 0
> BYTEJOIN 6=> 1 - 0
> BYTEJOIN 17=> 1 - 0
> BYTEJOIN 3=> 1 - 0
> BYTEJOIN 7=> 1 - 0
> BYTEJOIN 9=> 1 - 0
> BYTEJOIN 8=> 1 - 0
> BYTEJOIN 12=> 1 - 0
> BYTEJOIN 18=> 1 - 0
> BYTEJOIN 20=> 1 - 0
> BYTEJOIN 26=> 1 - 0
> BYTEJOIN 10=> 1 - 0
> BYTEJOIN 5=> 1 - 0
> BYTEJOIN 2=> 1 - 0
> INTJOIN 13=> 27 - 13
> INTJOIN 19=> 39 - 19
> INTJOIN 15=> 31 - 15
> INTJOIN 4=> 9 - 4
> INTJOIN 21=> 43 - 21
> INTJOIN 16=> 33 - 16
> INTJOIN 22=> 45 - 22
> INTJOIN 25=> 51 - 25
> INTJOIN 28=> 57 - 28
> INTJOIN 29=> 59 - 29
> INTJOIN 11=> 23 - 11
> INTJOIN 14=> 29 - 14
> INTJOIN 27=> 55 - 27
> INTJOIN 0=> 1 - 0
> INTJOIN 24=> 49 - 24
> INTJOIN 23=> 47 - 23
> INTJOIN 1=> 3 - 1
> INTJOIN 6=> 13 - 6
> INTJOIN 17=> 35 - 17
> INTJOIN 3=> 7 - 3
> INTJOIN 7=> 15 - 7
> INTJOIN 9=> 19 - 9
> INTJOIN 8=> 17 - 8
> INTJOIN 12=> 25 - 12
> INTJOIN 18=> 37 - 18
> INTJOIN 20=> 41 - 20
> INTJOIN 26=> 53 - 26
> INTJOIN 10=> 21 - 10
> INTJOIN 5=> 11 - 5
> INTJOIN 2=> 5 - 2
>
>
> Thanks,
> Hervé
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Low Latency SQL query

2015-12-01 Thread Josh Rosen
Use a long-lived SparkContext rather than creating a new one for each query.
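
A rough sketch of what that can look like (the names, JDBC URL, and table are
placeholders I've made up, not details from this thread):

    import java.util.Properties
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}

    object QueryService {
      // Built once at application startup; context creation is the expensive part.
      private val sc = new SparkContext(new SparkConf().setAppName("low-latency-sql"))
      private val sqlContext = new SQLContext(sc)

      // Register the SQL source once so later queries can refer to it by name.
      sqlContext.read
        .jdbc("jdbc:postgresql://host/db", "source_table", new Properties())
        .registerTempTable("source_table")

      // Each request reuses the warm context instead of paying the startup cost again.
      def run(query: String): Array[Row] = sqlContext.sql(query).collect()
    }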

On Tue, Dec 1, 2015 at 11:52 AM Andrés Ivaldi  wrote:

> Hi,
>
> I'd like to use Spark to perform some transformations over data stored
> in SQL, but I need low latency. In my tests, Spark context creation and
> querying the data over SQL take too long.
>
> Any idea for speed up the process?
>
> regards.
>
> --
> Ing. Ivaldi Andres
>


Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Josh Rosen
Yep, you shouldn't enable *spark.driver.allowMultipleContexts* since it has
the potential to cause extremely difficult-to-debug task failures; it was
originally introduced as an escape-hatch to allow users whose workloads
happened to work "by accident" to continue using multiple active contexts,
but I would not write any new code which uses it.
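
For what it's worth, a hedged sketch of option 2) from a long-running server:
each request launches its own driver in a YARN container via SparkLauncher
(available since Spark 1.4, and assuming a local Spark distribution is
available to the server). The jar path and class name below are placeholders:

    import org.apache.spark.launcher.SparkLauncher

    val driverProcess = new SparkLauncher()
      .setAppResource("/path/to/my-spark-job.jar")   // placeholder application jar
      .setMainClass("com.example.MyJob")             // placeholder main class
      .setMaster("yarn-cluster")
      .setConf("spark.executor.memory", "2g")
      .launch()                                      // driver runs inside YARN, not in this JVM
    driverProcess.waitFor()                          // each launch gets its own SparkContext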

On Tue, Dec 1, 2015 at 5:45 PM, Ted Yu  wrote:

> Looks like #2 is better choice.
>
> On Tue, Dec 1, 2015 at 4:51 PM, Anfernee Xu  wrote:
>
>> Thanks Ted, so 1) is off the table, can I go with 2), yarn-cluster
>> mode? As the driver is running as a Yarn container, it's should be OK for
>> my usercase, isn't it?
>>
>> Anfernee
>>
>> On Tue, Dec 1, 2015 at 4:48 PM, Ted Yu  wrote:
>>
>>> For #1, looks like the config is used in test suites:
>>>
>>> .set("spark.driver.allowMultipleContexts", "true")
>>>
>>> ./sql/core/src/test/scala/org/apache/spark/sql/MultiSQLContextsSuite.scala
>>> .set("spark.driver.allowMultipleContexts", "true")
>>>
>>> ./sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeCoordinatorSuite.scala
>>>   .set("spark.driver.allowMultipleContexts", "false")
>>> val conf = new SparkConf().set("spark.driver.allowMultipleContexts",
>>> "false")
>>> .set("spark.driver.allowMultipleContexts", "true")
>>> .set("spark.driver.allowMultipleContexts", "true"))
>>> ./core/src/test/scala/org/apache/spark/SparkContextSuite.scala
>>>
>>> FYI
>>>
>>> On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu 
>>> wrote:
>>>
 Hi,

 I have a doubt regarding yarn-cluster mode and spark.driver.
 allowMultipleContexts for below usercases.

 I have a long running backend server where I will create a short-lived
 Spark job in response to each user request. Based on the fact that by
 default multiple SparkContexts cannot be created in the same JVM, it looks
 like I have 2 choices

 1) enable spark.driver.allowMultipleContexts

 2) run my jobs in yarn-cluster mode instead yarn-client

 For 1) I cannot find any official document, so looks like it's not
 encouraged, isn't it?
 For 2), I want to make sure yarn-cluster will NOT hit such
 limitation (single SparkContext per VM); apparently I have to do something on the
 driver side to push the result set back to my application.

 Thanks

 --
 --Anfernee

>>>
>>>
>>
>>
>> --
>> --Anfernee
>>
>
>


Re: spark.cleaner.ttl for 1.4.1

2015-11-30 Thread Josh Rosen
AFAIK the ContextCleaner should perform all of the cleaning *as long as
garbage collection is performed frequently enough on the driver*. See
https://issues.apache.org/jira/browse/SPARK-7689 and
https://github.com/apache/spark/pull/6220#issuecomment-102950055 for
discussion of this technicality.
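
A minimal sketch of what "frequent enough driver GC" can mean in practice for
a long-running streaming driver (this is my own illustration, not an official
recommendation, and the interval is arbitrary):

    import java.util.concurrent.{Executors, TimeUnit}

    val gcScheduler = Executors.newSingleThreadScheduledExecutor()
    gcScheduler.scheduleAtFixedRate(new Runnable {
      // A driver-side GC lets ContextCleaner notice unreachable RDDs,
      // broadcasts, and shuffles and clean up their remote state.
      def run(): Unit = System.gc()
    }, 30, 30, TimeUnit.MINUTES)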

On Mon, Nov 30, 2015 at 8:46 AM Michal Čizmazia  wrote:

> Does *spark.cleaner.ttl *still need to be used for Spark *1.4.1 *long-running
> streaming jobs? Or does *ContextCleaner* alone do all the cleaning?
>


Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes.

Sent from my phone

> On Nov 13, 2015, at 10:02 PM, AlexG  wrote:
> 
> Never mind; when I switched to Spark 1.5.0, my code works as written and is
> pretty fast! Looking at some Parquet related Spark jiras, it seems that
> Parquet is known to have some memory issues with buffering and writing, and
> that at least some were resolved in Spark 1.5.0. 
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/out-of-memory-error-with-Parquet-tp25381p25382.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Josh Rosen
When we remove this, we should add a style-checker rule to ban the import
so that it doesn't get added back by accident.

On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust 
wrote:

> Yeah, we should probably remove that.
>
> On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu  wrote:
>
>> If there is no option to let shell skip processing @VisibleForTesting ,
>> should the annotation be dropped ?
>>
>> Cheers
>>
>> On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin 
>> wrote:
>>
>>> We've had this in the past when using "@VisibleForTesting" in classes
>>> that for some reason the shell tries to process. QueryExecution.scala
>>> seems to use that annotation and that was added recently, so that's
>>> probably the issue.
>>>
>>> BTW, if anyone knows how Scala can find a reference to the original
>>> Guava class even after shading, I'd really like to know. I've looked
>>> several times and never found where the original class name is stored.
>>>
>>> On Mon, Nov 9, 2015 at 10:37 AM, Zhan Zhang 
>>> wrote:
>>> > Hi Folks,
>>> >
>>> > Has anybody met the following issue? I use "mvn package -Phive
>>> > -DskipTests" to build the package.
>>> >
>>> > Thanks.
>>> >
>>> > Zhan Zhang
>>> >
>>> >
>>> >
>>> > bin/spark-shell
>>> > ...
>>> > Spark context available as sc.
>>> > error: error while loading QueryExecution, Missing dependency 'bad
>>> symbolic
>>> > reference. A signature in QueryExecution.class refers to term
>>> annotations
>>> > in package com.google.common which is not available.
>>> > It may be completely missing from the current classpath, or the
>>> version on
>>> > the classpath might be incompatible with the version used when
>>> compiling
>>> > QueryExecution.class.', required by
>>> >
>>> /Users/zzhang/repo/spark/assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.2.0.jar(org/apache/spark/sql/execution/QueryExecution.class)
>>> > :10: error: not found: value sqlContext
>>> >import sqlContext.implicits._
>>> >   ^
>>> > :10: error: not found: value sqlContext
>>> >import sqlContext.sql
>>> >   ^
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Josh Rosen
Hi Sjoerd,

Did your job actually *fail* or did it just generate many spurious
exceptions? While the stacktrace that you posted does indicate a bug, I
don't think that it should have stopped query execution because Spark
should have fallen back to an interpreted code path (note the "Failed to
generate ordering, fallback to interpreted" in the error message).
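
For context, a hedged sketch of how Tungsten was typically toggled in Spark
1.5.x (the thread doesn't show the exact flag the poster used, so treat the
config key below as my assumption):

    import org.apache.spark.sql.SQLContext

    def disableTungsten(sqlContext: SQLContext): Unit = {
      // In Spark 1.5.x this switched the Tungsten code paths off.
      sqlContext.setConf("spark.sql.tungsten.enabled", "false")
    }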

On Tue, Oct 27, 2015 at 12:56 PM Sjoerd Mulder 
wrote:

> I have disabled it because it started generating ERRORs when upgrading
> from Spark 1.4 to 1.5.1
>
> 2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed to
> generate ordering, fallback to interpreted
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to
> compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9:
> Invalid character input "@" (character code 64)
>
> public SpecificOrdering
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
>   return new SpecificOrdering(expr);
> }
>
> class SpecificOrdering extends
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
>
>   private org.apache.spark.sql.catalyst.expressions.Expression[]
> expressions;
>
>
>
>   public
> SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[]
> expr) {
> expressions = expr;
>
>   }
>
>   @Override
>   public int compare(InternalRow a, InternalRow b) {
> InternalRow i = null;  // Holds current row being evaluated.
>
> i = a;
> boolean isNullA2;
> long primitiveA3;
> {
>   /* input[2, LongType] */
>
>   boolean isNull0 = i.isNullAt(2);
>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>
>   isNullA2 = isNull0;
>   primitiveA3 = primitive1;
> }
> i = b;
> boolean isNullB4;
> long primitiveB5;
> {
>   /* input[2, LongType] */
>
>   boolean isNull0 = i.isNullAt(2);
>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>
>   isNullB4 = isNull0;
>   primitiveB5 = primitive1;
> }
> if (isNullA2 && isNullB4) {
>   // Nothing
> } else if (isNullA2) {
>   return 1;
> } else if (isNullB4) {
>   return -1;
> } else {
>   int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 <
> primitiveB5 ? -1 : 0);
>   if (comp != 0) {
> return -comp;
>   }
> }
>
> return 0;
>   }
> }
>
> at
> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> at
> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
> at
> org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at
> org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> at
> org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
> at
> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
> at
> org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at
> org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at
> org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.compile(CodeGenerator.scala:362)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:139)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:37)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:425)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:422)
> at
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:294)
> at org.apache.spark.sql.execution.TungstenSort.org
> $apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131)
> at
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:59)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
Hi Jerry,

Do you have speculation enabled? A write which produces one million files /
output partitions might be using tons of driver memory via the
OutputCommitCoordinator's bookkeeping data structures.
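
As a hypothetical sketch of reducing that pressure (the paths, partition
column, and partition count are placeholders, and this is my suggestion rather
than something from the thread): disable speculation in the SparkConf used to
build the context, and cut the number of tasks feeding the partitioned write
so fewer output files are produced:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("partitioned-write")
      .set("spark.speculation", "false")          // avoid speculative-commit bookkeeping
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    sqlContext.read.parquet("/path/to/input")     // placeholder input
      .repartition(200)                           // fewer tasks means fewer files per partition value
      .write
      .partitionBy("day")                         // placeholder partition column
      .parquet("/path/to/output")                 // placeholder output path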

On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam  wrote:

> Hi spark guys,
>
> I think I hit the same issue SPARK-8890
> https://issues.apache.org/jira/browse/SPARK-8890. It is marked as
> resolved. However it is not. I have over a million output directories for 1
> single column in partitionBy. Not sure if this is a regression issue? Do I
> need to set some parameters to make it more memory efficient?
>
> Best Regards,
>
> Jerry
>
>
>
>
> On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam  wrote:
>
>> Hi guys,
>>
>> After waiting for a day, it actually causes OOM on the spark driver. I
>> configure the driver to have 6GB. Note that I didn't call refresh myself.
>> The method was called when saving the dataframe in parquet format. Also I'm
>> using partitionBy() on the DataFrameWriter to generate over 1 million
>> files. Not sure why it OOM the driver after the job is marked _SUCCESS in
>> the output folder.
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam  wrote:
>>
>>> Hi Spark users and developers,
>>>
>>> Has anyone encountered an issue where a Spark SQL job that produces a lot of
>>> files (over 1 million) hangs on the refresh method? I'm using
>>> spark 1.5.1. Below is the stack trace. I saw the parquet files are produced
>>> but the driver is doing something very intensively (it uses all the cpus).
>>> Does it mean Spark SQL cannot be used to produce over 1 million files in a
>>> single job?
>>>
>>> Thread 528: (state = BLOCKED)
>>>  - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled
>>> frame)
>>>  - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130
>>> (Compiled frame)
>>>  - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12,
>>> line=114 (Compiled frame)
>>>  - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19,
>>> line=415 (Compiled frame)
>>>  - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132
>>> (Compiled frame)
>>>  - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled
>>> frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
>>> @bci=4, line=447 (Compiled frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
>>> @bci=5, line=447 (Compiled frame)
>>>  -
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
>>> @bci=9, line=244 (Compiled frame)
>>>  -
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
>>> @bci=2, line=244 (Compiled frame)
>>>  -
>>> scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
>>> scala.Function1) @bci=22, line=33 (Compiled frame)
>>>  - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1)
>>> @bci=2, line=108 (Compiled frame)
>>>  -
>>> scala.collection.TraversableLike$class.map(scala.collection.TraversableLike,
>>> scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244
>>> (Compiled frame)
>>>  - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1,
>>> scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
>>> @bci=279, line=447 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh()
>>> @bci=8, line=453 (Interpreted frame)
>>>  - 
>>> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
>>> @bci=26, line=465 (Interpreted frame)
>>>  - 
>>> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache()
>>> @bci=12, line=463 (Interpreted frame)
>>>  - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1,
>>> line=540 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh()
>>> @bci=1, line=204 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp()
>>> @bci=392, line=152 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
>>> @bci=1, line=108 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
>>> @bci=1, line=108 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext,
>>> org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96,
>>> line=56 

Re: java.util.NoSuchElementException: key not found error

2015-10-21 Thread Josh Rosen
This is https://issues.apache.org/jira/browse/SPARK-10422, which has been
fixed in Spark 1.5.1.

On Wed, Oct 21, 2015 at 4:40 PM, Sourav Mazumder <
sourav.mazumde...@gmail.com> wrote:

> In 1.5.0 if I use randomSplit on a data frame I get this error.
>
> Here is the code snippet -
>
> val splitData = merged.randomSplit(Array(70,30))
> val trainData = splitData(0).persist()
> val testData = splitData(1)
>
> trainData.registerTempTable("trn")
>
> %sql select * from trn
>
> The exception goes like this -
>
> java.util.NoSuchElementException: key not found: 1910 at
> scala.collection.MapLike$class.default(MapLike.scala:228) at
> scala.collection.AbstractMap.default(Map.scala:58) at
> scala.collection.mutable.HashMap.apply(HashMap.scala:64) at
> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
> at
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
> at
> org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
> at
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at
> org.apache.spark.scheduler.Task.run(Task.scala:88) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Any idea ?
>
> regards,
> Sourav
>


Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Josh Rosen
Can you report this as an issue at
https://github.com/databricks/spark-avro/issues so that it's easier to
track? Thanks!

On Wed, Oct 14, 2015 at 1:38 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> I save my dataframe to avro with spark-avro 1.0.0 and it looks like this
> (using avro-tools tojson):
>
> {"field1":"value1","field2":976200}
> {"field1":"value2","field2":976200}
> {"field1":"value3","field2":614100}
>
> But when I use spark-avro 2.0.1, it looks like this:
>
> {"field1":{"string":"value1"},"field2":{"long":976200}}
> {"field1":{"string":"value2"},"field2":{"long":976200}}
> {"field1":{"string":"value3"},"field2":{"long":614100}}
>
> At this point I'd be happy to use spark-avro 1.0.0, except that it doesn't
> seem to support specifying a compression codec (I want deflate).
>


Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Josh Rosen
I believe that this is an instance of
https://issues.apache.org/jira/browse/SPARK-10422, which should be fixed in
upcoming 1.5.1 release.

On Thu, Sep 24, 2015 at 12:52 PM, Mark Hamstra 
wrote:

> Where do you see a race in the DAGScheduler?  On a quick look at your
> stack trace, this just looks to me like a Job where a Stage failed and then
> the DAGScheduler aborted the failed Job.
>
> On Thu, Sep 24, 2015 at 12:00 PM, robin_up  wrote:
>
>> Hi
>>
>> After upgrade to 1.5, we found a possible racing condition in DAGScheduler
>> similar to https://issues.apache.org/jira/browse/SPARK-4454.
>>
>> Here is the code creating the problem:
>>
>>
>> app_cpm_load = sc.textFile("/user/a/app_ecpm.txt").map(lambda x:
>> x.split(',')).map(lambda p: Row(app_id=str(p[0]), loc_filter=str(p[1]),
>> cpm_required=float(p[2]) ))
>> app_cpm = sqlContext.createDataFrame(app_cpm_load)
>> app_cpm.registerTempTable("app_cpm")
>>
>> app_rev_cpm_sql = '''select loc_filter from app_cpm'''
>> app_rev_cpm = sqlContext.sql(app_rev_cpm_sql)
>> app_rev_cpm.cache()
>> app_rev_cpm.show()
>>
>>
>>
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "/opt/spark/python/pyspark/sql/dataframe.py", line 256, in show
>> print(self._jdf.showString(n, truncate))
>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>> line 538, in __call__
>>   File "/opt/spark/python/pyspark/sql/utils.py", line 36, in deco
>> return f(*a, **kw)
>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
>> 300, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o46.showString.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 0
>> in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>> 1.0
>> (TID 4, spark-yarn-dn02): java.util.NoSuchElementException: key not found:
>> UK
>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>> at scala.collection.AbstractMap.default(Map.scala:58)
>> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>> at
>>
>> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
>> at
>>
>> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
>> at
>>
>> org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
>> at
>>
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>> at
>>
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>>
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
>> at
>>
>> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>> at
>> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>> at
>> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>> at
>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at
>> org.apache.spark.scheduler.DAGScheduler.org
>> 

Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency
directly in their Spark code or libraries? If so, how do you use it?

Similarly, does anyone use ShuffleHandle
directly?


Re: Re: Table is modified by DataFrameWriter

2015-09-16 Thread Josh Rosen
What are your JDBC properties configured to? Do you have overwrite mode
enabled?
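
In case it helps, a hedged sketch of writing without letting the writer
recreate the table with autogenerated column names (the column aliases and
append mode are my suggestion, not something confirmed in the thread, and
sqlContext is assumed to be the same one used in the snippet below):

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val url = "jdbc:mysql://host/db"          // placeholder JDBC URL
    val props = new Properties()
    sqlContext.sql("select '2015-09-17' as app_key, count(1) as t_amount from test")
      .write
      .mode(SaveMode.Append)                  // append to the existing table instead of replacing it
      .jdbc(url, "test", props)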

On Wed, Sep 16, 2015 at 7:39 PM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:

> Spark-1.4.1
>
>
> *From:* Ted Yu 
> *Date:* 2015-09-17 10:29
> *To:* guoqing0...@yahoo.com.hk
> *CC:* user 
> *Subject:* Re: Table is modified by DataFrameWriter
> Can you tell us which release you were using ?
>
> Thanks
>
>
>
> On Sep 16, 2015, at 7:11 PM, "guoqing0...@yahoo.com.hk" <
> guoqing0...@yahoo.com.hk> wrote:
>
> Hi all,
> I found the table structure was modified when using DataFrameWriter.jdbc
> to save the content of a DataFrame,
>
> sqlContext.sql("select '2015-09-17',count(1) from
> test").write.jdbc(url,test,properties)
>
> table structure before saving:
> app_key text
> t_amount bigint(20)
>
> saved:
> _c0 text
> _c1 bigint(20)
>
> Is there any way to just save the fields in sequence and not alter the
> table?
>
> Thanks!
>
>


Re: Exception in spark

2015-08-11 Thread Josh Rosen
Can you share a query or stack trace? More information would make this
question easier to answer.

On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani rrav...@gmail.com wrote:

 Hi all,

   We got an exception like
 “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
 to dataType on unresolved object” when using some queries with WHERE conditions.
 I am using Spark 1.4.0. Is this exception resolved in the latest Spark?


 Regards,
 Ravi



Re: master compile broken for scala 2.11

2015-07-14 Thread Josh Rosen
I've opened a PR to fix this; please take a look:
https://github.com/apache/spark/pull/7405

On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote:

 it works for scala 2.10, but for 2.11 i get:

 [ERROR]
 /home/koert/src/spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java:135:
 error: anonymous org.apache.spark.sql.execution.UnsafeExternalRowSorter$1
 is not abstract and does not override abstract method
 <B>minBy(Function1<InternalRow,B>,Ordering<B>) in TraversableOnce
 [ERROR]   return new AbstractScalaRowIterator() {




Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to read chunk

2015-06-25 Thread Josh Rosen
Which Spark version are you using? AFAIK the corruption bugs in sort-based
shuffle should have been fixed in newer Spark releases.

On Wed, Jun 24, 2015 at 12:25 PM, Piero Cinquegrana 
pcinquegr...@marketshare.com wrote:

  Switching spark.shuffle.manager from sort to hash fixed this issue as
 documented here:



 https://issues.apache.org/jira/browse/SPARK-4105





 *From:* Piero Cinquegrana [mailto:pcinquegr...@marketshare.com]
 *Sent:* Wednesday, June 24, 2015 12:09 PM
 *To:* 'user@spark.apache.org'
 *Subject:* com.esotericsoftware.kryo.KryoException: java.io.IOException:
 failed to read chunk



 Hello Spark Experts,



 I am facing the following issue.



 1)  I am converting a org.apache.spark.sql.Row into
 org.apache.spark.mllib.linalg.Vectors using sparse notation

 2)  After the parsing proceeds successfully I try to look at the
 result and I get the following error: com.esotericsoftware.kryo.KryoException:
 java.io.IOException: failed to read chunk



 Has anybody experience this before? I noticed this bug
 https://issues.apache.org/jira/browse/SPARK-4105



 Is it related?



 This is the command that creates the RDD of Sparse vectors.



   val parsedData = stack.map(row => LabeledPoint(row.getDouble(4),
 sparseVectorCat(row, CategoriesIdx, InteractionIds, tupleMap, vecLength)))



 parsedData:
 org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] =
 MapPartitionsRDD[64] at map at DataFrame.scala:848





 Here is how the data looks like after the conversion into sparse vectors



  val test2 = stack.filter("rows[1] is not null").head(10)



 test2: Array[org.apache.spark.sql.Row] = Array([2014-08-17
 20:19:00,2014-08-17
 20:19:00,545,Greenville-N.Bern-Washngtn,0.0,EST5EDT,Sunday,20,19,33,8,2014,Sunday|20,545,2014-08-17
 20:19:00,ArrayBuffer(37!2014-08-17 20:19:00!2014-08-17 20:19:50!2014-08-17
 20:19:50!2014-08-17
 20:19:50!545!Greenville-N.Bern-Washngtn!EST5EDT!UNKNOWN!10!CNN!Prime!M!UNKNOWN|10|CNN|Prime|M!2019!0.006461420296348449,
 37!2014-08-17 20:19:00!2014-08-17 20:19:45!2014-08-17 20:19:45!2014-08-17
 20:19:45!545!Greenville-N.Bern-Washngtn!EST5EDT!UNKNOWN!5!CNN!Prime!M!UNKNOWN|5|CNN|Prime|M!2019!0.006461420296348449)],
 [2014-08-17 23:45:00,2014-08-18
 00:45:00,625,Waco-Temple-Bryan,0.0,CST6CDT,Sunday,23,45,33,8,2014,Sunday|23,625,2014-08-18
 00:45:00,ArrayBuffer(276!2014-08-18 00:45:00!2014-08-18 00:45:14!2014-08-17
 23:45:14...



  val parsedTest2 = test2.map(row => LabeledPoint(row.getDouble(4),
 sparseVectorCat(row, CategoriesIdx, InteractionIds, tupleMap, vecLength)))



 parsedTest2: Array[org.apache.spark.mllib.regression.LabeledPoint] =
 Array((0.0,(2450,[241,1452,1480,1608,1706],[1.0,1.0,2019.0,2019.0,1.0])),
 (0.0,(2450,[6,1133,1342,2184,2314],[1.0,1.0,1050.0,1244.0,1.0])),
 (0.0,(2450,[414,1133,1310,1605,2206],[941.0,1.0,1.0,1.0,907.0])),
 (0.0,(2450,[322,761,981,1203,1957],[5203.0,1.0,1.0,1.0,5203.0])),
 (1.0,(2450,[117,322,943,1757,1957],[1.0,910.0,1.0,1.0,910.0])),
 (0.0,(2450,[645,1018,1074,1778,1974],[1.0,1507.0,1.0,1507.0,1.0])),
 (0.0,(2450,[522,542,814,1128,1749],[1.0,432.0,796.0,1.0,1.0])),
 (0.0,(2450,[166,322,413,1706,2256],[1.0,1260.0,1.0,1.0,1260.0])),
 (1.0,(2450,[1203,1295,1354,2189,2388],[1.0,1.0,1.0,3705.0,2823.0])),
 (6.0,(2450,[293,1203,1312,1627,2035],[2716.0,1.0,1.0,1.0,2716.0])))





  parsedData.first



 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 52.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 52.0 (TID 4243, ip-10-0-0-174.ec2.internal):
 com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to
 read chunk

 Serialization trace:

 values (org.apache.spark.sql.catalyst.expressions.GenericMutableRow)

 at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)

 at com.esotericsoftware.kryo.io.Input.require(Input.java:169)

 at
 com.esotericsoftware.kryo.io.Input.readUtf8_slow(Input.java:524)

 at com.esotericsoftware.kryo.io.Input.readUtf8(Input.java:517)

 at
 com.esotericsoftware.kryo.io.Input.readString(Input.java:447)

 at
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)

 at
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)

 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

 at
 com.twitter.chill.TraversableSerializer.read(Traversable.scala:43)

 at
 com.twitter.chill.TraversableSerializer.read(Traversable.scala:21)

 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)

 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)

 at 

Re: org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Josh Rosen
Mind filing a JIRA?

On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers ko...@tresata.com wrote:

 just a heads up, i was doing some basic coding using DataFrame, Row,
 StructType, etc. and i ended up with deadlocks in my sbt tests due to the
 usage of
 ScalaReflectionLock.synchronized in the spark sql code.
 the issue went away when i changed my tests to run consecutively...




Re: Serializer not switching

2015-06-22 Thread Josh Rosen
My hunch is that you changed spark.serializer to Kryo but left
spark.closureSerializer unmodified, so it's still using Java for closure
serialization.  Kryo doesn't really work as a closure serializer but
there's an open pull request to fix this:
https://github.com/apache/spark/pull/6361
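
A small sketch of the distinction (my own example; RowScorer is a made-up
helper): spark.serializer only covers data going over the wire or to disk,
while anything captured in a task closure still has to satisfy the closure
serializer, which by default means java.io.Serializable:

    import org.apache.spark.{SparkConf, SparkContext}

    object RowScorer extends Serializable {        // without Serializable the closure serializer fails
      def score(s: String): Int = s.length
    }

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // data serialization only
    val sc = new SparkContext(conf)
    val scores = sc.parallelize(Seq("a", "bb", "ccc")).map(RowScorer.score).collect()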

On Mon, Jun 22, 2015 at 5:42 AM, Sean Barzilay sesnbarzi...@gmail.com
wrote:

 My program is written in Scala. I am creating a jar and submitting it
 using spark-submit.
 My code is on a computer in an internal network with no internet so I
 can't send it.

 On Mon, Jun 22, 2015, 3:19 PM Akhil Das ak...@sigmoidanalytics.com
 wrote:

 How are you submitting the application? Could you paste the code that you
 are running?

 Thanks
 Best Regards

 On Mon, Jun 22, 2015 at 5:37 PM, Sean Barzilay sesnbarzi...@gmail.com
 wrote:

 I am trying to run a function on every line of a parquet file. The
 function is in an object. When I run the program, I get an exception that
 the object is not serializable. I read around the internet and found that I
 should use Kryo Serializer. I changed the setting in the spark conf and
 registered the object to the Kryo Serializer. When I run the program I
 still get the same exception (from the stack trace: at
  org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)). For some reason, the program is still
 trying to serialize using the default java Serializer. I am working with
 spark 1.4.





Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
If your job is dying due to out of memory errors in the post-shuffle stage,
I'd consider the following approach for implementing de-duplication /
distinct():

- Use sortByKey() to perform a full sort of your dataset.
- Use mapPartitions() to iterate through each partition of the sorted
dataset, consuming contiguous runs of records with the same key and
performing your aggregation / de-duplication logic.
- If you think that your workload will benefit from pre-shuffle
de-duplication, implement this using mapPartitions() before the sortBykey()
transformation.

This is effectively a way of performing external aggregation using the
existing Spark core APIs.
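
A minimal sketch of those steps (my own illustration, not code from the
thread; it keeps the longer of two values per key, matching the reduceByKey in
the original question):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def dedup(data: RDD[(String, String)]): RDD[(String, String)] = {
      // After a full sort, all records with the same key land in the same
      // partition and are contiguous, so one pass per partition suffices.
      data.sortByKey().mapPartitions { iter =>
        new Iterator[(String, String)] {
          private val buffered = iter.buffered
          def hasNext: Boolean = buffered.hasNext
          def next(): (String, String) = {
            val (key, first) = buffered.next()
            var best = first
            // Consume the contiguous run of records sharing this key.
            while (buffered.hasNext && buffered.head._1 == key) {
              val v = buffered.next()._2
              if (v.length > best.length) best = v
            }
            (key, best)
          }
        }
      }
    }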

In the longer term, I would recommend using DataFrames for this, since
there are planned optimizations for distinct queries that could make a
large difference here (e.g. the Tungsten data layout optimizations,
forthcoming Tungsten record-sorting optimizations, etc).

On Sat, Jun 13, 2015 at 10:49 AM, Gavin Yue yue.yuany...@gmail.com wrote:

 I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
 5TB of data in total.

 The data is formatted as key \t value. After union, I want to remove the
 duplicates among keys, so each key should be unique and have only one
 value.

 Here is what I am doing.

 val folders = Array("folder1", "folder2", ..., "folder10")

 var rawData = sc.textFile(folders(0)).map(x => (x.split("\t")(0),
 x.split("\t")(1)))

 for (a <- 1 to folders.length - 1) {
   rawData = rawData.union(sc.textFile(folders(a)).map(x =>
 (x.split("\t")(0), x.split("\t")(1))))
 }

 val nodups = rawData.reduceByKey((a,b) =>
 {
   if (a.length > b.length)
   {a}
   else
   {b}
   }
 )
 nodups.saveAsTextFile("/nodups")

 Anything I could do to make this process faster?   Right now my process
 dies
 when output the data to the HDFS.


 Thank you !



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/What-is-most-efficient-to-do-a-large-union-and-remove-duplicates-tp23303.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Josh Rosen
Try using Spark 1.4.0 with SQL code generation turned on; this should make
a huge difference.
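
For reference, a hedged sketch of turning that on in 1.4.x (to the best of my
recollection the flag was spark.sql.codegen; in the spark-sql CLI the
equivalent is a SET command before running the query):

    // Run once on the long-lived SQLContext before issuing the query.
    sqlContext.setConf("spark.sql.codegen", "true")
    sqlContext.sql("select distinct isr, event_dt, age, age_cod, sex, year, quarter from aers.aers_demo_view")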

On Sat, Jun 13, 2015 at 5:08 PM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com wrote:

 hey guys

 I tried the following settings as well. No luck

 --total-executor-cores 24 --executor-memory 4G


 BTW on the same cluster , impala absolutely kills it. same query 9
 seconds. no memory issues. no issues.

 In fact I am pretty disappointed with Spark-SQL.
 I have worked with Hive during the 0.9.x stages and taken projects to
 production successfully and Hive actually very rarely craps out.

 Whether the spark folks like what I say or not, yes my expectations are
 pretty high of Spark-SQL if I were to change the ways we are doing things
 at my workplace.
 Until that time, we are going to be hugely dependent on Impala and
  Hive(with SSD speeding up the shuffle stage , even MR jobs are not that
 slow now).

 I want to clarify for those of u who may be asking - why I am not using
 spark with Scala and insisting on using spark-sql ?

 - I have already pipelined data from enterprise tables to Hive
 - I am using CDH 5.3.3 (Cloudera starving developers version)
 - I have close to 300 tables defined in Hive external tables.
 - Data is on HDFS
 - On an average we have 150 columns per table
 - On an everyday basis, we do crazy amounts of ad-hoc joining of new and
 old tables in getting datasets ready for supervised ML
 - I thought that quite simply I can point Spark to the Hive meta and do
 queries as I do - in fact the existing queries would work as is unless I am
 using some esoteric Hive/Impala function

 Anyway, if there are some settings I can use and get spark-sql to run even
 on standalone mode that will be huge help.

 On the pre-production cluster I have spark on YARN but could never get it
 to run fairly complex queries, and I have no answers from this group or the
 CDH groups.

 So my assumption is that its possibly not solved , else I have always got
 very quick answers and responses :-) to my questions on all CDH groups,
 Spark, Hive

 best regards

 sanjay



   --
  *From:* Josh Rosen rosenvi...@gmail.com
 *To:* Sanjay Subramanian sanjaysubraman...@yahoo.com
 *Cc:* user@spark.apache.org user@spark.apache.org
 *Sent:* Friday, June 12, 2015 7:15 AM
 *Subject:* Re: spark-sql from CLI ---EXCEPTION:
 java.lang.OutOfMemoryError: Java heap space

 It sounds like this might be caused by a memory configuration problem.  In
 addition to looking at the executor memory, I'd also bump up the driver
 memory, since it appears that your shell is running out of memory when
 collecting a large query result.

 Sent from my phone



 On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian 
 sanjaysubraman...@yahoo.com.INVALID wrote:

 hey guys

 Using Hive and Impala daily intensively.
 Want to transition to spark-sql in CLI mode

 Currently in my sandbox I am using the Spark (standalone mode) in the CDH
 distribution (starving developer version 5.3.3)
 3 datanode hadoop cluster
 32GB RAM per node
 8 cores per node

 spark
 1.2.0+cdh5.3.3+371


 I am testing some stuff on one view and getting memory errors
 Possibly reason is default memory per executor showing on 18080 is
 512M

 These options when used to start the spark-sql CLI does not seem to have
 any effect
 --total-executor-cores 12 --executor-memory 4G



 /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e  select distinct
 isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view

 aers.aers_demo_view (7 million+ records)
 ===
 isr bigint  case id
 event_dtbigint  Event date
 age double  age of patient
 age_cod string  days,months years
 sex string  M or F
 yearint
 quarter int


 VIEW DEFINITION
 
 CREATE VIEW `aers.aers_demo_view` AS SELECT `isr` AS `isr`, `event_dt` AS
 `event_dt`, `age` AS `age`, `age_cod` AS `age_cod`, `gndr_cod` AS `sex`,
 `year` AS `year`, `quarter` AS `quarter` FROM (SELECT
`aers_demo_v1`.`isr`,
`aers_demo_v1`.`event_dt`,
`aers_demo_v1`.`age`,
`aers_demo_v1`.`age_cod`,
`aers_demo_v1`.`gndr_cod`,
`aers_demo_v1`.`year`,
`aers_demo_v1`.`quarter`
 FROM
   `aers`.`aers_demo_v1`
 UNION ALL
 SELECT
`aers_demo_v2`.`isr`,
`aers_demo_v2`.`event_dt`,
`aers_demo_v2`.`age`,
`aers_demo_v2`.`age_cod`,
`aers_demo_v2`.`gndr_cod`,
`aers_demo_v2`.`year`,
`aers_demo_v2`.`quarter`
 FROM
   `aers`.`aers_demo_v2`
 UNION ALL
 SELECT
`aers_demo_v3`.`isr`,
`aers_demo_v3`.`event_dt`,
`aers_demo_v3`.`age`,
`aers_demo_v3`.`age_cod`,
`aers_demo_v3`.`gndr_cod`,
`aers_demo_v3`.`year`,
`aers_demo_v3`.`quarter`
 FROM
   `aers`.`aers_demo_v3`
 UNION ALL
 SELECT
`aers_demo_v4`.`isr`,
`aers_demo_v4`.`event_dt`,
`aers_demo_v4`.`age`,
`aers_demo_v4`.`age_cod`,
`aers_demo_v4`.`gndr_cod`,
`aers_demo_v4`.`year`,
`aers_demo_v4`.`quarter`
 FROM
   `aers`.`aers_demo_v4`
 UNION ALL
 SELECT

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen


Sent from my phone

 On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian 
 sanjaysubraman...@yahoo.com.INVALID wrote:
 
 hey guys
 
 Using Hive and Impala daily intensively.
 Want to transition to spark-sql in CLI mode
 
 Currently in my sandbox I am using the Spark (standalone mode) in the CDH 
 distribution (starving developer version 5.3.3)
 3 datanode hadoop cluster
 32GB RAM per node
 8 cores per node
 
 spark 
 1.2.0+cdh5.3.3+371
 
 
 I am testing some stuff on one view and getting memory errors
 Possibly reason is default memory per executor showing on 18080 is 
 512M
 
 These options when used to start the spark-sql CLI does not seem to have any 
 effect 
 --total-executor-cores 12 --executor-memory 4G
 
 
 
 /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e  select distinct 
 isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view
 
 aers.aers_demo_view (7 million+ records)
 ===
 isr bigint  case id
 event_dtbigint  Event date
 age double  age of patient
 age_cod string  days,months years
 sex string  M or F
 yearint
 quarter int
 
 
 VIEW DEFINITION
 
 CREATE VIEW `aers.aers_demo_view` AS SELECT `isr` AS `isr`, `event_dt` AS 
 `event_dt`, `age` AS `age`, `age_cod` AS `age_cod`, `gndr_cod` AS `sex`, 
 `year` AS `year`, `quarter` AS `quarter` FROM (SELECT
`aers_demo_v1`.`isr`,
`aers_demo_v1`.`event_dt`,
`aers_demo_v1`.`age`,
`aers_demo_v1`.`age_cod`,
`aers_demo_v1`.`gndr_cod`,
`aers_demo_v1`.`year`,
`aers_demo_v1`.`quarter`
 FROM
   `aers`.`aers_demo_v1`
 UNION ALL
 SELECT
`aers_demo_v2`.`isr`,
`aers_demo_v2`.`event_dt`,
`aers_demo_v2`.`age`,
`aers_demo_v2`.`age_cod`,
`aers_demo_v2`.`gndr_cod`,
`aers_demo_v2`.`year`,
`aers_demo_v2`.`quarter`
 FROM
   `aers`.`aers_demo_v2`
 UNION ALL
 SELECT
`aers_demo_v3`.`isr`,
`aers_demo_v3`.`event_dt`,
`aers_demo_v3`.`age`,
`aers_demo_v3`.`age_cod`,
`aers_demo_v3`.`gndr_cod`,
`aers_demo_v3`.`year`,
`aers_demo_v3`.`quarter`
 FROM
   `aers`.`aers_demo_v3`
 UNION ALL
 SELECT
`aers_demo_v4`.`isr`,
`aers_demo_v4`.`event_dt`,
`aers_demo_v4`.`age`,
`aers_demo_v4`.`age_cod`,
`aers_demo_v4`.`gndr_cod`,
`aers_demo_v4`.`year`,
`aers_demo_v4`.`quarter`
 FROM
   `aers`.`aers_demo_v4`
 UNION ALL
 SELECT
`aers_demo_v5`.`primaryid` AS `ISR`,
`aers_demo_v5`.`event_dt`,
`aers_demo_v5`.`age`,
`aers_demo_v5`.`age_cod`,
`aers_demo_v5`.`gndr_cod`,
`aers_demo_v5`.`year`,
`aers_demo_v5`.`quarter`
 FROM
   `aers`.`aers_demo_v5`
 UNION ALL
 SELECT
`aers_demo_v6`.`primaryid` AS `ISR`,
`aers_demo_v6`.`event_dt`,
`aers_demo_v6`.`age`,
`aers_demo_v6`.`age_cod`,
`aers_demo_v6`.`sex` AS `GNDR_COD`,
`aers_demo_v6`.`year`,
`aers_demo_v6`.`quarter`
 FROM
   `aers`.`aers_demo_v6`) `aers_demo_view`
 
 
 
 
 
 
 
 15/06/11 08:36:36 WARN DefaultChannelPipeline: An exception was thrown by a 
 user handler while handling an exception event ([id: 0x01b99855, 
 /10.0.0.19:58117 = /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: 
 Java heap space)
 java.lang.OutOfMemoryError: Java heap space
 at 
 org.jboss.netty.buffer.HeapChannelBuffer.init(HeapChannelBuffer.java:42)
 at 
 org.jboss.netty.buffer.BigEndianHeapChannelBuffer.init(BigEndianHeapChannelBuffer.java:34)
 at 
 org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
 at 
 org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
 at 
 org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
 at 
 org.jboss.netty.handler.codec.frame.FrameDecoder.newCumulationBuffer(FrameDecoder.java:507)
 at 
 org.jboss.netty.handler.codec.frame.FrameDecoder.updateCumulation(FrameDecoder.java:345)
 at 
 org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:312)
 at 
 org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
 at 
 org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
 at 
 org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
 at 
 org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/06/11 08:36:40 ERROR Utils: Uncaught exception in thread 
 task-result-getter-0
 java.lang.OutOfMemoryError: GC 

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
It sounds like this might be caused by a memory configuration problem.  In 
addition to looking at the executor memory, I'd also bump up the driver memory, 
since it appears that your shell is running out of memory when collecting a 
large query result.
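
For illustration only (not a command from this thread, and the memory values are placeholders), passing a larger driver heap to the spark-sql CLI would look roughly like this:

  /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql --driver-memory 8G \
    --executor-memory 4G --total-executor-cores 12 \
    -e "select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view"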

Sent from my phone

 On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian 
 sanjaysubraman...@yahoo.com.INVALID wrote:
 
 hey guys
 
 Using Hive and Impala daily intensively.
 Want to transition to spark-sql in CLI mode
 
 Currently in my sandbox I am using the Spark (standalone mode) in the CDH 
 distribution (starving developer version 5.3.3)
 3 datanode hadoop cluster
 32GB RAM per node
 8 cores per node
 
 spark 
 1.2.0+cdh5.3.3+371
 
 
 I am testing some stuff on one view and getting memory errors.
 A possible reason is that the default memory per executor shown on 18080 is 
 512M.
 
 These options, when used to start the spark-sql CLI, do not seem to have any 
 effect: 
 --total-executor-cores 12 --executor-memory 4G
 
 
 
 /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e "select distinct 
 isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view"
 
 aers.aers_demo_view (7 million+ records)
 ===
 isr      bigint  case id
 event_dt bigint  Event date
 age      double  age of patient
 age_cod  string  days,months years
 sex      string  M or F
 year     int
 quarter  int
 
 
 VIEW DEFINITION
 
 CREATE VIEW `aers.aers_demo_view` AS SELECT `isr` AS `isr`, `event_dt` AS 
 `event_dt`, `age` AS `age`, `age_cod` AS `age_cod`, `gndr_cod` AS `sex`, 
 `year` AS `year`, `quarter` AS `quarter` FROM (SELECT
`aers_demo_v1`.`isr`,
`aers_demo_v1`.`event_dt`,
`aers_demo_v1`.`age`,
`aers_demo_v1`.`age_cod`,
`aers_demo_v1`.`gndr_cod`,
`aers_demo_v1`.`year`,
`aers_demo_v1`.`quarter`
 FROM
   `aers`.`aers_demo_v1`
 UNION ALL
 SELECT
`aers_demo_v2`.`isr`,
`aers_demo_v2`.`event_dt`,
`aers_demo_v2`.`age`,
`aers_demo_v2`.`age_cod`,
`aers_demo_v2`.`gndr_cod`,
`aers_demo_v2`.`year`,
`aers_demo_v2`.`quarter`
 FROM
   `aers`.`aers_demo_v2`
 UNION ALL
 SELECT
`aers_demo_v3`.`isr`,
`aers_demo_v3`.`event_dt`,
`aers_demo_v3`.`age`,
`aers_demo_v3`.`age_cod`,
`aers_demo_v3`.`gndr_cod`,
`aers_demo_v3`.`year`,
`aers_demo_v3`.`quarter`
 FROM
   `aers`.`aers_demo_v3`
 UNION ALL
 SELECT
`aers_demo_v4`.`isr`,
`aers_demo_v4`.`event_dt`,
`aers_demo_v4`.`age`,
`aers_demo_v4`.`age_cod`,
`aers_demo_v4`.`gndr_cod`,
`aers_demo_v4`.`year`,
`aers_demo_v4`.`quarter`
 FROM
   `aers`.`aers_demo_v4`
 UNION ALL
 SELECT
`aers_demo_v5`.`primaryid` AS `ISR`,
`aers_demo_v5`.`event_dt`,
`aers_demo_v5`.`age`,
`aers_demo_v5`.`age_cod`,
`aers_demo_v5`.`gndr_cod`,
`aers_demo_v5`.`year`,
`aers_demo_v5`.`quarter`
 FROM
   `aers`.`aers_demo_v5`
 UNION ALL
 SELECT
`aers_demo_v6`.`primaryid` AS `ISR`,
`aers_demo_v6`.`event_dt`,
`aers_demo_v6`.`age`,
`aers_demo_v6`.`age_cod`,
`aers_demo_v6`.`sex` AS `GNDR_COD`,
`aers_demo_v6`.`year`,
`aers_demo_v6`.`quarter`
 FROM
   `aers`.`aers_demo_v6`) `aers_demo_view`
 
 
 
 
 
 
 
 15/06/11 08:36:36 WARN DefaultChannelPipeline: An exception was thrown by a 
 user handler while handling an exception event ([id: 0x01b99855, 
 /10.0.0.19:58117 = /10.0.0.19:52016] EXCEPTION: java.lang.OutOfMemoryError: 
 Java heap space)
 java.lang.OutOfMemoryError: Java heap space
 at 
 org.jboss.netty.buffer.HeapChannelBuffer.init(HeapChannelBuffer.java:42)
 at 
 org.jboss.netty.buffer.BigEndianHeapChannelBuffer.init(BigEndianHeapChannelBuffer.java:34)
 at 
 org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
 at 
 org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
 at 
 org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
 at 
 org.jboss.netty.handler.codec.frame.FrameDecoder.newCumulationBuffer(FrameDecoder.java:507)
 at 
 org.jboss.netty.handler.codec.frame.FrameDecoder.updateCumulation(FrameDecoder.java:345)
 at 
 org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:312)
 at 
 org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
 at 
 org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
 at 
 org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
 at 
 org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403



On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote:

 Is it possible to configure Spark to do all of its shuffling FULLY in
 memory (given that I have enough memory to store all the data)?






Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Josh Rosen
Ah, interesting.  While working on my new Tungsten shuffle manager, I came
up with some nice testing interfaces for allowing me to manually trigger
spills in order to deterministically test those code paths without
requiring large amounts of data to be shuffled.  Maybe I could make similar
test interface changes to the existing shuffle code, which might make it
easier to reproduce this in an isolated environment.

On Mon, Jun 1, 2015 at 11:41 PM, Igor Berman igor.ber...@gmail.com wrote:

 Hi,
 small mock data doesn't reproduce the problem. IMHO problem is reproduced
 when we make shuffle big enough to split data into disk.
 We will work on it to understand and reproduce the problem(not first
 priority though...)


 On 1 June 2015 at 23:02, Josh Rosen rosenvi...@gmail.com wrote:

 How much work is to produce a small standalone reproduction?  Can you
 create an Avro file with some mock data, maybe 10 or so records, then
 reproduce this locally?

 On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman igor.ber...@gmail.com
 wrote:

 Switching to simple POJOs instead of Avro for Spark
 serialization solved the problem (I mean reading Avro from S3 and then
 mapping each Avro object to its serializable POJO counterpart with the same
 fields; the POJO is registered with Kryo).
 Any thoughts on where to look for a problem/misconfiguration?

 On 31 May 2015 at 22:48, Igor Berman igor.ber...@gmail.com wrote:

 Hi
 We are using spark 1.3.1
 Avro-chill (tomorrow will check if it's important); we register avro
 classes from java
 Avro 1.7.6
 On May 31, 2015 22:37, Josh Rosen rosenvi...@gmail.com wrote:

 Which Spark version are you using?  I'd like to understand whether
 this change could be caused by recent Kryo serializer re-use changes in
 master / Spark 1.4.

 On Sun, May 31, 2015 at 11:31 AM, igor.berman igor.ber...@gmail.com
 wrote:

 after investigation the problem is somehow connected to avro
 serialization
 with kryo + chill-avro(mapping avro object to simple scala case class
 and
 running reduce on these case class objects solves the problem)




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-02 Thread Josh Rosen
My suggestion is that you change the Spark setting which controls the
compression codec that Spark uses for internal data transfers. Set
spark.io.compression.codec
to lzf in your SparkConf.
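
A minimal sketch of that change, assuming an ordinary SparkConf-based setup (the app name is a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("lzf-codec-example")              // placeholder name
    .set("spark.io.compression.codec", "lzf")     // use LZF instead of the default Snappy codec
  val sc = new SparkContext(conf)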

On Mon, Jun 1, 2015 at 8:46 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Hello Josh,
 Are you suggesting to store the source data in LZF compression and use the
 same Spark code as is ?
 Currently its stored in sequence file format and compressed with GZIP.

 First line of the data:

 (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'
 org.apache.hadoop.io.compress.GzipCodec?v?
 )

 Regards,
 Deepak

 On Tue, Jun 2, 2015 at 4:16 AM, Josh Rosen rosenvi...@gmail.com wrote:

 If you can't run a patched Spark version, then you could also consider
 using LZF compression instead, since that codec isn't affected by this bug.

 On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote:

 Hi Deepak,

 This is a notorious bug that is being tracked at
 https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one
 source of this bug (it turns out Snappy had a bug in buffer reuse that
 caused data corruption). There are other known sources that are being
 addressed in outstanding patches currently.

 Since you're using 1.3.1 my guess is that you don't have this patch:
 https://github.com/apache/spark/pull/6176, which I believe should fix
 the issue in your case. It's merged for 1.3.2 (not yet released) but not in
 time for 1.3.1, so feel free to patch it yourself and see if it works.

 -Andrew


 2015-06-01 8:00 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Any suggestions ?

  I am using Spark 1.3.1 to read a sequence file stored in SequenceFile
  format
 (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
 )

 with this code and settings
 sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
 org.apache.spark.HashPartitioner(2053))
 .set(spark.serializer, org.apache.spark.serializer.KryoSerializer)
   .set(spark.kryoserializer.buffer.mb,
 arguments.get(buffersize).get)
   .set(spark.kryoserializer.buffer.max.mb,
 arguments.get(maxbuffersize).get)
   .set(spark.driver.maxResultSize,
 arguments.get(maxResultSize).get)
   .set(spark.yarn.maxAppAttempts, 0)
   //.set(spark.akka.askTimeout, arguments.get(askTimeout).get)
   //.set(spark.akka.timeout, arguments.get(akkaTimeout).get)
   //.set(spark.worker.timeout, arguments.get(workerTimeout).get)

 .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))


 and values are
 buffersize=128 maxbuffersize=1068 maxResultSize=200G


  And I see this exception in each executor task

 FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
 54757), shuffleId=6, mapId=2810, reduceId=1117, message=

 org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)

 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)

 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

 at
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)

 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)

 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)

 at
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)

 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)

 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

 at org.apache.spark.scheduler.Task.run(Task.scala:64)

 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)

 *Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*

 at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)

 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)

 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)

 at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)

 at
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Josh Rosen
How much work is to produce a small standalone reproduction?  Can you
create an Avro file with some mock data, maybe 10 or so records, then
reproduce this locally?

On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman igor.ber...@gmail.com wrote:

  Switching to simple POJOs instead of Avro for Spark
  serialization solved the problem (I mean reading Avro from S3 and then
  mapping each Avro object to its serializable POJO counterpart with the same
  fields; the POJO is registered with Kryo).
  Any thoughts on where to look for a problem/misconfiguration?

 On 31 May 2015 at 22:48, Igor Berman igor.ber...@gmail.com wrote:

 Hi
 We are using spark 1.3.1
  Avro-chill (tomorrow will check if it's important); we register avro
 classes from java
 Avro 1.7.6
 On May 31, 2015 22:37, Josh Rosen rosenvi...@gmail.com wrote:

 Which Spark version are you using?  I'd like to understand whether this
 change could be caused by recent Kryo serializer re-use changes in master /
 Spark 1.4.

 On Sun, May 31, 2015 at 11:31 AM, igor.berman igor.ber...@gmail.com
 wrote:

 after investigation the problem is somehow connected to avro
 serialization
 with kryo + chill-avro(mapping avro object to simple scala case class
 and
 running reduce on these case class objects solves the problem)




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Josh Rosen
If you can't run a patched Spark version, then you could also consider
using LZF compression instead, since that codec isn't affected by this bug.

On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote:

 Hi Deepak,

 This is a notorious bug that is being tracked at
 https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one
 source of this bug (it turns out Snappy had a bug in buffer reuse that
 caused data corruption). There are other known sources that are being
 addressed in outstanding patches currently.

 Since you're using 1.3.1 my guess is that you don't have this patch:
 https://github.com/apache/spark/pull/6176, which I believe should fix the
 issue in your case. It's merged for 1.3.2 (not yet released) but not in
 time for 1.3.1, so feel free to patch it yourself and see if it works.

 -Andrew


 2015-06-01 8:00 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Any suggestions ?

  I am using Spark 1.3.1 to read a sequence file stored in SequenceFile
  format
 (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
 )

 with this code and settings
 sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
 org.apache.spark.HashPartitioner(2053))
 .set(spark.serializer, org.apache.spark.serializer.KryoSerializer)
   .set(spark.kryoserializer.buffer.mb,
 arguments.get(buffersize).get)
   .set(spark.kryoserializer.buffer.max.mb,
 arguments.get(maxbuffersize).get)
   .set(spark.driver.maxResultSize,
 arguments.get(maxResultSize).get)
   .set(spark.yarn.maxAppAttempts, 0)
   //.set(spark.akka.askTimeout, arguments.get(askTimeout).get)
   //.set(spark.akka.timeout, arguments.get(akkaTimeout).get)
   //.set(spark.worker.timeout, arguments.get(workerTimeout).get)

 .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))


 and values are
 buffersize=128 maxbuffersize=1068 maxResultSize=200G


  And I see this exception in each executor task

 FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
 54757), shuffleId=6, mapId=2810, reduceId=1117, message=

 org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)

 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)

 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

 at
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)

 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)

 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)

 at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)

 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)

 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

 at org.apache.spark.scheduler.Task.run(Task.scala:64)

 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)

 *Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*

 at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)

 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)

 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)

 at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)

 at
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)

 at
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)

 at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)

 at
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)

 at
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165)

 at
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301)

 at
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300)

 at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)

 at scala.util.Try$.apply(Try.scala:161)

 at 

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Josh Rosen
I don't think that 0.9.3 has been released, so I'm assuming that you're
running on branch-0.9.

There's been over 4000 commits between 0.9.3 and 1.3.1, so I'm afraid that
this question doesn't have a concise answer:
https://github.com/apache/spark/compare/branch-0.9...v1.3.1

To narrow down the potential causes, have you tried comparing 0.9.3 to,
say, 1.0.2 or branch-1.0, or some other version that's closer to 0.9?

On Fri, May 22, 2015 at 9:43 AM, Shay Seng s...@urbanengines.com wrote:

 Hi.
 I have a job that takes
 ~50min with Spark 0.9.3 and
 ~1.8hrs on Spark 1.3.1 on the same cluster.

  The only code difference between the two code bases is to fix the Seq ->
  Iter changes that happened in the Spark 1.x series.

 Are there any other changes in the defaults from spark 0.9.3 - 1.3.1 that
 would cause such a large degradation in performance? Changes in
 partitioning algorithms, scheduling etc?

 shay




Re: Does long-lived SparkContext hold on to executor resources?

2015-05-12 Thread Josh Rosen
I would be cautious regarding use of spark.cleaner.ttl, as it can lead to
confusing error messages if time-based cleaning deletes resources that are
still needed.  See my comment at
https://issues.apache.org/jira/browse/SPARK-5594?focusedCommentId=14486034page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14486034

On Mon, May 11, 2015 at 10:17 AM, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

  Also check out the spark.cleaner.ttl property. Otherwise, you will
 accumulate shuffle metadata in the memory of the driver.



 Sent with Good (www.good.com)



 -Original Message-
 *From: *Silvio Fiorito [silvio.fior...@granturing.com]
 *Sent: *Monday, May 11, 2015 01:03 PM Eastern Standard Time
 *To: *stanley; user@spark.apache.org
 *Subject: *Re: Does long-lived SparkContext hold on to executor resources?

 You want to look at dynamic resource allocation, here:
 http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
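
  As an illustrative sketch only (the executor bounds are placeholders, not settings from this thread), the relevant configuration looks roughly like:

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")    // placeholder bounds
      .set("spark.dynamicAllocation.maxExecutors", "10")
      .set("spark.shuffle.service.enabled", "true")        // dynamic allocation needs the external shuffle service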





 On 5/11/15, 11:23 AM, stanley wangshua...@yahoo.com wrote:

 I am building an analytics app with Spark. I plan to use long-lived
 SparkContexts to minimize the overhead for creating Spark contexts, which
 in
 turn reduces the analytics query response time.
 
 The number of queries that are run in the system is relatively small each
 day. Would long lived contexts hold on to the executor resources when
 there
 is no queries running? Is there a way to free executor resources in this
 type of use cases?
 
 
 
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Does-long-lived-SparkContext-hold-on-to-executor-resources-tp22848.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed.  If the reader of this message is not the
 intended recipient, you are hereby notified that any review,
 retransmission, dissemination, distribution, copying or other use of, or
 taking of any action in reliance upon this information is strictly
 prohibited. If you have received this communication in error, please
 contact the sender and delete the material from your computer.



Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share?  I'm
curious to know where AppendOnlyMap.changeValue is being called from.

On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com
wrote:

 +dev
 On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

  Just wanted to check if somebody has seen similar behaviour or knows what
  we might be doing wrong. We have a relatively complex spark application
  which processes half a terabyte of data at various stages. We have
 profiled
  it in several ways and everything seems to point to one place where 90%
 of
  the time is spent:  AppendOnlyMap.changeValue. The job scales and is
  relatively faster than its map-reduce alternative but it still feels
 slower
  than it should be. I am suspecting too much spill but I haven't seen any
  improvement by increasing number of partitions to 10k. Any idea would be
  appreciated.
 
  --
  Michal Haris
  Technical Architect
  direct line: +44 (0) 207 749 0229
  www.visualdna.com | t: +44 (0) 207 734 7033,
 



Python 3 support for PySpark has been merged into master

2015-04-16 Thread Josh Rosen
Hi everyone,

We just merged Python 3 support for PySpark into Spark's master branch
(which will become Spark 1.4.0).  This means that PySpark now supports
Python 2.6+, PyPy 2.5+, and Python 3.4+.

To run with Python 3, download and build Spark from the master branch then
configure the PYSPARK_PYTHON environment variable to point to a Python 3.4
executable.  For example:

PYSPARK_PYTHON=python3.4 ./bin/pyspark


For more details on this feature, see the pull request and JIRA:

- https://github.com/apache/spark/pull/5173
- https://issues.apache.org/jira/browse/SPARK-4897

For Spark contributors, this change means that any open PySpark pull
requests are now likely to have merge conflicts.  If a pull request does
not have merge conflicts, we should still re-test it with Jenkins to check
that it still works under Python 3.  When backporting Python patches,
committers may wish to run the PySpark unit tests locally to make sure that
  the change still works correctly in older branches.  I can also help with
backports / fixing conflicts.

Thanks to Davies Liu, Shane Knapp, Thom Neale, Xiangrui Meng, and everyone
else who helped with this patch.

- Josh


Re: A problem with Spark 1.3 artifacts

2015-04-06 Thread Josh Rosen
My hunch is that this behavior was introduced by a patch to start shading
Jetty in Spark 1.3: https://issues.apache.org/jira/browse/SPARK-3996.

Note that Spark's *MetricsSystem* class is marked as *private[spark]* and
thus isn't intended to be interacted with directly by users.  It's not
super likely that this API would break, but it's excluded from our MiMa
checks and thus is liable to change in incompatible ways across releases.

If you add these Jetty classes as a compile-only dependency but don't add
them to the runtime classpath, do you get runtime errors?  If the metrics
system is usable at runtime and we only have errors when attempting to
compile user code against non-public APIs, then I'm not sure that this is a
  high-priority issue to fix.  If the metrics system doesn't work at
runtime, on the other hand, then that's definitely a bug that should be
fixed.
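
For reference, one common way to express such a compile-only dependency in sbt is the "provided" scope (a sketch only, reusing the Jetty version quoted below); provided dependencies are on the compile classpath but are not packaged with the application:

  libraryDependencies += "org.eclipse.jetty" % "jetty-server"  % "8.1.14.v20131031" % "provided"
  libraryDependencies += "org.eclipse.jetty" % "jetty-servlet" % "8.1.14.v20131031" % "provided"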

If you'd like to continue debugging this issue, I think we should move this
discussion over to JIRA so it's easier to track and reference.

Hope this helps,
Josh

On Thu, Apr 2, 2015 at 7:34 AM, Jacek Lewandowski 
jacek.lewandow...@datastax.com wrote:

  A very simple example which works well with Spark 1.2, and fails to compile
 with Spark 1.3:

 build.sbt:

  name := "untitled"
  version := "1.0"
  scalaVersion := "2.10.4"
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

 Test.scala:

 package org.apache.spark.metrics
 import org.apache.spark.SparkEnv
 class Test {
   SparkEnv.get.metricsSystem.report()
 }

 Produces:

 Error:scalac: bad symbolic reference. A signature in MetricsSystem.class
 refers to term eclipse
 in package org which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling
 MetricsSystem.class.

 Error:scalac: bad symbolic reference. A signature in MetricsSystem.class
 refers to term jetty
 in value org.eclipse which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling
 MetricsSystem.class.

 This looks like something wrong with shading jetty.
 MetricsSystem references MetricsServlet which references some classes from
  Jetty, in the original package instead of the shaded one.
 adding the following dependencies solves the problem:

  libraryDependencies += "org.eclipse.jetty" % "jetty-server" %
  "8.1.14.v20131031"
  libraryDependencies += "org.eclipse.jetty" % "jetty-servlet" %
  "8.1.14.v20131031"

 Is it intended or is it a bug?


 Thanks !


 Jacek





Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Josh Rosen
We (Databricks) use our own DirectOutputCommitter implementation, which is
a couple tens of lines of Scala code.  The class would almost entirely be a
no-op except we took some care to properly handle the _SUCCESS file.
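
For a rough idea of the shape of such a class, here is an illustrative sketch (not the Databricks implementation, and it omits the _SUCCESS handling mentioned above):

  import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

  // Task output is written directly to its final location, so the commit
  // step never has to move/rename files (which is slow on S3).
  class DirectOutputCommitter extends OutputCommitter {
    override def setupJob(jobContext: JobContext): Unit = {}
    override def setupTask(taskContext: TaskAttemptContext): Unit = {}
    override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
    override def commitTask(taskContext: TaskAttemptContext): Unit = {}
    override def abortTask(taskContext: TaskAttemptContext): Unit = {}
  }

A committer like this would typically be wired in through the Hadoop job configuration (e.g. mapred.output.committer.class), but treat the sketch as a starting point rather than a drop-in replacement.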

On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim m...@palantir.com wrote:

  I didn’t get any response. It’d be really appreciated if anyone using a
 special OutputCommitter for S3 can comment on this!

  Thanks,
 Mingyu

   From: Mingyu Kim m...@palantir.com
 Date: Monday, February 16, 2015 at 1:15 AM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Which OutputCommitter to use for S3?

   HI all,

  The default OutputCommitter used by RDD, which is FileOutputCommitter,
 seems to require moving files at the commit step, which is not a constant
 operation in S3, as discussed in
  http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E.
 People seem to develop their own NullOutputCommitter implementation or use
 DirectFileOutputCommitter (as mentioned in SPARK-3595
  https://issues.apache.org/jira/browse/SPARK-3595),
 but I wanted to check if there is a de facto standard, publicly available
 OutputCommitter to use for S3 in conjunction with Spark.

  Thanks,
 Mingyu



Re: spark-shell has syntax error on windows.

2015-01-23 Thread Josh Rosen
Do you mind filing a JIRA issue for this which includes the actual error
message string that you saw?  https://issues.apache.org/jira/browse/SPARK

On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 I am not sure if you get the same exception as I do -- spark-shell2.cmd
 works fine for me. Windows 7 as well. I've never bothered looking to fix it
 as it seems spark-shell just calls spark-shell2 anyway...

 On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com
  wrote:

  I have a problem running spark-shell on Windows 7. I took the
  following
  steps:

 1. downloaded and installed Scala 2.11.5
 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git
  3. ran dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests
 clean
 package (in git bash)

  After installation I tried to run spark-shell.cmd in a cmd shell and it says
  there is a syntax error in the file. What can I do to fix the problem?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-has-syntax-error-on-windows-tp21313.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Josh Rosen
This looks like a bug in the master branch of Spark, related to some recent
changes to EventLoggingListener.  You can reproduce this bug on a fresh
Spark checkout by running

./bin/spark-shell --conf spark.eventLog.enabled=true --conf
spark.eventLog.dir=/tmp/nonexistent-dir

where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp
exists.

It looks like older versions of EventLoggingListener would create the
directory if it didn't exist.  I think the issue here is that the
error-checking code is overzealous and catches some non-error conditions,
too; I've filed https://issues.apache.org/jira/browse/SPARK-5311 to
investigate this.

On Sun, Jan 18, 2015 at 1:59 PM, Ganon Pierce ganon.pie...@me.com wrote:

 I posted about the Application WebUI error (specifically application WebUI
 not the master WebUI generally) and have spent at least a few hours a day
 for over week trying to resolve it so I’d be very grateful for any
 suggestions. It is quite troubling that I appear to be the only one
 encountering this issue and I’ve tried to include everything here which
  might be relevant (sorry for the length). Please see the thread “Current
  Build Gives HTTP ERROR”
 https://www.mail-archive.com/user@spark.apache.org/msg18752.html to see
 specifics about the application webUI issue and the master log.


 Environment:

 I’m doing my spark builds and application programming in scala locally on
 my macbook pro in eclipse, using modified ec2 launch scripts to launch my
 cluster, uploading my spark builds and models to s3, and uploading
 applications to and submitting them from ec2. I’m using java 8 locally and
 also installing and using java 8 on my ec2 instances (which works with
 spark 1.2.0). I have a windows machine at home (macbook is work machine),
 but have not yet attempted to launch from there.


 Errors:

 I’ve built two different recent git versions of spark both multiple times,
 and when running applications both have produced an Application WebUI error
 and this exception:

 Exception in thread main java.lang.IllegalArgumentException: Log
 directory /tmp/spark-events does not exist.

 While both will display the master webUI just fine including
 running/completed applications, registered workers etc, when I try to
 access a running or completed application’s WebUI by clicking their
 respective link, I receive a server error. When I manually create the above
 log directory, the exception goes away, but the WebUI problem does not. I
 don’t have any strong evidence, but I suspect these errors and whatever is
 causing them are related.


 Why and How of Modifications to Launch Scripts for Installation of
 Unreleased Spark Versions:

 When using a prebuilt version of spark on my cluster everything works
 except the new methods I need, which I had previously added to my custom
 version of spark and used by building the spark-assembly.jar locally and
 then replacing the assembly file produced through the 1.1.0 ec2 launch
 scripts. However, since my pull request was accepted and can now be found
 in the apache/spark repository along with some additional features I’d like
 to use and because I’d like a more elegant permanent solution for launching
 a cluster and installing unreleased versions of spark to my ec2 clusters,
 I’ve modified the included ec2 launch scripts in this way (credit to gen
 tang here:
  https://www.mail-archive.com/user@spark.apache.org/msg18761.html):

 1. Clone the most recent git version of spark
 2. Use the make-dist script
 3. Tar the dist folder and upload the resulting
 spark-1.3.0-snapshot-hadoop1.tgz to s3 and change file permissions
 4. Fork the mesos/spark-ec2 repository and modify the spark/init.sh script
 to do a wget of my hosted distribution instead of spark’s stable release
 5. Modify my spark_ec2.py script to point to my repository.
 6. Modify my spark_ec2.py script to install java 8 on my ec2 instances.
 (This works and does not produce the above stated errors when using a
 stable release like 1.2.0).


 Additional Possibly Related Info:

 As far as I can tell (I went through line by line), when I launch my
 recent build vs when I launch the most recent stable release the console
 prints almost identical INFO and WARNINGS except where you would expect
 things to be different e.g. version numbers. I’ve noted that after launch
 the prebuilt stable version does not have a /tmp/spark-events directory,
 but it is created when the application is launched, while it is never
 created in my build. Further, in my unreleased builds the application logs
 that I find are always stored as .inprogress files (when I set the logging
 directory to /root/ or add the /tmp/spark-events directory manually) even
 after completion, which I believe is supposed to change to .completed (or
 something similar) when the application finishes.


 Thanks for any help!




Re: how to run python app in yarn?

2015-01-14 Thread Josh Rosen
There's an open PR for supporting yarn-cluster mode in PySpark:
https://github.com/apache/spark/pull/3976 (currently blocked on reviewer
attention / time)

On Wed, Jan 14, 2015 at 3:16 PM, Marcelo Vanzin van...@cloudera.com wrote:

 As the error message says...

 On Wed, Jan 14, 2015 at 3:14 PM, freedafeng freedaf...@yahoo.com wrote:
  Error: Cluster deploy mode is currently not supported for python
  applications.

 Use yarn-client instead of yarn-cluster for pyspark apps.


 --
 Marcelo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread Josh Rosen
Can you please file a JIRA issue for this?  This will make it easier to
triage this issue.

https://issues.apache.org/jira/browse/SPARK

Thanks,
Josh

On Thu, Jan 8, 2015 at 2:34 AM, frodo777 roberto.vaquer...@bitmonlab.com
wrote:

 Hello everyone.

 With respect to the configuration problem that I explained before 

 Do you have any idea what is wrong there?

 The problem in a nutshell:
 - When more than one master is started in the cluster, all of them are
 scheduling independently, thinking they are all leaders.
 - zookeeper configuration seems to be correct, only one leader is reported.
 The remaining master nodes are followers.
 - Default /spark directory is used for zookeeper.

 Thanks a lot.
 -Bob



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Standalone-Cluster-not-correctly-configured-tp20909p21029.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Shuffle Problems in 1.2.0

2015-01-04 Thread Josh Rosen
(ActorCell.scala:462)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Thank you!
-Sven


On Tue, Dec 30, 2014 at 12:15 PM, Josh Rosen rosenvi...@gmail.com wrote:
Hi Sven,

Do you have a small example program that you can share which will allow me to 
reproduce this issue?  If you have a workload that runs into this, you should 
be able to keep iteratively simplifying the job and reducing the data set size 
until you hit a fairly minimal reproduction (assuming the issue is 
deterministic, which it sounds like it is).

On Tue, Dec 30, 2014 at 9:49 AM, Sven Krasser kras...@gmail.com wrote:
Hey all,

Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails during 
shuffle. I've tried reverting from the sort-based shuffle back to the hash one, 
and that fails as well. Does anyone see similar problems or has an idea on 
where to look next?

For the sort-based shuffle I get a bunch of exception like this in the executor 
logs:


2014-12-30 03:13:04,061 ERROR [Executor task launch worker-2] executor.Executor 
(Logging.scala:logError(96)) - Exception in task 4523.0 in stage 1.0 (TID 4524)
org.apache.spark.SparkException: PairwiseRDD: unexpected value: 
List([B@130dc7ad)
at 
org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:307)
at 
org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:305)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



For the hash-based shuffle, there are now a bunch of these exceptions in the 
logs:


2014-12-30 04:14:01,688 ERROR [Executor task launch worker-0] executor.Executor 
(Logging.scala:logError(96)) - Exception in task 4479.0 in stage 1.0 (TID 4480)
java.io.FileNotFoundException: 
/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1419905501183_0004/spark-local-20141230035728-8fc0/23/merged_shuffle_1_68_0
 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



Thank you!
-Sven



--
http://sites.google.com/site/krasser/?utm_source=sig




--
http://sites.google.com/site/krasser/?utm_source=sig


Re: spark.akka.frameSize limit error

2015-01-04 Thread Josh Rosen
Ah, so I guess this *is* still an issue since we needed to use a bitmap for
tracking zero-sized blocks (see
https://issues.apache.org/jira/browse/SPARK-3740; this isn't just a
performance issue; it's necessary for correctness).  This will require a
bit more effort to fix, since we'll either have to find a way to use a
fixed size / capped size encoding for MapOutputStatuses (which might
require changes to let us fetch empty blocks safely) or figure out some
  other strategy for shipping these statuses.

I've filed https://issues.apache.org/jira/browse/SPARK-5077 to try to come
up with a proper fix.  In the meantime, I recommend that you increase your
Akka frame size.
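
For example (an illustrative sketch; 128 is just a placeholder value, in MB):

  val conf = new SparkConf()
    .set("spark.akka.frameSize", "128")   // raise the frame size from the 10 MB default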

On Sat, Jan 3, 2015 at 8:51 PM, Saeed Shahrivari saeed.shahriv...@gmail.com
 wrote:

 I use the 1.2 version.

 On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote:

 Which version of Spark are you using?  It seems like the issue here is
 that the map output statuses are too large to fit in the Akka frame size.
 This issue has been fixed in Spark 1.2 by using a different encoding for
 map outputs for jobs with many reducers (
 https://issues.apache.org/jira/browse/SPARK-3613).  On earlier Spark
 versions, your options are either reducing the number of reducers (e.g. by
 explicitly specifying the number of reducers in the reduceByKey() call)
 or increasing the Akka frame size (via the spark.akka.frameSize
 configuration option).

 On Sat, Jan 3, 2015 at 10:40 AM, Saeed Shahrivari 
 saeed.shahriv...@gmail.com wrote:

 Hi,

 I am trying to get the frequency of each Unicode char in a document
 collection using Spark. Here is the code snippet that does the job:

  JavaPairRDD<LongWritable, Text> rows = sc.sequenceFile(args[0],
      LongWritable.class, Text.class);
  rows = rows.coalesce(1);

  JavaPairRDD<Character, Long> pairs = rows.flatMapToPair(t -> {
      String content = t._2.toString();
      Multiset<Character> chars = HashMultiset.create();
      for (int i = 0; i < content.length(); i++)
          chars.add(content.charAt(i));
      List<Tuple2<Character, Long>> list = new ArrayList<Tuple2<Character, Long>>();
      for (Character ch : chars.elementSet()) {
          list.add(new Tuple2<Character, Long>(ch, (long) chars.count(ch)));
      }
      return list;
  });

  JavaPairRDD<Character, Long> counts = pairs.reduceByKey((a, b) -> a + b);
  System.out.printf("MapCount %,d\n", counts.count());

 But, I get the following exception:

 15/01/03 21:51:34 ERROR MapOutputTrackerMasterActor: Map output statuses
 were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes).
 org.apache.spark.SparkException: Map output statuses were 11141547 bytes
 which exceeds spark.akka.frameSize (10485760 bytes).
 at

 org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:59)
 at

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
 at

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
 at
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at

 org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at

 org.apache.spark.MapOutputTrackerMasterActor.aroundReceive(MapOutputTracker.scala:42)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

  Would you please tell me where the fault is?
 If I process fewer rows, there is no problem. However, when the number of
 rows is large I always get this exception.

 Thanks beforehand.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-limit-error-tp20955.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user

Re: Repartition Memory Leak

2015-01-04 Thread Josh Rosen
@Brad, I'm guessing that the additional memory usage is coming from the
shuffle performed by coalesce, so that at least explains the memory blowup.
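
For context, a rough sketch of the distinction (not code from this thread): repartition() always shuffles, while coalesce() only avoids a shuffle when it is reducing the number of partitions:

  val more  = rdd.repartition(256)              // same as coalesce(256, shuffle = true); full shuffle
  val fewer = rdd.coalesce(30, shuffle = false) // narrows partitions without a shuffle (decrease only)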

On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You can try:

 - Using KryoSerializer
 - Enabling RDD Compression
 - Setting storage type to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER
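
  A sketch of those three suggestions together (illustrative only; rdd is a placeholder for the loaded dataset):

    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")          // compress serialized partitions
    // ... build rdd as before, then cache it in serialized form:
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)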


 Thanks
 Best Regards

 On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard bradwill...@gmail.com
 wrote:

  I have a 10-node cluster with 600GB of RAM. I'm loading a fairly large
  dataset from JSON files. When I load the dataset it is about 200GB, however
  it only creates 60 partitions. I'm trying to repartition to 256 to
  increase
  CPU utilization; however, when I do that it balloons in memory to way over
  2x
  the initial size, killing nodes with memory failures.


 https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png

 Is this a bug? How can I work around this.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Repartition-Memory-Leak-tp20965.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: spark.akka.frameSize limit error

2015-01-03 Thread Josh Rosen
Which version of Spark are you using?  It seems like the issue here is that
the map output statuses are too large to fit in the Akka frame size.  This
issue has been fixed in Spark 1.2 by using a different encoding for map
outputs for jobs with many reducers (
https://issues.apache.org/jira/browse/SPARK-3613).  On earlier Spark
versions, your options are either reducing the number of reducers (e.g. by
explicitly specifying the number of reducers in the reduceByKey() call) or
increasing the Akka frame size (via the spark.akka.frameSize configuration
option).
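
As a sketch of the first option (written in Scala rather than the Java API used in the quoted code; 200 is a placeholder partition count, and the Java API has the same two-argument overload):

  val counts = pairs.reduceByKey(_ + _, 200)   // an explicit reducer count keeps map output statuses small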

On Sat, Jan 3, 2015 at 10:40 AM, Saeed Shahrivari 
saeed.shahriv...@gmail.com wrote:

 Hi,

 I am trying to get the frequency of each Unicode char in a document
 collection using Spark. Here is the code snippet that does the job:

  JavaPairRDD<LongWritable, Text> rows = sc.sequenceFile(args[0],
      LongWritable.class, Text.class);
  rows = rows.coalesce(1);

  JavaPairRDD<Character, Long> pairs = rows.flatMapToPair(t -> {
      String content = t._2.toString();
      Multiset<Character> chars = HashMultiset.create();
      for (int i = 0; i < content.length(); i++)
          chars.add(content.charAt(i));
      List<Tuple2<Character, Long>> list = new ArrayList<Tuple2<Character, Long>>();
      for (Character ch : chars.elementSet()) {
          list.add(new Tuple2<Character, Long>(ch, (long) chars.count(ch)));
      }
      return list;
  });

  JavaPairRDD<Character, Long> counts = pairs.reduceByKey((a, b) -> a + b);
  System.out.printf("MapCount %,d\n", counts.count());

 But, I get the following exception:

 15/01/03 21:51:34 ERROR MapOutputTrackerMasterActor: Map output statuses
 were 11141547 bytes which exceeds spark.akka.frameSize (10485760 bytes).
 org.apache.spark.SparkException: Map output statuses were 11141547 bytes
 which exceeds spark.akka.frameSize (10485760 bytes).
 at

 org.apache.spark.MapOutputTrackerMasterActor$$anonfun$receiveWithLogging$1.applyOrElse(MapOutputTracker.scala:59)
 at

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
 at

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
 at
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at

 org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at

 org.apache.spark.MapOutputTrackerMasterActor.aroundReceive(MapOutputTracker.scala:42)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

  Would you please tell me where the fault is?
 If I process fewer rows, there is no problem. However, when the number of
 rows is large I always get this exception.

 Thanks beforehand.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-limit-error-tp20955.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: DAG info

2015-01-01 Thread Josh Rosen
This log message is normal; in this case, this message is saying that the
final stage needed to compute your job does not have any dependencies /
parent stages and that there are no parent stages that need to be computed.

On Thu, Jan 1, 2015 at 11:02 PM, shahid sha...@trialx.com wrote:

 hi guys


  I have just started using Spark, and I am getting this as an info message:
 15/01/02 11:54:17 INFO DAGScheduler: Parents of final stage: List()
 15/01/02 11:54:17 INFO DAGScheduler: Missing parents: List()
 15/01/02 11:54:17 INFO DAGScheduler: Submitting Stage 6 (PythonRDD[12] at
 RDD at PythonRDD.scala:43), which has no missing parents

  Also, my program is taking a lot of time to execute.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/DAG-info-tp20940.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using?

On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,
  I get the following exception when I submit a Spark application that
  calculates the frequency of characters in a file. In particular, when I
  increase the size of the data, I face this problem.

 Exception in thread Thread-47 org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)
 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!



Re: NullPointerException

2014-12-31 Thread Josh Rosen
It looks like 'null' might be selected as a block replication peer?
https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L786

I know that we fixed some replication bugs in newer versions of Spark (such
as https://github.com/apache/spark/pull/2366), so it's possible that this
issue would be resolved by updating.  Can you try re-running your job with
a newer Spark version to see whether you still see the same error?

On Wed, Dec 31, 2014 at 10:35 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 spark-1.0.0

 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Which version of Spark are you using?

 On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek 
 kartheek.m...@gmail.com wrote:

 Hi,
 I get the following exception when I submit a Spark application that
 calculates the frequency of characters in a file. In particular, when I
 increase the size of the data, I face this problem.

 Exception in thread "Thread-47" org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)

 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)

 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!






Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven,

Do you have a small example program that you can share which will allow me
to reproduce this issue?  If you have a workload that runs into this, you
should be able to keep iteratively simplifying the job and reducing the
data set size until you hit a fairly minimal reproduction (assuming the
issue is deterministic, which it sounds like it is).
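
For example, a stripped-down job along these lines is roughly the shape I'd
hope to end up with (the data and operations here are made up; substitute the
transformations and rough data sizes from your real job):

from pyspark import SparkContext

sc = SparkContext(appName="shuffle-repro")

# Synthetic key/value data plus a groupByKey to force a shuffle; shrink the
# numbers (and swap in the real job's operations) until the failure still
# reproduces.
pairs = sc.parallelize(range(10 ** 6), 200).map(lambda x: (x % 1000, x))
print(pairs.groupByKey().mapValues(len).count())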

On Tue, Dec 30, 2014 at 9:49 AM, Sven Krasser kras...@gmail.com wrote:

 Hey all,

 Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails
 during shuffle. I've tried reverting from the sort-based shuffle back to
 the hash one, and that fails as well. Does anyone see similar problems or
 has an idea on where to look next?

 For the sort-based shuffle I get a bunch of exception like this in the
 executor logs:

 2014-12-30 03:13:04,061 ERROR [Executor task launch worker-2] 
 executor.Executor (Logging.scala:logError(96)) - Exception in task 4523.0 in 
 stage 1.0 (TID 4524)
 org.apache.spark.SparkException: PairwiseRDD: unexpected value: 
 List([B@130dc7ad)
   at 
 org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:307)
   at 
 org.apache.spark.api.python.PairwiseRDD$$anonfun$compute$2.apply(PythonRDD.scala:305)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
   at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

 For the hash-based shuffle, there are now a bunch of these exceptions in the 
 logs:


 2014-12-30 04:14:01,688 ERROR [Executor task launch worker-0] 
 executor.Executor (Logging.scala:logError(96)) - Exception in task 4479.0 in 
 stage 1.0 (TID 4480)
 java.io.FileNotFoundException: 
 /mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1419905501183_0004/spark-local-20141230035728-8fc0/23/merged_shuffle_1_68_0
  (No such file or directory)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
   at 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
   at 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
   at 
 org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
   at 
 org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

 Thank you!
 -Sven



 --
 http://sites.google.com/site/krasser/?utm_source=sig



Re: SparkContext with error from PySpark

2014-12-30 Thread Josh Rosen
To configure the Python executable used by PySpark, see the "Using the
Shell" (Python) section in the Spark Programming Guide:
https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell

You can set the PYSPARK_PYTHON environment variable to choose the Python
executable that will be used on the driver and executors.  In addition, you
can set PYSPARK_DRIVER_PYTHON to use a different Python executable only on
the driver (this is useful if you want to use IPython on the driver but not
on the executors).
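
As a rough illustration (assuming a python2.7 binary is installed at the same
path on the driver and on every executor; behavior can vary a bit between
releases), the variables are usually exported in the shell before launching
bin/pyspark or spark-submit, but PYSPARK_PYTHON can also be set from the
driver script as long as it happens before the SparkContext is created:

import os

# Assumption: a python2.7 binary exists on the driver and on every executor.
# PYSPARK_PYTHON is normally exported in the shell before running bin/pyspark
# or spark-submit; setting it here only takes effect if it happens before the
# SparkContext is constructed.  (PYSPARK_DRIVER_PYTHON only matters when you
# launch through bin/pyspark, so it is not shown here.)
os.environ["PYSPARK_PYTHON"] = "python2.7"

from pyspark import SparkContext

sc = SparkContext(appName="python-exec-check")

# Report the interpreter version the workers actually run.
print(sc.parallelize([0], 1).map(lambda _: __import__("sys").version).first())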

On Tue, Dec 30, 2014 at 11:13 AM, JAGANADH G jagana...@gmail.com wrote:

 Hi

 I am using Anaconda Python. Is there any way to specify the Python which we
 have to use for running PySpark on a cluster?

 Best regards

 Jagan

 On Tue, Dec 30, 2014 at 6:27 PM, Eric Friedman eric.d.fried...@gmail.com
 wrote:

 The Python installed in your cluster is 2.5. You need at least 2.6.

 
 Eric Friedman

  On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote:
 
  Hi Team,
 
  I was trying to execute a PySpark job on a cluster. It gives me the
 following
  error. (When I run the same job locally it works fine too :-()
 
  Error
 
  Error from python worker:
   /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209:
 Warning:
  'with' will become a reserved keyword in Python 2.6
   Traceback (most recent call last):
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/runpy.py,
  line 85, in run_module
   loader = get_loader(mod_name)
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 456, in get_loader
   return find_loader(fullname)
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 466, in find_loader
   for importer in iter_importers(fullname):
 File
 
 /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.5.2/lib/python2.5/pkgutil.py,
  line 422, in iter_importers
   __import__(pkg)
 File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/__init__.py,
  line 41, in module
   from pyspark.context import SparkContext
 File /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py,
  line 209
   with SparkContext._lock:
   ^
   SyntaxError: invalid syntax
  PYTHONPATH was:
 
 
 /usr/lib/spark-1.2.0-bin-hadoop2.3/python:/usr/lib/spark-1.2.0-bin-hadoop2.3/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/lib/spark-assembly-1.2.0-hadoop2.3.0.jar:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark-1.2.0-bin-hadoop2.3/sbin/../python:/home/beehive/bin/utils/primitives:/home/beehive/bin/utils/pylogger:/home/beehive/bin/utils/asterScript:/home/beehive/bin/lib:/home/beehive/bin/utils/init:/home/beehive/installer/packages:/home/beehive/ncli
  java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
 at
 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
 at
 org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
 at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 
  14/12/31 04:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0
 (TID
  1, aster4, NODE_LOCAL, 1321 bytes)
  14/12/31 04:49:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in
 memory
  on aster4:43309 (size: 3.8 KB, free: 265.0 MB)
  14/12/31 04:49:59 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID
 1) on
  executor aster4: org.apache.spark.SparkException (
 
 
  Any clue how to resolve the same.
 
  Best regards
 
  Jagan
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-with-error-from-PySpark-tp20907.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 




 --
 **
 JAGANADH G
 http://jaganadhg.in
 *ILUGCBE*
 

Re: action progress in ipython notebook?

2014-12-29 Thread Josh Rosen
It's accessed through the `statusTracker` field on SparkContext.

*Scala*:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker

*Java*:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkStatusTracker.html

Don't create new instances of this yourself; instead, use sc.statusTracker
to obtain the current instance.

This API is missing a bunch of stuff that's available in the web UI, but it
was designed so that we can add new methods without breaking binary
compatibility. Although it would technically be a new feature, I'd hope
that we can backport some additions to 1.2.1 since it's just adding a
facade / stable interface in front of JobProgressListener and thus has
little to no risk of introducing new bugs elsewhere in Spark.
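
As a rough sketch of the polling pattern in PySpark (this assumes a release
where the status API is exposed via sc.statusTracker(); the attribute names
mirror the Java API and may differ slightly between versions):

import threading
import time

from pyspark import SparkContext

sc = SparkContext(appName="progress-poll-demo")

def poll_progress():
    # Query the driver for active stages once a second and report task counts.
    tracker = sc.statusTracker()
    while True:
        for stage_id in tracker.getActiveStageIds():
            info = tracker.getStageInfo(stage_id)
            if info is not None:
                print("stage %d: %d/%d tasks complete"
                      % (stage_id, info.numCompletedTasks, info.numTasks))
        time.sleep(1)

poller = threading.Thread(target=poll_progress)
poller.daemon = True  # don't keep the process alive after the main thread exits
poller.start()

# Some work whose progress the poller will report on.
sc.parallelize(range(10 ** 6), 100).map(lambda x: x * x).count()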



On Mon, Dec 29, 2014 at 3:08 AM, Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 Hi Josh

 Is there documentation available for status API? I would like to use it.

 Thanks,
 Aniket


 On Sun Dec 28 2014 at 02:37:32 Josh Rosen rosenvi...@gmail.com wrote:

 The console progress bars are implemented on top of a new stable status
 API that was added in Spark 1.2.  It's possible to query job progress
 using this interface (in older versions of Spark, you could implement a
 custom SparkListener and maintain the counts of completed / running /
 failed tasks / stages yourself).

 There are actually several subtleties involved in implementing
 job-level progress bars which behave in an intuitive way; there's a
 pretty extensive discussion of the challenges at
 https://github.com/apache/spark/pull/3009.  Also, check out the pull
 request for the console progress bars for an interesting design discussion
 around how they handle parallel stages:
 https://github.com/apache/spark/pull/3029.

 I'm not sure about the plumbing that would be necessary to display live
 progress updates in the IPython notebook UI, though.  The general pattern
 would probably involve a mapping to relate notebook cells to Spark jobs
 (you can do this with job groups, I think), plus some periodic timer that
 polls the driver for the status of the current job in order to update the
 progress bar.

 For Spark 1.3, I'm working on designing a REST interface to access this
 type of job / stage / task progress information, as well as expanding the
 types of information exposed through the stable status API interface.

 - Josh

 On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman 
 eric.d.fried...@gmail.com wrote:

 Spark 1.2.0 is SO much more usable than previous releases -- many thanks
 to the team for this release.

 A question about progress of actions.  I can see how things are
 progressing using the Spark UI.  I can also see the nice ASCII art
 animation on the spark driver console.

 Has anyone come up with a way to accomplish something similar in an
 iPython notebook using pyspark?

 Thanks
 Eric





Re: action progress in ipython notebook?

2014-12-27 Thread Josh Rosen
The console progress bars are implemented on top of a new stable status
API that was added in Spark 1.2.  It's possible to query job progress
using this interface (in older versions of Spark, you could implement a
custom SparkListener and maintain the counts of completed / running /
failed tasks / stages yourself).

There are actually several subtleties involved in implementing job-level
progress bars which behave in an intuitive way; there's a pretty extensive
discussion of the challenges at https://github.com/apache/spark/pull/3009.
Also, check out the pull request for the console progress bars for an
interesting design discussion around how they handle parallel stages:
https://github.com/apache/spark/pull/3029.

I'm not sure about the plumbing that would be necessary to display live
progress updates in the IPython notebook UI, though.  The general pattern
would probably involve a mapping to relate notebook cells to Spark jobs
(you can do this with job groups, I think), plus some periodic timer that
polls the driver for the status of the current job in order to update the
progress bar.
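
A very rough sketch of that job-group idea, again assuming a PySpark release
that exposes the status API (the group name here is made up):

from pyspark import SparkContext

sc = SparkContext(appName="notebook-cell-demo")
tracker = sc.statusTracker()

# Tag everything submitted on behalf of one notebook cell with a group id
# ("cell-42" is just a made-up name), then look the jobs up by that group.
sc.setJobGroup("cell-42", "work for notebook cell 42")
sc.parallelize(range(10 ** 5), 20).map(lambda x: x + 1).count()

for job_id in tracker.getJobIdsForGroup("cell-42"):
    job = tracker.getJobInfo(job_id)
    if job is not None:
        print("job %d: status=%s stages=%s"
              % (job_id, job.status, list(job.stageIds)))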

For Spark 1.3, I'm working on designing a REST interface to access this
type of job / stage / task progress information, as well as expanding the
types of information exposed through the stable status API interface.

- Josh

On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman eric.d.fried...@gmail.com
wrote:

 Spark 1.2.0 is SO much more usable than previous releases -- many thanks
 to the team for this release.

 A question about progress of actions.  I can see how things are
 progressing using the Spark UI.  I can also see the nice ASCII art
 animation on the spark driver console.

 Has anyone come up with a way to accomplish something similar in an
 iPython notebook using pyspark?

 Thanks
 Eric



Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Josh Rosen
We have a mirror of the user and developer mailing lists on Nabble, but
unfortunately this has led to significant usability issues because users
may attempt to post messages through Nabble which silently fail to get
posted to the actual Apache list and thus are never read by most
subscribers:
http://apache-spark-developers-list.1001551.n3.nabble.com/Nabble-mailing-list-mirror-errors-quot-This-post-has-NOT-been-accepted-by-the-mailing-list-yet-quot-td9772.html.
In fact, there are replies to this very thread that were not properly
mirrored from the Apache list to Nabble.

Before Spark moved to Apache, our mailing list was hosted on Google
Groups.  Several community members were in favor of keeping the discussion
list on Google Groups, since its interface is a bit more user-friendly:
https://groups.google.com/forum/#!topic/spark-users/vtg-5db8JWY

See also:
https://mail-archives.apache.org/mod_mbox/spark-dev/201308.mbox/%3cce3c361b.fc97f%25chris.a.mattm...@jpl.nasa.gov%3E

Since Andy mentioned IRC, there's actually an #apache-spark channel on
Freenode (I idle there sometimes).

I'll comment more on the actual proposals here in a separate followup
email, but I just wanted to add a bit of additional context in the meantime.

- Josh

On Thu, Dec 25, 2014 at 5:36 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Nick,

 uh, I would have expected a rather heated discussion, but the opposite
 seems to be the case ;-)

 Independent of my personal preferences w.r.t. usability, habits etc., I
 think it is not good for a software/tool/framework if questions and
 discussions are spread over too many places. I guess everyone of us knows
 an example of where this makes/has made it very hard for newcomers to get
 started ;-)

 As it is now, I think the mailing list has somewhat of an official
 touch, while Stack Overflow is, well, Stack Overflow ;-) To introduce
 another discussion platform next to the mailing list (your proposal (2.))
 would increase confusion, the number of double-postings and, as you said,
 effectively fork the community. Your proposal (1.) sounds attractive, but I
 highly doubt that the user experience can match people's expectations
 towards the pure solution on either the mailing list or Discourse, given
 the rather different discussion styles.

 Having said that, I totally agree to the points you mentioned; even just
 linking to a thread where a question has been discussed before is very
 time-consuming and I would be happy to use a platform where all those
 points are addressed. Stack Overflow seems to provide that, too, and except
 for the broader range of discussions you mentioned, I don't see the
 benefit of using Discourse over Stack Overflow. So personally, I would
 suggest to go with (3.) and encourage SO as a platform for questions that
 are ok to be asked there and try to reduce/focus mailing list communication
 for everything else. (Note that this is pretty much the same state as now
 plus encouraging people in an unspecified way, which means that maybe
 nothing changes at all.)

 Just my 2 cent,
 Tobias


 On Wed Dec 24 2014 at 21:50:48 Nick Chammas nicholas.cham...@gmail.com
 wrote:

 When people have questions about Spark, there are 2 main places (as far
 as I can tell) where they ask them:

- Stack Overflow, under the apache-spark tag
http://stackoverflow.com/questions/tagged/apache-spark
- This mailing list

 The mailing list is valuable as an independent place for discussion that
 is part of the Spark project itself. Furthermore, it allows for a broader
 range of discussions than would be allowed on Stack Overflow
 http://stackoverflow.com/help/dont-ask.

 As the Spark project has grown in popularity, I see that a few problems
 have emerged with this mailing list:

- It’s hard to follow topics (e.g. Streaming vs. SQL) that you’re
interested in, and it’s hard to know when someone has mentioned you
specifically.
- It’s hard to search for existing threads and link information
across disparate threads.
- It’s hard to format code and log snippets nicely, and by
extension, hard to read other people’s posts with this kind of 
 information.

 There are existing solutions to all these (and other) problems based
 around straight-up discipline or client-side tooling, which users have to
 conjure up for themselves.

 I’d like us as a community to consider using Discourse
 http://www.discourse.org/ as an alternative to, or overlay on top of,
 this mailing list, that provides better out-of-the-box solutions to these
 problems.

 Discourse is a modern discussion platform built by some of the same
 people who created Stack Overflow. It has many neat features
 http://v1.discourse.org/about/ that I believe this community would
 benefit from.

 For example:

- When a user starts typing up a new post, they get a panel *showing
existing conversations that look similar*, just like on Stack
Overflow.
- It’s easy to search for posts and link between 

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-17 Thread Josh Rosen
Yeah, it looks like messages that are successfully posted via Nabble end up
on the Apache mailing list, but messages posted directly to Apache aren't
mirrored to Nabble anymore because it's based off the incubator mailing
list.  We should fix this so that Nabble posts to / archives the
non-incubator list.

On Sat, Dec 13, 2014 at 6:27 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Since you mentioned this, I had a related quandary recently -- it also says
 that the forum archives u...@spark.incubator.apache.org /
 d...@spark.incubator.apache.org respectively, yet the Community page
 clearly says to email the @spark.apache.org list (but the nabble archive
 is linked right there too). IMO even putting a clear explanation at the top

 Posting here requires that you create an account via the UI. Your message
 will be sent to both spark.incubator.apache.org and spark.apache.org (if
 that is the case, i'm not sure which alias nabble posts get sent to) would
 make things a lot more clear.

 On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com wrote:

 I've noticed that several users are attempting to post messages to
 Spark's user / dev mailing lists using the Nabble web UI (
 http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
 are many posts in Nabble that are not posted to the Apache lists and are
 flagged with "This post has NOT been accepted by the mailing list yet."
 errors.

 I suspect that the issue is that users are not completing the sign-up
 confirmation process (
 http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
 which is preventing their emails from being accepted by the mailing list.

 I wanted to mention this issue to the Spark community to see whether
 there are any good solutions to address this.  I have spoken to users who
 think that our mailing list is unresponsive / inactive because their
 un-posted messages haven't received any replies.

 - Josh




Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-13 Thread Josh Rosen
I've noticed that several users are attempting to post messages to Spark's
user / dev mailing lists using the Nabble web UI (
http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there are
many posts in Nabble that are not posted to the Apache lists and are
flagged with "This post has NOT been accepted by the mailing list yet."
errors.

I suspect that the issue is that users are not completing the sign-up
confirmation process (
http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
which is preventing their emails from being accepted by the mailing list.

I wanted to mention this issue to the Spark community to see whether there
are any good solutions to address this.  I have spoken to users who think
that our mailing list is unresponsive / inactive because their un-posted
messages haven't received any replies.

- Josh


Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread Josh Rosen
SerializableMapWrapper was added in
https://issues.apache.org/jira/browse/SPARK-3926; do you mind opening a new
JIRA and linking it to that one?

On Mon, Dec 1, 2014 at 12:17 AM, lokeshkumar lok...@dataken.net wrote:

 The workaround was to wrap the map returned by the Spark libraries in a
 HashMap and then broadcast it.
 Could anyone please let me know if there is an open issue for this?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/java-io-InvalidClassException-org-apache-spark-api-java-JavaUtils-SerializableMapWrapper-no-valid-cor-tp20034p20070.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: small bug in pyspark

2014-10-12 Thread Josh Rosen
Hi Andy,

You may be interested in https://github.com/apache/spark/pull/2651, a
recent pull request of mine which cleans up / simplifies the configuration
of PySpark's Python executables.  For instance, it makes it much easier to
control which Python options are passed when launching the PySpark drivers
and workers.

- Josh

On Fri, Oct 10, 2014 at 5:24 PM, Andy Davidson 
a...@santacruzintegration.com wrote:

 Hi

 I am running spark on an ec2 cluster. I need to update python to 2.7. I
 have been following the directions on
 http://nbviewer.ipython.org/gist/JoshRosen/6856670

 https://issues.apache.org/jira/browse/SPARK-922


 I noticed that when I start a shell using pyspark, I correctly got
 python2.7; however, when I tried to start a notebook I got python2.6



 change

 exec ipython $IPYTHON_OPTS

 to

  exec ipython2 $IPYTHON_OPTS


 One clean way to resolve this would be to add another environment
 variable like PYSPARK_PYTHON


 Andy



 P.S. Matplotlib does not upgrade because of dependency problems. I’ll let
 you know once I get this resolved






Re: What if I port Spark from TCP/IP to RDMA?

2014-10-12 Thread Josh Rosen
Hi Theo,

Check out *spark-perf*, a suite of performance benchmarks for Spark:
https://github.com/databricks/spark-perf.

- Josh

On Fri, Oct 10, 2014 at 7:27 PM, Theodore Si sjyz...@gmail.com wrote:

 Hi,

 Let's say that I managed to port Spark from TCP/IP to RDMA.
 What tool or benchmark can I use to test the performance improvement?

 BR,
 Theo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: pyspark on python 3

2014-10-03 Thread Josh Rosen
It would be great if we supported Python 3 and I'd be happy to review any
pull requests to add it.  I don't know that Python 3 is very widely-used,
but I'm open to supporting it if it won't require too much work.

By the way, we recently added support for PyPy:
https://github.com/apache/spark/pull/2144

- Josh



On Fri, Oct 3, 2014 at 6:44 PM, tomo cocoa cocoatom...@gmail.com wrote:

 Hi,

 I would also prefer that PySpark be able to run on Python 3.

 Do you have a particular reason or need to use PySpark with Python 3?
 If you create an issue on JIRA, I would try to resolve it.


 On 4 October 2014 06:47, Gen gen.tan...@gmail.com wrote:

 According to the official Spark site, the latest version of
 Spark (1.1.0) does not work with Python 3:

 Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses
 the
 standard CPython interpreter, so C libraries like NumPy can be used.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-tp15706p15707.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 class Cocoatomo:
 name = 'cocoatomo'
 email_address = 'cocoatom...@gmail.com'
 twitter_id = '@cocoatomo'



Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using
~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote:

 Hi,

 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.

 Is there a way to activate it, or is there a spark alternative to distcp?

 Thanks,
 Tomer

 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Question on mappartitionwithsplit

2014-08-17 Thread Josh Rosen
Has anyone tried using functools.partial (
https://docs.python.org/2/library/functools.html#functools.partial) with
PySpark?  If it works, it might be a nice way to address this use-case.
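
Something along these lines is what I have in mind; this is only a sketch,
loosely based on the indexing example below, and the per-partition offsets are
computed with glom() purely for illustration:

from functools import partial

from pyspark import SparkContext

def indexing(split_index, iterator, offset_lists):
    # offset_lists[i] is assumed to hold the number of elements in partition i.
    offset = sum(offset_lists[:split_index])
    for i, element in enumerate(iterator):
        yield (offset + i, element)

sc = SparkContext(appName="partial-demo")
rdd = sc.parallelize(range(10), 3)

# Per-partition sizes, computed up front just for this illustration.
sizes = rdd.glom().map(len).collect()

# partial() binds the extra argument, leaving the (index, iterator) signature
# that mapPartitionsWithIndex expects.
indexed = rdd.mapPartitionsWithIndex(partial(indexing, offset_lists=sizes))
print(indexed.collect())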


On Sun, Aug 17, 2014 at 7:35 PM, Davies Liu dav...@databricks.com wrote:

 On Sun, Aug 17, 2014 at 11:21 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  Hi,
Thanks for the response..
  In the second case f2??
  foo will have to be declared globally?? right??
 
  My function is something like:
  def indexing(splitIndex, iterator):
count = 0
offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
indexed = []
for i, e in enumerate(iterator):
  index = count + offset + i
  for j, ele in enumerate(e):
indexed.append((index, j, ele))
yield indexed

 In this function `indexing`, `offset_lists` should be global.

  def another_funct(offset_lists):
  #get that damn offset_lists
  rdd.mapPartitionsWithSplit(indexing)
  But then, the issue is that offset_lists?
  Any suggestions?

 Basically, you can do what you would do in a normal Python program; PySpark
 will send the global variables or closures to worker processes
 automatically.

 So, you can :

 def indexing(splitIndex, iterator, offset_lists):
  pass

 def another_func(offset_lists):
  rdd.mapPartitionsWithSplit(lambda index, it: indexing(index, it,
 offset_lists))

 Or:

 def indexing(splitIndex, iterator):
  # access offset_lists
  pass

 def another_func(offset):
  global offset_lists
  offset_lists = offset
  rdd.mapPartitionsWithSplit(indexing)

 Or:

 def another_func(offset_lists):
   def indexing(index, iterator):
     # access offset_lists
     pass
   rdd.mapPartitionsWithIndex(indexing)


 
  On Sun, Aug 17, 2014 at 11:15 AM, Davies Liu dav...@databricks.com
 wrote:
 
  The callback function f only accepts 2 arguments; if you want to pass
  another objects to it, you need closure, such as:
 
  foo=xxx
  def f(index, iterator, foo):
   yield (index, foo)
  rdd.mapPartitionsWithIndex(lambda index, it: f(index, it, foo))
 
  you can also make f a `closure`:
 
  def f2(index, iterator):
  yield (index, foo)
  rdd.mapPartitionsWithIndex(f2)
 
  On Sun, Aug 17, 2014 at 10:25 AM, Chengi Liu chengi.liu...@gmail.com
  wrote:
   Hi,
 In this example:
  
  
 http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#mapPartitionsWithSplit
   Let say, f takes three arguments:
   def f(splitIndex, iterator, foo): yield splitIndex
   Now, how do i send this foo parameter to this method?
   rdd.mapPartitionsWithSplit(f)
   Thanks
  
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Broadcasting a set in PySpark

2014-07-18 Thread Josh Rosen
You have to use `myBroadcastVariable.value` to access the broadcasted
value; see
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
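
Applied to your snippet, the pattern would look roughly like this (tiny
made-up dataset, same variable names as in your mail):

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-set-demo")

# Stand-ins for the real data, just to show the pattern.
uid_male_set = set([101, 104, 107])
uid_male_setbc = sc.broadcast(uid_male_set)

uid_iid_iscore = sc.parallelize([(101, ["i1"], [0.9]), (202, ["i2"], [0.4])])

# Note the .value: the Broadcast wrapper itself is not iterable, but the set
# it wraps is.
flagged = uid_iid_iscore.map(
    lambda x: (x[0], list(zip(x[1], x[2])), x[0] in uid_male_setbc.value))
print(flagged.collect())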


On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania 
ved...@retentionscience.com wrote:

 Hi All,

 I am trying to broadcast a set in a PySpark script.

 I create the set like this:

 Uid_male_set = set(maleUsers.map(lambda x:x[1]).collect())


 Then execute this line:


 uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(lambda
 x:(x[0],zip(x[1],x[2]),x[0] in Uid_male_set))


  An error occurred while calling o104.collectPartitions.

 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Serialized task 1131:0 was 23503247 bytes which exceeds
 spark.akka.frameSize (10485760 bytes). Consider using broadcast variables
 for large values.



 So I tried broadcasting it:

 Uid_male_setbc = sc.broadcast(Uid_male_set)


  Uid_male_setbc

 pyspark.broadcast.Broadcast object at 0x1ba2ed0


 Then I execute it line:


 uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(lambda
 x:(x[0],zip(x[1],x[2]),x[0] in Uid_male_setbc))

 File "<stdin>", line 1, in <lambda>

 TypeError: argument of type 'Broadcast' is not iterable

  [duplicate 1]


 So I am stuck either way: the script runs well locally on a smaller
 dataset, but throws this error. Could anyone point out how to correct
 this or where I am going wrong?

 Thanks


 *Vedant Dhandhania*

 *Retention** Science*

 call: 805.574.0873

 visit: Site http://www.retentionscience.com/ | like: Facebook
 http://www.facebook.com/RetentionScience | follow: Twitter
 http://twitter.com/RetentionSci



Re: flatten RDD[RDD[T]]

2014-03-02 Thread Josh Rosen
Nope, nested RDDs aren't supported:

https://groups.google.com/d/msg/spark-users/_Efj40upvx4/DbHCixW7W7kJ
https://groups.google.com/d/msg/spark-users/KC1UJEmUeg8/N_qkTJ3nnxMJ
https://groups.google.com/d/msg/spark-users/rkVPXAiCiBk/CORV5jyeZpAJ
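
The usual workaround is to keep the inner datasets as an ordinary driver-side
collection of RDDs and flatten them with SparkContext.union, rather than
nesting RDD operations inside a transformation. A small sketch of the idea in
PySpark (your snippet is Scala, but the shape is the same):

from pyspark import SparkContext

sc = SparkContext(appName="flatten-demo")

# Instead of an RDD[RDD[Int]], keep the inner RDDs in a plain Python list...
inner_rdds = [sc.parallelize(range(n)) for n in (3, 5, 2)]

# ...and flatten them on the driver with union, which yields a single RDD.
flat = sc.union(inner_rdds)
print(flat.collect())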


On Sun, Mar 2, 2014 at 5:37 PM, Cosmin Radoi cosmin.ra...@gmail.com wrote:


 I'm trying to flatten an RDD of RDDs. The straightforward approach:

 a: [RDD[RDD[Int]]
 a flatMap { _.collect }

 throws a java.lang.NullPointerException at
 org.apache.spark.rdd.RDD.collect(RDD.scala:602)

 In a more complex scenario I also got:
 Task not serializable: java.io.NotSerializableException:
 org.apache.spark.SparkContext
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

 So I guess this may be related to the context not being available inside
 the map.

 Are nested RDDs not supported?

 Thanks,

 Cosmin Radoi