Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)<
r.dan...@elsevier.com> wrote:

> Mostly I was just objecting to " Redshift does very well, but Shark is on
> par or better than it in most of the tests " when that was not how I read
> the results, and Redshift was on HDDs.


My bad. You are correct; the only test Shark (mem) does better on is test
#1 "Scan Query".

And indeed, it would be good to see an updated benchmark with Redshift
running on SSDs.

Nick


Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 4:30 PM, Daniel, Ronald (ELS-SDG) <
r.dan...@elsevier.com> wrote:

> Major architectural advantage to Spark.


Amen to that. For a really cool and succinct demonstration of this, check
out Aaron's demo at the Hadoop
Summit earlier this year where he combines SQL, machine learning, and stream
processing using Spark. I don't think you can do this with any other
platform.

Nick


Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:08 AM, 诺铁  wrote:

> what if network broken in half of the process?  should we drop all data in
> database and restart from beginning?


The best way to deal with this -- which, unfortunately, is not commonly
supported -- is with a two-phase commit that can span connections.
PostgreSQL supports it, for example.

This would guarantee that a multi-connection data load is atomic.
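
For anyone curious what that looks like, here is a minimal sketch using
psycopg2's two-phase-commit API against PostgreSQL. The connection string,
table, and transaction id are placeholders, and the server needs
max_prepared_transactions > 0:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=loader")    # placeholder connection
    xid = conn.xid(1, "load-batch-42", "partition-0")     # arbitrary global txn id

    conn.tpc_begin(xid)
    cur = conn.cursor()
    cur.execute("INSERT INTO events VALUES (%s, %s)", (1, "hello"))  # placeholder table
    conn.tpc_prepare()   # phase 1: the work is durably prepared but not yet visible

    # Phase 2 runs only after every participating connection has prepared:
    conn.tpc_commit()    # or conn.tpc_rollback() if any participant failed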

Nick


Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian  wrote:

> Maybe a little off topic, but would you mind to share your motivation of
> saving the RDD into an SQL DB?


Many possible reasons (Vida, please chime in with yours!):

   - You have an existing database you want to load new data into so
   everything's together.
   - You want very low query latency, which you can probably get with Spark
   SQL but currently not with the ease you can get it from your average DBMS.
   - Tooling around traditional DBMSs is currently much more mature than
   tooling around Spark SQL, especially in the JDBC area.

Nick


Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
Vida,

What kind of database are you trying to write to?

For example, I found that for loading into Redshift, by far the easiest
thing to do was to save my output from Spark as a CSV to S3, and then load
it from there into Redshift. This is not as slow as you think, because Spark
can write the output in parallel to S3, and Redshift, too, can load data
from multiple files in parallel
<http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html>
.
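
A rough sketch of that pattern in PySpark, with a placeholder bucket, table,
and credentials, and assuming records is an RDD of tuples you have already
computed:

    # 1) Write the RDD out as CSV part files to S3, in parallel across the cluster.
    csv_lines = records.map(lambda r: ",".join(str(field) for field in r))
    csv_lines.saveAsTextFile("s3n://my-bucket/output/sales/")

    # 2) Have Redshift load every part file under that prefix in parallel.
    copy_sql = """
        COPY sales
        FROM 's3://my-bucket/output/sales/part-'
        CREDENTIALS 'aws_access_key_id=<KEY>;aws_secret_access_key=<SECRET>'
        DELIMITER ',';
    """
    # run copy_sql against Redshift with your SQL client of choice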

Nick


On Thu, Aug 7, 2014 at 1:52 PM, Vida Ha  wrote:

> The use case I was thinking of was outputting calculations made in Spark
> into a SQL database for the presentation layer to access.  So in other
> words, having a Spark backend in Java that writes to a SQL database and
> then having a Rails front-end that can display the data nicely.
>
>
> On Thu, Aug 7, 2014 at 8:42 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian 
>> wrote:
>>
>>> Maybe a little off topic, but would you mind to share your motivation of
>>> saving the RDD into an SQL DB?
>>
>>
>> Many possible reasons (Vida, please chime in with yours!):
>>
>>- You have an existing database you want to load new data into so
>>everything's together.
>>- You want very low query latency, which you can probably get with
>>Spark SQL but currently not with the ease you can get it from your average
>>DBMS.
>>- Tooling around traditional DBMSs is currently much more mature than
>>tooling around Spark SQL, especially in the JDBC area.
>>
>> Nick
>>
>
>


Re: Subscribing to news releases

2014-08-14 Thread Nicholas Chammas
I've created an issue to track this: SPARK-3044: Create RSS feed for Spark
News 


On Fri, May 30, 2014 at 11:07 AM, Nick Chammas 
wrote:

> Is there a way to subscribe to news releases? That would be swell.
>
> Nick
>
>
> --
> View this message in context: Subscribing to news releases
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Open source project: Deploy Spark to a cluster with Puppet and Fabric.

2014-08-16 Thread Nicholas Chammas
Hey Brandon,

Thank you for sharing this.

What is the relationship of this project to the spark-ec2 tool that comes
with Spark? Does it provide a superset of the functionality of spark-ec2?

Nick


2014년 8월 13일 수요일, bdamos님이 작성한 메시지:

> Hi Spark community,
>
> We're excited about Spark at Adobe Research and have
> just open sourced a project we use to automatically provision
> a Spark cluster and submit applications.
> The project is on GitHub, and we're happy for any feedback
> from the community:
> https://github.com/adobe-research/spark-cluster-deployment
>
> Regards,
> Brandon.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-project-Deploy-Spark-to-a-cluster-with-Puppet-and-Fabric-tp12057.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>


Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Nicholas Chammas
Yeah, I just picked the link up from a post somewhere on Stack Overflow.
Dunno where the original poster got it from.


On Mon, Aug 25, 2014 at 9:50 PM, Matei Zaharia 
wrote:

> It seems to be because you went there with https:// instead of http://.
> That said, we'll fix it so that it works on both protocols.
>
> Matei
>
> On August 25, 2014 at 1:56:16 PM, Nick Chammas (nicholas.cham...@gmail.com)
> wrote:
>
> https://spark.apache.org/screencasts/1-first-steps-with-spark.html
>
> The embedded YouTube video shows up in Safari on OS X but not in Chrome.
>
> How come?
>
> Nick
>
>
> --
> View this message in context: Spark Screencast doesn't show in Chrome on
> OS X
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
For the record, I'm using Chrome 36.0.1985.143 on 10.9.4 as well. Maybe
it's a Chrome add-on I'm running?

Anyway, as Matei pointed out, if I change the https to http, it works fine.


On Tue, Aug 26, 2014 at 1:46 AM, Michael Hausenblas <
michael.hausenb...@gmail.com> wrote:

>
> > https://spark.apache.org/screencasts/1-first-steps-with-spark.html
> >
> > The embedded YouTube video shows up in Safari on OS X but not in Chrome.
>
> I’m using Chrome 36.0.1985.143 on MacOS 10.9.4 and it works like a
> charm for me.
>
>
> Cheers,
> Michael
>
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
>
> On 25 Aug 2014, at 21:55, Nick Chammas  wrote:
>
> > https://spark.apache.org/screencasts/1-first-steps-with-spark.html
> >
> > The embedded YouTube video shows up in Safari on OS X but not in Chrome.
> >
> > How come?
> >
> > Nick
> >
> >
> > View this message in context: Spark Screencast doesn't show in Chrome on
> OS X
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 10:28 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Maybe it's a Chrome add-on I'm running?


Hmm, scratch that. Trying in incognito mode (which disables add-ons, I
believe) also yields the same behavior.

Nick


Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
Confirmed. Works now. Thanks Matei.

(BTW, on OS X Command + Shift + R also refreshes the page without cache.)


On Tue, Aug 26, 2014 at 3:06 PM, Matei Zaharia 
wrote:

> It should be fixed now. Maybe you have a cached version of the page in
> your browser. Open DevTools (cmd-shift-I), press the gear icon, and check
> "disable cache while devtools open", then refresh the page to refresh
> without cache.
>
> Matei
>
> On August 26, 2014 at 7:31:18 AM, Nicholas Chammas (
> nicholas.cham...@gmail.com) wrote:
>
>  On Tue, Aug 26, 2014 at 10:28 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Maybe it's a Chrome add-on I'm running?
>
>
> Hmm, scratch that. Trying in incognito mode (which disables add-ons, I
> believe) also yields the same behavior.
>
> Nick
>
>


Re: how can I get the number of cores

2014-08-29 Thread Nicholas Chammas
What version of Spark are you running?

Try calling sc.defaultParallelism. I’ve found that it is typically set to
the number of worker cores in your cluster.
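
For example, from the PySpark shell or a small script (a sketch; the exact
value reported depends on your master and configuration):

    from pyspark import SparkContext

    sc = SparkContext(appName="core-count-check")
    print(sc.defaultParallelism)   # typically the total worker cores on standalone clusters
    sc.stop()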
​


On Fri, Aug 29, 2014 at 3:39 AM, Kevin Jung  wrote:

> Hi all
> Spark web ui gives me the information about total cores and used cores.
> I want to get this information programmatically.
> How can I do this?
>
> Thanks
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-can-I-get-the-number-of-cores-tp13111.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: unsubscribe

2014-09-06 Thread Nicholas Chammas
To unsubscribe send an email to user-unsubscr...@spark.apache.org

Links to sub/unsub are here: https://spark.apache.org/community.html


On Sat, Sep 6, 2014 at 7:52 AM, Derek Schoettle  wrote:

> Unsubscribe
>
> > On Sep 6, 2014, at 7:48 AM, "Murali Raju" 
> wrote:
> >
> >
>


Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Nicholas Chammas
I think you need to run start-all.sh or something similar on the EC2
cluster. MR is installed but is not running by default on EC2 clusters spun
up by spark-ec2.
​

On Sun, Sep 7, 2014 at 12:33 PM, Tomer Benyamini 
wrote:

> I've installed a spark standalone cluster on ec2 as defined here -
> https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if
> mr1/2 is part of this installation.
>
>
> On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin  wrote:
> > Distcp requires a mr1(or mr2) cluster to start. Do you have a mapreduce
> > cluster on your hdfs?
> > And from the error message, it seems that you didn't specify your
> jobtracker
> > address.
> >
> > --
> > Ye Xianjin
> > Sent with Sparrow
> >
> > On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote:
> >
> > Hi,
> >
> > I would like to copy log files from s3 to the cluster's
> > ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> > running on the cluster - I'm getting the exception below.
> >
> > Is there a way to activate it, or is there a spark alternative to distcp?
> >
> > Thanks,
> > Tomer
> >
> > mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> > org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> > Invalid "mapreduce.jobtracker.address" configuration value for
> > LocalJobRunner : "XXX:9001"
> >
> > ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
> >
> > java.io.IOException: Cannot initialize Cluster. Please check your
> > configuration for mapreduce.framework.name and the correspond server
> > addresses.
> >
> > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
> >
> > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
> >
> > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
> >
> > at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
> >
> > at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
> >
> > at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
> >
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >
> > at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Nicholas Chammas
Tomer,

Did you try start-all.sh? It worked for me the last time I tried using
distcp, and it worked for this guy too.

Nick
​

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini  wrote:

> ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;
>
> I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
> ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
> when trying to run distcp:
>
> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
>
> java.io.IOException: Cannot initialize Cluster. Please check your
> configuration for mapreduce.framework.name and the correspond server
> addresses.
>
> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
>
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
>
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
>
> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
>
> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
>
> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>
> at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
>
> Any idea?
>
> Thanks!
> Tomer
>
> On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen  wrote:
> > If I recall, you should be able to start Hadoop MapReduce using
> > ~/ephemeral-hdfs/sbin/start-mapred.sh.
> >
> > On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini 
> wrote:
> >>
> >> Hi,
> >>
> >> I would like to copy log files from s3 to the cluster's
> >> ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> >> running on the cluster - I'm getting the exception below.
> >>
> >> Is there a way to activate it, or is there a spark alternative to
> distcp?
> >>
> >> Thanks,
> >> Tomer
> >>
> >> mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> >> org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> >> Invalid "mapreduce.jobtracker.address" configuration value for
> >> LocalJobRunner : "XXX:9001"
> >>
> >> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
> >>
> >> java.io.IOException: Cannot initialize Cluster. Please check your
> >> configuration for mapreduce.framework.name and the correspond server
> >> addresses.
> >>
> >> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
> >>
> >> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
> >>
> >> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
> >>
> >> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
> >>
> >> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
> >>
> >> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
> >>
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >>
> >> at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Announcing Spark 1.1.0!

2014-09-11 Thread Nicholas Chammas
Nice work everybody! I'm looking forward to trying out this release!

On Thu, Sep 11, 2014 at 8:12 PM, Patrick Wendell  wrote:

> I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
> the second release on the API-compatible 1.X line. It is Spark's
> largest release ever, with contributions from 171 developers!
>
> This release brings operational and performance improvements in Spark
> core including a new implementation of the Spark shuffle designed for
> very large scale workloads. Spark 1.1 adds significant extensions to
> the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a
> JDBC server, byte code generation for fast expression evaluation, a
> public types API, JSON support, and other features and optimizations.
> MLlib introduces a new statistics library along with several new
> algorithms and optimizations. Spark 1.1 also builds out Spark's Python
> support and adds new components to the Spark Streaming module.
>
> Visit the release notes [1] to read about the new features, or
> download [2] the release today.
>
> [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html
> [2] http://spark.eu.apache.org/downloads.html
>
> NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL
> HOURS.
>
> Please e-mail me directly for any type-o's in the release notes or name
> listing.
>
> Thanks, and congratulations!
> - Patrick
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: DistCP - Spark-based

2014-09-11 Thread Nicholas Chammas
I've created SPARK-3499 to track creating a Spark-based distcp utility.

Nick

On Tue, Aug 12, 2014 at 4:20 PM, Matei Zaharia 
wrote:

> Good question; I don't know of one but I believe people at Cloudera had
> some thoughts of porting Sqoop to Spark in the future, and maybe they'd
> consider DistCP as part of this effort. I agree it's missing right now.
>
> Matei
>
> On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com)
> wrote:
>
> We are probably still the minority, but our analytics platform based on
> Spark + HDFS does not have map/reduce installed.  I'm wondering if there is
> a distcp equivalent that leverages Spark to do the work.
>
> Our team is trying to find the best way to do cross-datacenter replication
> of our HDFS data to minimize the impact of outages/dc failure.
>
>


Re: Spark and Scala

2014-09-12 Thread Nicholas Chammas
unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
that doesn't change in Spark.
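
The same distinction holds in PySpark, for what it's worth (a trivial sketch):

    rdd = sc.parallelize(range(100)).persist()   # an RDD: persist/unpersist exist
    rdd.count()
    rdd.unpersist()

    temp = 5
    # temp.unpersist()   # fails: a plain integer is not an RDD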

On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
wrote:

> There is one thing that I am confused about.
> Spark has codes that have been implemented in Scala. Now, can we run any
> Scala code on the Spark framework? What will be the difference in the
> execution of the scala code in normal systems and on Spark?
> The reason for my question is the following:
> I had a variable
> *val temp = *
> This temp was being created inside the loop, so as to manually throw it
> out of the cache, every time the loop ends I was calling
> *temp.unpersist()*, this was returning an error saying that *value
> unpersist is not a method of Int*, which means that temp is an Int.
> Can some one explain to me why I was not able to call *unpersist* on
> *temp*?
>
> Thank You
>


Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Nicholas Chammas
Andrew,

This email was pretty helpful. I feel like this stuff should be summarized
in the docs somewhere, or perhaps in a blog post.

Do you know if it is?

Nick


On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash  wrote:

> The locality is how close the data is to the code that's processing it.
>  PROCESS_LOCAL means data is in the same JVM as the code that's running, so
> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
> same node, or in another executor on the same node, so is a little slower
> because the data has to travel across an IPC connection.  RACK_LOCAL is
> even slower -- data is on a different server so needs to be sent over the
> network.
>
> Spark switches to lower locality levels when there's no unprocessed data
> on a node that has idle CPUs.  In that situation you have two options: wait
> until the busy CPUs free up so you can start another task that uses data on
> that server, or start a new task on a farther away server that needs to
> bring data from that remote place.  What Spark typically does is wait a bit
> in the hopes that a busy CPU frees up.  Once that timeout expires, it
> starts moving the data from far away to the free CPU.
>
> The main tunable option is how far long the scheduler waits before
> starting to move data rather than code.  Those are the spark.locality.*
> settings here: http://spark.apache.org/docs/latest/configuration.html
>
> If you want to prevent this from happening entirely, you can set the
> values to ridiculously high numbers.  The documentation also mentions that
> "0" has special meaning, so you can try that as well.
>
> Good luck!
> Andrew
>
>
> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung 
> wrote:
>
>> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
>> assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
>>
>> When these happen things get extremely slow.
>>
>> Does this mean that the executor got terminated and restarted?
>>
>> Is there a way to prevent this from happening (barring the machine
>> actually going down, I'd rather stick with the same process)?
>>
>
>
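
For reference, the spark.locality.* settings Andrew mentions are ordinary
Spark configuration entries; a minimal PySpark sketch (the app name and the
10-second wait are arbitrary examples, with the value in milliseconds as in
Spark 1.x):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("locality-demo")
            .set("spark.locality.wait", "10000"))   # wait up to 10s for a local slot
    sc = SparkContext(conf=conf)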


Re: RDDs and Immutability

2014-09-13 Thread Nicholas Chammas
Have you tried using RDD.map() to transform some of the RDD elements from 0
to 1? Why doesn’t that work? That’s how you change data in Spark, by
defining a new RDD that’s a transformation of an old one.
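
For instance, something along these lines in PySpark (a sketch; the predicate
deciding which positions flip is made up):

    flags = sc.parallelize([(i, 0) for i in range(1000)])   # (index, flag) pairs, all zero

    # "Change" selected elements by deriving a new RDD rather than mutating in place.
    flipped = flags.map(lambda kv: (kv[0], 1) if kv[0] % 7 == 0 else kv)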
​

On Sat, Sep 13, 2014 at 5:39 AM, Deep Pradhan 
wrote:

> Hi,
> We all know that RDDs are immutable.
> There are not enough operations that can achieve anything and everything
> on RDDs.
> Take for example this:
> I want an Array of Bytes filled with zeros which during the program should
> change. Some elements of that Array should change to 1.
> If I make an RDD with all elements as zero, I won't be able to change the
> elements. On the other hand, if I declare as Array then so much memory will
> be consumed.
> Please clarify this to me.
>
> Thank You
>


Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Any tips from anybody on how to do this in PySpark? (Or regular Spark, for
that matter.)

On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas 
wrote:

> Howdy doody Spark Users,
>
> I’d like to somehow write out a single RDD to multiple paths in one go.
> Here’s an example.
>
> I have an RDD of (key, value) pairs like this:
>
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
> >>> 'Frankie']).keyBy(lambda x: x[0])>>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>
> Now I want to write the RDD out to different paths depending on the keys,
> so that I have one output directory per distinct key. Each output directory
> could potentially have multiple part- files or whatever.
>
> So my output would be something like:
>
> /path/prefix/n [/part-1, /part-2, etc]
> /path/prefix/b [/part-1, /part-2, etc]
> /path/prefix/f [/part-1, /part-2, etc]
>
> How would you do that?
>
> I suspect I need to use saveAsNewAPIHadoopFile
> 
> or saveAsHadoopFile
> 
> along with the MultipleTextOutputFormat output format class, but I’m not
> sure how.
>
> By the way, there is a very similar question to this here on Stack
> Overflow
> 
> .
>
> Nick
> ​
>
> --
> View this message in context: Write 1 RDD to multiple output paths in one
> go
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Davies,

That’s pretty neat. I heard there was a pure Python clone of Spark out
there—so you were one of the people behind it!

I’ve created a JIRA issue about this. SPARK-3533: Add saveAsTextFileByKey()
method to RDDs <https://issues.apache.org/jira/browse/SPARK-3533>

Sean,

I think you might be able to get this working with a subclass of
MultipleTextOutputFormat, which overrides generateFileNameForKeyValue,
generateActualKey, etc. A bit of work for sure, but probably works.

I’m looking at how to make this work in PySpark as of 1.1.0. The closest
examples I can see of how to use the saveAsHadoop...() methods in this way
are these two examples: HBase Output Format
<https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/examples/src/main/python/hbase_outputformat.py#L60>
and Avro Input Format
<https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/examples/src/main/python/avro_inputformat.py#L73>

Basically, I’m thinking I need to subclass MultipleTextOutputFormat and
override some methods in a Scala file, and then reference that from Python?
Like how the AvroWrapperToJavaConverter class is done? Seems pretty
involved, but I’ll give it a shot if that’s the right direction to go in.
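
In the meantime, a crude but workable fallback in PySpark, assuming the number
of distinct keys is small enough to loop over, is to filter and save once per
key (at the cost of one pass over the data per key):

    a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])

    for k in a.keys().distinct().collect():
        (a.filter(lambda kv, k=k: kv[0] == k)   # default arg binds k for this iteration
          .values()
          .saveAsTextFile('/path/prefix/%s' % k.lower()))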

Nick
​

On Mon, Sep 15, 2014 at 1:08 PM, Davies Liu  wrote:

> Maybe we should provide an API like saveTextFilesByKey(path),
> could you create an JIRA for it ?
>
> There is one in DPark [1] actually.
>
> [1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309
>
> On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas
>  wrote:
> > Any tips from anybody on how to do this in PySpark? (Or regular Spark,
> for
> > that matter.)
> >
> > On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas <
> nicholas.cham...@gmail.com>
> > wrote:
> >>
> >> Howdy doody Spark Users,
> >>
> >> I’d like to somehow write out a single RDD to multiple paths in one go.
> >> Here’s an example.
> >>
> >> I have an RDD of (key, value) pairs like this:
> >>
> >> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben',
> >> >>> 'Frankie']).keyBy(lambda x: x[0])
> >> >>> a.collect()
> >> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F',
> >> 'Frankie')]
> >>
> >> Now I want to write the RDD out to different paths depending on the
> keys,
> >> so that I have one output directory per distinct key. Each output
> directory
> >> could potentially have multiple part- files or whatever.
> >>
> >> So my output would be something like:
> >>
> >> /path/prefix/n [/part-1, /part-2, etc]
> >> /path/prefix/b [/part-1, /part-2, etc]
> >> /path/prefix/f [/part-1, /part-2, etc]
> >>
> >> How would you do that?
> >>
> >> I suspect I need to use saveAsNewAPIHadoopFile or saveAsHadoopFile along
> >> with the MultipleTextOutputFormat output format class, but I’m not sure
> how.
> >>
> >> By the way, there is a very similar question to this here on Stack
> >> Overflow.
> >>
> >> Nick
> >>
> >>
> >> 
> >> View this message in context: Write 1 RDD to multiple output paths in
> one
> >> go
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
>


Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Nicholas Chammas
Btw, there are some examples in the Spark GitHub repo that you may find
helpful. Here's one related to HBase.

On Tue, Sep 16, 2014 at 1:22 PM,  wrote:

>  *Hi, *
>
>
>
> *I had a similar situation in which I needed to read data from HBase and
> work with the data inside of a spark context. After much googling, I
> finally got mine to work. There are a bunch of steps that you need to do to
> get this working – *
>
>
>
> *The problem is that the spark context does not know anything about hbase,
> so you have to provide all the information about hbase classes to both the
> driver code and executor code…*
>
>
>
>
>
> SparkConf sconf = *new* SparkConf().setAppName(“App").setMaster("local");
>
> JavaSparkContext sc = *new* JavaSparkContext(sconf);
>
>
>
> sparkConf.set("spark.executor.extraClassPath", "$(hbase classpath)");  // <==
> you will need to add this to tell the executor about the classpath for
> HBase.
>
>
>
> Configuration conf = HBaseConfiguration.*create*();
>
> conf.set(*TableInputFormat*.INPUT_TABLE, "Article");
>
>
>
> JavaPairRDD hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class,
> org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
> org.apache.hadoop.hbase.client.Result.class);
>
>
>
>
>
> *Then when you submit the spark job – *
>
>
>
>
>
> *spark-submit --driver-class-path $(hbase classpath) --jars
> /usr/lib/hbase/hbase-server.jar,/usr/lib/hbase/hbase-client.jar,/usr/lib/hbase/hbase-common.jar,/usr/lib/hbase/hbase-protocol.jar,/usr/lib/hbase/lib/protobuf-java-2.5.0.jar,/usr/lib/hbase/lib/htrace-core.jar
> --class YourClassName --master local App.jar *
>
>
>
>
>
> Try this and see if it works for you.
>
>
>
>
>
> *From:* Y. Dong [mailto:tq00...@gmail.com]
> *Sent:* Tuesday, September 16, 2014 8:18 AM
> *To:* user@spark.apache.org
> *Subject:* HBase and non-existent TableInputFormat
>
>
>
> Hello,
>
>
>
> I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply
> read from hbase. The Java code is attached. However the problem is
> TableInputFormat does not even exist in hbase-client API, is there any
> other way I can read from
>
> hbase? Thanks
>
>
>
> SparkConf sconf = *new* SparkConf().setAppName(“App").setMaster("local");
>
> JavaSparkContext sc = *new* JavaSparkContext(sconf);
>
>
>
> Configuration conf = HBaseConfiguration.*create*();
>
> conf.set(*TableInputFormat*.INPUT_TABLE, "Article");
>
>
>
> JavaPairRDD hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class,
> org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
> org.apache.hadoop.hbase.client.Result.class);
>
>
>
>
>
>
>


Re: how to report documentation bug?

2014-09-16 Thread Nicholas Chammas
You can send an email like you just did or open an issue in the Spark issue
tracker. This looks like a problem with
how the version is generated in this file.

On Tue, Sep 16, 2014 at 8:55 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

>
>
> http://spark.apache.org/docs/latest/quick-start.html#standalone-applications
>
> Click on java tab There is a bug in the maven section
>
>   1.1.0-SNAPSHOT
>
>
>
> Should be
> 1.1.0
>
> Hope this helps
>
> Andy
>


Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread Nicholas Chammas
Which appears in turn to be caused by SPARK-1476.

On Wed, Sep 17, 2014 at 9:14 PM, francisco  wrote:

> Looks like this is a known issue:
>
> https://issues.apache.org/jira/browse/SPARK-1353
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Size-exceeds-Integer-MAX-VALUE-in-BlockFetcherIterator-tp14483p14500.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Nicholas Chammas
Dunno about having the application be independent of whether spark-submit
is still alive, but you can have spark-submit run in a new session in Linux
using setsid.

That way even if you terminate your SSH session, spark-submit will keep
running independently. Of course, if you terminate the host running
spark-submit, you will still have problems.
​

On Thu, Sep 18, 2014 at 4:19 AM, Tobias Pfeiffer  wrote:

> Hi,
>
> I am wondering: Is it possible to run spark-submit in a mode where it will
> start an application on a YARN cluster (i.e., driver and executors run on
> the cluster) and then forget about it in the sense that the Spark
> application is completely independent from the host that ran the
> spark-submit command and will not be affected if that controlling machine
> shuts down etc.? I was using spark-submit with YARN in cluster mode, but
> spark-submit stayed in the foreground and as far as I understood, it
> terminated the application on the cluster when spark-submit was Ctrl+C'ed.
>
> Thanks
> Tobias
>


Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Nicholas Chammas
And for the record, the issue is here:
https://issues.apache.org/jira/browse/SPARK-3591

On Thu, Sep 18, 2014 at 1:19 PM, Andrew Or  wrote:

> Thanks Tobias, I have filed a JIRA for it.
>
> 2014-09-18 10:09 GMT-07:00 Patrick Wendell :
>
> I agree, that's a good idea Marcelo. There isn't AFAIK any reason the
>> client needs to hang there for correct operation.
>>
>> On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin 
>> wrote:
>> > Yes, what Sandy said.
>> >
>> > On top of that, I would suggest filing a bug for a new command line
>> > argument for spark-submit to make the launcher process exit cleanly as
>> > soon as a cluster job starts successfully. That can be helpful for
>> > code that launches Spark jobs but monitors the job through different
>> > means.
>> >
>> > On Thu, Sep 18, 2014 at 7:37 AM, Sandy Ryza 
>> wrote:
>> >> Hi Tobias,
>> >>
>> >> YARN cluster mode should have the behavior you're looking for.  The
>> client
>> >> process will stick around to report on things, but should be able to be
>> >> killed without affecting the application.  If this isn't the behavior
>> you're
>> >> observing, and your application isn't failing for a different reason,
>> >> there's a bug.
>> >>
>> >> -Sandy
>> >>
>> >> On Thu, Sep 18, 2014 at 10:20 AM, Nicholas Chammas
>> >>  wrote:
>> >>>
>> >>> Dunno about having the application be independent of whether
>> spark-submit
>> >>> is still alive, but you can have spark-submit run in a new session in
>> Linux
>> >>> using setsid.
>> >>>
>> >>> That way even if you terminate your SSH session, spark-submit will
>> keep
>> >>> running independently. Of course, if you terminate the host running
>> >>> spark-submit, you will still have problems.
>> >>>
>> >>>
>> >>> On Thu, Sep 18, 2014 at 4:19 AM, Tobias Pfeiffer 
>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I am wondering: Is it possible to run spark-submit in a mode where it
>> >>>> will start an application on a YARN cluster (i.e., driver and
>> executors run
>> >>>> on the cluster) and then forget about it in the sense that the Spark
>> >>>> application is completely independent from the host that ran the
>> >>>> spark-submit command and will not be affected if that controlling
>> machine
>> >>>> shuts down etc.? I was using spark-submit with YARN in cluster mode,
>> but
>> >>>> spark-submit stayed in the foreground and as far as I understood, it
>> >>>> terminated the application on the cluster when spark-submit was
>> Ctrl+C'ed.
>> >>>>
>> >>>> Thanks
>> >>>> Tobias
>> >>>
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > Marcelo
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Why recommend 2-3 tasks per CPU core ?

2014-09-22 Thread Nicholas Chammas
On Tue, Sep 23, 2014 at 1:58 AM, myasuka  wrote:

> Thus I want to know why  recommend
> 2-3 tasks per CPU core?
>

You want at least 1 task per core so that you fully utilize the cluster's
parallelism.

You want 2-3 tasks per core so that tasks are a bit smaller than they would
otherwise be, making them shorter and more likely to complete successfully.
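
As a concrete illustration (the cluster size and path here are invented), you
might size partitions like this in PySpark:

    # Say the cluster has 10 executors with 4 cores each.
    total_cores = 10 * 4

    lines = sc.textFile("hdfs:///data/big-input", minPartitions=total_cores * 3)
    # or, for an RDD you already have:
    repartitioned = lines.repartition(total_cores * 3)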

Nick


Re: parquetFile and wilcards

2014-09-24 Thread Nicholas Chammas
Does it make sense for us to open a JIRA to track enhancing the Parquet
input format to support wildcards? Or is this something outside of Spark's
control?

Nick

On Wed, Sep 24, 2014 at 1:01 PM, Michael Armbrust 
wrote:

> This behavior is inherited from the parquet input format that we use.  You
> could list the files manually and pass them as a comma separated list.
>
> On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier  wrote:
>
>> Hello,
>>
>> sc.textFile and so on support wildcards in their path, but apparently
>> sqlc.parquetFile() does not. I always receive “File
>> /file/to/path/*/input.parquet does not exist". Is this normal or a bug? Is
>> there are a workaround?
>>
>> Thanks
>> - Marius
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark SQL use of alias in where clause

2014-09-25 Thread Nicholas Chammas
That is correct. Aliases in the SELECT clause can only be referenced in the
ORDER BY and HAVING clauses. Otherwise, you'll have to just repeat the
expression, as with concat() in this case.

A more elegant alternative, which is probably not available in Spark SQL
yet, is to use Common Table Expressions.
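
In Spark SQL terms, the repeated-expression workaround Du showed looks like
this from PySpark (a sketch, assuming a HiveContext for concat() and a
registered src table):

    results = hiveContext.sql("""
        SELECT key, value, concat(key, value) AS combined
        FROM src
        WHERE concat(key, value) LIKE '11%'
        ORDER BY combined
    """)
    results.collect()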

On Wed, Sep 24, 2014 at 11:32 PM, Yanbo Liang  wrote:

> Maybe it's the way SQL works.
> The select part is executed after the where filter is applied, so you
> cannot use alias declared in select part in where clause.
> Hive and Oracle behavior the same as Spark SQL.
>
> 2014-09-25 8:58 GMT+08:00 Du Li :
>
>>   Hi,
>>
>>  The following query does not work in Shark nor in the new Spark
>> SQLContext or HiveContext.
>> SELECT key, value, concat(key, value) as combined from src where combined
>> like ’11%’;
>>
>>  The following tweak of syntax works fine although a bit ugly.
>> SELECT key, value, concat(key, value) as combined from src where
>> concat(key,value) like ’11%’ order by combined;
>>
>>  Are you going to support alias in where clause soon?
>>
>>  Thanks,
>> Du
>>
>
>


Re: problem with spark-ec2 launch script Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-26 Thread Nicholas Chammas
Are you able to use the regular PySpark shell on your EC2 cluster? That
would be the first thing to confirm is working.

I don’t know whether the version of Python on the cluster would affect
whether IPython works or not, but if you want to try manually upgrading
Python on a cluster launched by spark-ec2, there are some instructions in
the comments here for doing so.

Nick
​

On Fri, Sep 26, 2014 at 2:18 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Davies
>
> The real issue is about cluster management. I am new to the spark world
> and am not a system administrator.  It seem like the problem is with the
> spark-ec2 launch script. It is installing  old version of python
>
> In the mean time I am trying to figure out how I can manually install the
> correct version on all the machines in my cluster
>
> Thanks
>
> Andy
>
> From: Davies Liu 
> Date: Thursday, September 25, 2014 at 9:58 PM
> To: Andrew Davidson 
> Cc: "user@spark.apache.org" 
> Subject: Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found
>
> Maybe you have Python 2.7 on master but Python 2.6 in cluster,
> you should upgrade python to 2.7 in cluster, or use python 2.6 in
> master by set PYSPARK_PYTHON=python2.6
>
> On Thu, Sep 25, 2014 at 5:11 PM, Andy Davidson
>  wrote:
>
> Hi
>
> I am running into trouble using iPython notebook on my cluster. Use the
> following command to set the cluster up
>
> $ ./spark-ec2 --key-pair=$KEY_PAIR --identity-file=$KEY_FILE
> --region=$REGION --slaves=$NUM_SLAVES launch $CLUSTER_NAME
>
>
> On master I launch python as follows
>
> $ IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000"
> $SPARK_HOME/bin/pyspark
>
>
> It looks like the problem is the cluster is using an old version of python
> and python. Any idea how I can easily upgrade ? The following version works
> on my mac
>
> Thanks
>
> Andy
>
> {'commit_hash': '681fd77',
>   'commit_source': 'installation',
>   'default_encoding': 'UTF-8',
>   'ipython_path': '/Library/Python/2.7/site-packages/IPython',
>   'ipython_version': '2.1.0',
>   'os_name': 'posix',
>   'platform': 'Darwin-13.3.0-x86_64-i386-64bit',
>   'sys_executable': '/usr/bin/python',
>   'sys_platform': 'darwin',
>   'sys_version': '2.7.5 (default, Mar  9 2014, 22:15:05) \n[GCC 4.2.1
> Compatible Apple LLVM 5.0 (clang-500.0.68)]’}
>
>
>
>
>
>


Re: iPython notebook ec2 cluster matlabplot not found?

2014-09-27 Thread Nicholas Chammas
Can you first confirm that the regular PySpark shell works on your cluster,
without upgrading to 2.7? That is, you log on to your master using spark-ec2
login and run bin/pyspark successfully without any special flags.

And as far as I can tell, you should be able to use IPython at 2.6, so I’d
next confirm that that is working before throwing the 2.7 upgrade into the
mix.

Also, when upgrading or installing things, try doing so for all the nodes
in your cluster using pssh. If you install stuff just on the master without
somehow transferring it to the slaves, that will be problematic.

Finally, there is an open pull request related to IPython that may be
relevant, though I haven’t looked at it too closely.

Nick
​

On Sat, Sep 27, 2014 at 7:33 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi
>
> I am having a heck of time trying to get python to work correctly on my
> cluster created using  the spark-ec2 script
>
> The following link was really helpful
> https://issues.apache.org/jira/browse/SPARK-922
>
>
> I am still running into problem with matplotlib. (it works fine on my
> mac). I can not figure out how to get libagg, freetype, or Qhull
> dependencies installed.
>
> Has anyone else run into this problem?
>
> Thanks
>
> Andy
>
> sudo yum install freetype-devel
>
> sudo yum install libpng-devel
>
> sudo pip2.7 install six
>
> sudo pip2.7 install python-dateutil
>
> sudo pip2.7 install pyparsing
>
> sudo pip2.7 install pycxx
>
>
> sudo pip2.7 install matplotlib
>
> ec2-user@ip-172-31-15-87 ~]$ sudo pip2.7 install matplotlib
>
> Downloading/unpacking matplotlib
>
>   Downloading matplotlib-1.4.0.tar.gz (51.2MB): 51.2MB downloaded
>
>   Running setup.py (path:/tmp/pip_build_root/matplotlib/setup.py) egg_info
> for package matplotlib
>
>
> 
>
> Edit setup.cfg to change the build options
>
>
>
> BUILDING MATPLOTLIB
>
> matplotlib: yes [1.4.0]
>
> python: yes [2.7.5 (default, Sep 15 2014, 17:30:20)
> [GCC
>
> 4.8.2 20140120 (Red Hat 4.8.2-16)]]
>
>   platform: yes [linux2]
>
>
>
> REQUIRED DEPENDENCIES AND EXTENSIONS
>
>  numpy: yes [version 1.9.0]
>
>six: yes [using six version 1.8.0]
>
>   dateutil: yes [using dateutil version 2.2]
>
>tornado: yes [using tornado version 4.0.2]
>
>  pyparsing: yes [using pyparsing version 2.0.2]
>
>  pycxx: yes [Couldn't import.  Using local copy.]
>
> libagg: yes [pkg-config information for 'libagg' could
> not
>
> be found. Using local copy.]
>
>   freetype: no  [Requires freetype2 2.4 or later.  Found
>
> 2.3.11.]
>
>png: yes [version 1.2.49]
>
>  qhull: yes [pkg-config information for 'qhull' could
> not be
>
> found. Using local copy.]
>
>
>
> OPTIONAL SUBPACKAGES
>
>sample_data: yes [installing]
>
>   toolkits: yes [installing]
>
>  tests: yes [using nose version 1.3.4 / mock is
> required to
>
> run the matplotlib test suite.
> pip/easy_install may
>
> attempt to install it after matplotlib.]
>
> toolkits_tests: yes [using nose version 1.3.4 / mock is
> required to
>
> run the matplotlib test suite.
> pip/easy_install may
>
> attempt to install it after matplotlib.]
>
>
>
> OPTIONAL BACKEND EXTENSIONS
>
> macosx: no  [Mac OS-X only]
>
> qt5agg: no  [PyQt5 not found]
>
> qt4agg: no  [PyQt4 not found]
>
> pyside: no  [PySide not found]
>
>gtk3agg: no  [Requires pygobject to be installed.]
>
>  gtk3cairo: no  [Requires cairocffi or pycairo to be
> installed.]
>
> gtkagg: no  [Requires pygtk]
>
>  tkagg: no  [TKAgg requires Tkinter.]
>
>  wxagg: no  [requires wxPython]
>
>gtk: no  [Requires pygtk]
>
>agg: yes [installing]
>
>  cairo: no  [cairocffi or pycairo not found]
>
>  windowing: no  [Microsoft Windows only]
>
>
>
> OPTIONAL LATEX DEPENDENCIES
>
> dvipng: no
>
>ghostscript: yes [version 8.70]
>
>  latex: yes [version 3.141592]
>
>pdftops: no
>
>
>
>
> 
>
> * The following required packages can not be
> built:
>
> * freetype
>

Re: S3 - Extra $_folder$ files for every directory node

2014-09-30 Thread Nicholas Chammas
Those files are created by the Hadoop API that Spark leverages. Spark does
not directly control that.

You may be able to check with the Hadoop project on whether they are
looking at changing this behavior. I believe it was introduced because S3
at one point required it, though it doesn't anymore.

On Tue, Sep 30, 2014 at 10:43 AM, pouryas  wrote:

> I would like to know a way for not adding those $_folder$ files to S3 as
> well. I can go ahead and delete them but it would be nice if Spark handles
> this for you.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-tp15078p15402.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Are you trying to do something along the lines of what's described here?
https://issues.apache.org/jira/browse/SPARK-3533

On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini 
wrote:

> Hi,
>
> I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with
> MultipleTextOutputFormat,:
>
> outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class,
> MultipleTextOutputFormat.class);
>
> but I'm getting this compilation error:
>
> Bound mismatch: The generic method saveAsNewAPIHadoopFile(String,
> Class, Class, Class) of type JavaPairRDD is not
> applicable for the arguments (String, Class, Class,
> Class). The inferred type
> MultipleTextOutputFormat is not a valid substitute for the bounded
> parameter >
>
> I bumped into some discussions suggesting to use MultipleOutputs
> (
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
> ),
> but this also fails from the same reason.
>
> Would love some assistance :)
>
> Thanks,
> Tomer
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Not that I'm aware of. I'm looking for a work-around myself!

On Wed, Oct 1, 2014 at 11:15 AM, Tomer Benyamini 
wrote:

> Yes exactly.. so I guess this is still an open request. Any workaround?
>
> On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas
>  wrote:
> > Are you trying to do something along the lines of what's described here?
> > https://issues.apache.org/jira/browse/SPARK-3533
> >
> > On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini 
> > wrote:
> >>
> >> Hi,
> >>
> >> I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with
> >> MultipleTextOutputFormat,:
> >>
> >> outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class,
> >> MultipleTextOutputFormat.class);
> >>
> >> but I'm getting this compilation error:
> >>
> >> Bound mismatch: The generic method saveAsNewAPIHadoopFile(String,
> >> Class, Class, Class) of type JavaPairRDD is not
> >> applicable for the arguments (String, Class, Class,
> >> Class). The inferred type
> >> MultipleTextOutputFormat is not a valid substitute for the bounded
> >> parameter >
> >>
> >> I bumped into some discussions suggesting to use MultipleOutputs
> >>
> >> (
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
> ),
> >> but this also fails from the same reason.
> >>
> >> Would love some assistance :)
> >>
> >> Thanks,
> >> Tomer
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
There is this thread on Stack Overflow
<http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>
about
the same topic, which you may find helpful.

On Wed, Oct 1, 2014 at 11:17 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Not that I'm aware of. I'm looking for a work-around myself!
>
> On Wed, Oct 1, 2014 at 11:15 AM, Tomer Benyamini 
> wrote:
>
>> Yes exactly.. so I guess this is still an open request. Any workaround?
>>
>> On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas
>>  wrote:
>> > Are you trying to do something along the lines of what's described here?
>> > https://issues.apache.org/jira/browse/SPARK-3533
>> >
>> > On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini 
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with
>> >> MultipleTextOutputFormat,:
>> >>
>> >> outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class,
>> >> MultipleTextOutputFormat.class);
>> >>
>> >> but I'm getting this compilation error:
>> >>
>> >> Bound mismatch: The generic method saveAsNewAPIHadoopFile(String,
>> >> Class, Class, Class) of type JavaPairRDD is not
>> >> applicable for the arguments (String, Class, Class,
>> >> Class). The inferred type
>> >> MultipleTextOutputFormat is not a valid substitute for the bounded
>> >> parameter >
>> >>
>> >> I bumped into some discussions suggesting to use MultipleOutputs
>> >>
>> >> (
>> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
>> ),
>> >> but this also fails from the same reason.
>> >>
>> >> Would love some assistance :)
>> >>
>> >> Thanks,
>> >> Tomer
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>
>


Re: Strategies for reading large numbers of files

2014-10-02 Thread Nicholas Chammas
I believe this is known as the "Hadoop Small Files Problem", and it affects
Spark as well. The best approach I've seen to merging small files like this
is by using s3distcp, as suggested here, as a pre-processing step.

It would be great if Spark could somehow handle this common situation out
of the box, but for now I don't think it does.

Nick

On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn  wrote:

> Hello, I'm trying to use Spark to process a large number of files in S3.
> I'm running into an issue that I believe is related to the high number of
> files, and the resources required to build the listing within the driver
> program. If anyone in the Spark community can provide insight or guidance,
> it would be greatly appreciated.
>
> The task at hand is to read ~100 million files stored in S3, and
> repartition the data into a sensible number of files (perhaps 1,000). The
> files are organized in a directory structure like so:
>
>
> s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
>
> (Note that each file is very small, containing 1-10 records each.
> Unfortunately this is an artifact of the upstream systems that put data in
> S3.)
>
> My Spark program is simple, and works when I target a relatively specific
> subdirectory. For example:
>
>
> sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
>
> This targets 1 hour's worth of purchase records, containing about 10,000
> files. The driver program blocks (I assume it is making S3 calls to
> traverse the directories), and during this time no activity is visible in
> the driver UI. After about a minute, the stages and tasks allocate in the
> UI, and then everything progresses and completes within a few minutes.
>
> I need to process all the data (several year's worth). Something like:
>
>
> sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
>
> This blocks "forever" (I have only run the program for as long as
> overnight). The stages and tasks never appear in the UI. I assume Spark is
> building the file listing, which will either take too long and/or cause the
> driver to eventually run out of memory.
>
> I would appreciate any comments or suggestions. I'm happy to provide more
> information if that would be helpful.
>
> Thanks
>
> Landon
>
>


Re: Strategies for reading large numbers of files

2014-10-06 Thread Nicholas Chammas
Unfortunately not. Again, I wonder if adding support targeted at this
"small files problem" would make sense for Spark core, as it is a common
problem in our space.

Right now, I don't know of any other options.

Nick


On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn  wrote:

> Nicholas, thanks for the tip. Your suggestion certainly seemed like the
> right approach, but after a few days of fiddling I've come to the
> conclusion that s3distcp will not work for my use case. It is unable to
> flatten directory hierarchies, which I need because my source directories
> contain hour/minute/second parts.
>
> See https://forums.aws.amazon.com/message.jspa?messageID=478960. It seems
> that s3distcp can only combine files in the same path.
>
> Thanks again. That gave me a lot to go on. Any further suggestions?
>
> L
>
>
> On Thu, Oct 2, 2014 at 4:15 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I believe this is known as the "Hadoop Small Files Problem", and it
>> affects Spark as well. The best approach I've seen to merging small files
>> like this is by using s3distcp, as suggested here
>> <http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/>,
>> as a pre-processing step.
>>
>> It would be great if Spark could somehow handle this common situation out
>> of the box, but for now I don't think it does.
>>
>> Nick
>>
>> On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn  wrote:
>>
>>> Hello, I'm trying to use Spark to process a large number of files in S3.
>>> I'm running into an issue that I believe is related to the high number of
>>> files, and the resources required to build the listing within the driver
>>> program. If anyone in the Spark community can provide insight or guidance,
>>> it would be greatly appreciated.
>>>
>>> The task at hand is to read ~100 million files stored in S3, and
>>> repartition the data into a sensible number of files (perhaps 1,000). The
>>> files are organized in a directory structure like so:
>>>
>>>
>>> s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
>>>
>>> (Note that each file is very small, containing 1-10 records each.
>>> Unfortunately this is an artifact of the upstream systems that put data in
>>> S3.)
>>>
>>> My Spark program is simple, and works when I target a relatively
>>> specific subdirectory. For example:
>>>
>>>
>>> sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
>>>
>>> This targets 1 hour's worth of purchase records, containing about 10,000
>>> files. The driver program blocks (I assume it is making S3 calls to
>>> traverse the directories), and during this time no activity is visible in
>>> the driver UI. After about a minute, the stages and tasks allocate in the
>>> UI, and then everything progresses and completes within a few minutes.
>>>
>>> I need to process all the data (several year's worth). Something like:
>>>
>>>
>>> sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
>>>
>>> This blocks "forever" (I have only run the program for as long as
>>> overnight). The stages and tasks never appear in the UI. I assume Spark is
>>> building the file listing, which will either take too long and/or cause the
>>> driver to eventually run out of memory.
>>>
>>> I would appreciate any comments or suggestions. I'm happy to provide
>>> more information if that would be helpful.
>>>
>>> Thanks
>>>
>>> Landon
>>>
>>>
>>
>
>
> --
> *Landon Kuhn*, *Software Architect*, Janrain, Inc. <http://bit.ly/cKKudR>
> E: lan...@janrain.com | M: 971-645-5501 | F: 888-267-9025
> Follow Janrain: Facebook <http://bit.ly/9CGHdf> | Twitter
> <http://bit.ly/9umxlK> | YouTube <http://bit.ly/N0OiBT> | LinkedIn
> <http://bit.ly/a7WZMC> | Blog <http://bit.ly/OI2uOR>
> Follow Me: LinkedIn <http://www.linkedin.com/in/landonkuhn>
>
> -
> *Acquire, understand, and engage your users. Watch our video
> <http://bit.ly/janrain-overview> or sign up for a live demo
> <http://bit.ly/janraindemo> to see what it's all about.*
>


Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread Nicholas Chammas
Yup, though to be clear, Josh reverted a change to a hosted script that
spark-ec2 references. The spark-ec2 script y’all are running locally hasn’t
changed, obviously.
​

On Wed, Oct 8, 2014 at 12:20 PM, mrm  wrote:

> They reverted to a previous version of the spark-ec2 script and things are
> working again!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-HDFS-doesn-t-start-on-AWS-EC2-cluster-tp15921p15945.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Dedup

2014-10-08 Thread Nicholas Chammas
Multiple values may be different, yet still be considered duplicates
depending on how the dedup criteria are selected. Is that correct? Do you
care in that case what value you select for a given key?

On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.)  wrote:

>  I need to do deduplication processing in Spark. The current plan is to
> generate a tuple where key is the dedup criteria and value is the original
> input. I am thinking to use reduceByKey to discard duplicate values. If I
> do that, can I simply return the first argument or should I return a copy
> of the first argument. Is there are better way to do dedup in Spark?
>
>
>
> -Yao
>


Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread Nicholas Chammas
Right now I believe the only supported option is to pass a comma-delimited
list of paths.
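For example, something along these lines (the paths are made up, and I
haven't tested this exact snippet):

paths = ",".join([
    "hdfs:///data/2014-10-20/part-00000.parquet",
    "hdfs:///data/2014-10-20/part-00001.parquet",
])
parquetFile = sqlContext.parquetFile(paths)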

I've opened SPARK-3928: Support wildcard matches on Parquet files
<https://issues.apache.org/jira/browse/SPARK-3928> to request this feature.

Nick

On Mon, Oct 13, 2014 at 12:21 PM, Sadhan Sood  wrote:

> How can we read all parquet files in a directory in spark-sql. We are
> following this example which shows a way to read one file:
>
> // Read in the parquet file created above. Parquet files are self-describing
> // so the schema is preserved.
> // The result of loading a Parquet file is also a SchemaRDD.
> val parquetFile = sqlContext.parquetFile("people.parquet")
>
> // Parquet files can also be registered as tables and then used in SQL statements.
> parquetFile.registerTempTable("parquetFile")
>
>


Re: parquetFile and wildcards

2014-10-13 Thread Nicholas Chammas
SPARK-3928: Support wildcard matches on Parquet files
<https://issues.apache.org/jira/browse/SPARK-3928>

On Wed, Sep 24, 2014 at 2:14 PM, Michael Armbrust 
wrote:

> We could certainly do this.  The comma separated support is something I
> added.
>
> On Wed, Sep 24, 2014 at 10:20 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Does it make sense for us to open a JIRA to track enhancing the Parquet
>> input format to support wildcards? Or is this something outside of Spark's
>> control?
>>
>> Nick
>>
>> On Wed, Sep 24, 2014 at 1:01 PM, Michael Armbrust > > wrote:
>>
>>> This behavior is inherited from the parquet input format that we use.
>>> You could list the files manually and pass them as a comma separated list.
>>>
>>> On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> sc.textFile and so on support wildcards in their path, but apparently
>>>> sqlc.parquetFile() does not. I always receive “File
>>>> /file/to/path/*/input.parquet does not exist". Is this normal or a bug? Is
>>>> there are a workaround?
>>>>
>>>> Thanks
>>>> - Marius
>>>>
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>


Re: Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Nicholas Chammas
Oh, that's a straight reversal from their position up until earlier this
year.

Was there an announcement explaining the change in recommendation?

Nick

On Mon, Oct 13, 2014 at 4:54 PM, Daniil Osipov 
wrote:

> Not directly related, but FWIW, EMR seems to back away from s3n usage:
>
> "Previously, Amazon EMR used the S3 Native FileSystem with the URI
> scheme, s3n. While this still works, we recommend that you use the s3 URI
> scheme for the best performance, security, and reliability."
>
>
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
>
>
> On Mon, Oct 13, 2014 at 1:42 PM, Nick Chammas 
> wrote:
>
>> Cross posting an interesting question on Stack Overflow.
>>
>> Nick
>>
>>
>> --
>> View this message in context: Multipart uploads to Amazon S3 from Apache
>> Spark
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: mllib.linalg.Vectors vs Breeze?

2014-10-17 Thread Nicholas Chammas
I don't know the answer for sure, but just from an API perspective I'd
guess that the Spark authors don't want to tie their API to Breeze. If at a
future point they swap out a different implementation for Breeze, they
don't have to change their public interface. MLlib's interface remains
consistent while the internals are free to evolve.

Nick


On Friday, October 17, 2014, ll wrote:

> hello... i'm looking at the source code for mllib.linalg.Vectors and it
> looks
> like it's a wrapper around Breeze with very small changes (mostly changing
> the names).
>
> i don't have any problem with using spark wrapper around Breeze or Breeze
> directly.  i'm just curious to understand why this wrapper was created vs.
> pointing everyone to Breeze directly?
>
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mllib-linalg-Vectors-vs-Breeze-tp16722.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>


Re: input split size

2014-10-18 Thread Nicholas Chammas
Side note: I thought bzip2 was splittable. Perhaps you meant gzip?
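To make Aaron's points below concrete, a quick sketch (the path and the
partition counts are made up):

rdd = sc.textFile("hdfs:///data/file.txt", minPartitions=4)  # cannot go below the number of HDFS blocks (hence the "min")
fewer = rdd.coalesce(2, shuffle=False)   # concatenates co-located partitions; no ordering guarantee
uniform = rdd.coalesce(8, shuffle=True)  # full shuffle; evens out the data distribution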

On Saturday, October 18, 2014, Aaron Davidson wrote:

> The "minPartitions" argument of textFile/hadoopFile cannot decrease the
> number of splits past the physical number of blocks/files. So if you have 3
> HDFS blocks, asking for 2 minPartitions will still give you 3 partitions
> (hence the "min"). It can, however, convert a file with fewer HDFS blocks
> into more (so you could ask for and get 4 partitions), assuming the blocks
> are "splittable". HDFS blocks are usually splittable, but if it's
> compressed with something like bzip2, it would not be.
>
> If you wish to combine splits from a larger file, you can use
> RDD#coalesce. With shuffle=false, this will simply concatenate partitions,
> but it does not provide any ordering guarantees (it uses an algorithm which
> attempts to coalesce co-located partitions, to maintain locality
> information).
>
> coalesce() with shuffle=true causes all of the elements to be shuffled
> around randomly into new partitions, which is an expensive operation but
> guarantees uniformity of data distribution.
>
> On Sat, Oct 18, 2014 at 10:47 AM, Mayur Rustagi  > wrote:
>
>> Does it retain the order if its pulling from the hdfs blocks, meaning
>> if  file1 => a, b, c partition in order
>> if I convert to 2 partition read will it map to ab, c or a, bc or it can
>> also be a, cb ?
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>> On Sat, Oct 18, 2014 at 9:09 AM, Ilya Ganelin > > wrote:
>>
>>> Also - if you're doing a text file read you can pass the number of
>>> resulting partitions as the second argument.
>>> On Oct 17, 2014 9:05 PM, "Larry Liu" >> > wrote:
>>>
 Thanks, Andrew. What about reading out of local?

 On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash >>> > wrote:

> When reading out of HDFS it's the HDFS block size.
>
> On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu  > wrote:
>
>> What is the default input split size? How to change it?
>>
>
>

>>
>


Re: How to show RDD size

2014-10-20 Thread Nicholas Chammas
I believe it won't show up there until you trigger an action that causes
the RDD to actually be cached. Remember that certain operations in Spark
are *lazy*, and caching is one of them.
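For a quick illustration (nothing appears in the Storage tab until the
action runs):

input = sc.textFile("hdfs://192.168.1.10/people/testinput/")
input.cache()    # only marks the RDD for caching; nothing is computed yet
input.count()    # the first action materializes the RDD, and that's when it gets cached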

Nick

On Mon, Oct 20, 2014 at 9:19 AM, marylucy  wrote:

> In spark-shell, I do as follows:
> val input = sc.textFile("hdfs://192.168.1.10/people/testinput/")
> input.cache()
>
> In the web UI, I cannot see any RDD in the Storage tab. Can anyone tell me how to
> show RDD size? Thank you.
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to show RDD size

2014-10-20 Thread Nicholas Chammas
No, I believe unpersist acts immediately.

On Mon, Oct 20, 2014 at 10:13 AM, marylucy 
wrote:

> Thank you for your reply!
> Is the unpersist operation lazy? If yes, how can I decrease memory size as quickly
> as possible?
>
> On Oct 20, 2014, at 21:26, "Nicholas Chammas"  wrote:
>
> I believe it won't show up there until you trigger an action that causes
> the RDD to actually be cached. Remember that certain operations in Spark
> are *lazy*, and caching is one of them.
>
> Nick
>
> On Mon, Oct 20, 2014 at 9:19 AM, marylucy 
> wrote:
>
>> In spark-shell, I do as follows:
>> val input = sc.textFile("hdfs://192.168.1.10/people/testinput/")
>> input.cache()
>>
>> In the web UI, I cannot see any RDD in the Storage tab. Can anyone tell me how to
>> show RDD size? Thank you.
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Perhaps your RDD is not partitioned enough to utilize all the cores in your
system.

Could you post a simple code snippet and explain what kind of parallelism
you are seeing for it? And can you report on how many partitions your RDDs
have?
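For example, from PySpark something like this would be a useful data point
(the path is a placeholder):

rdd = sc.textFile("s3n://bucket/some/path/")
print(rdd.getNumPartitions())   # how many partitions the input actually produced
print(sc.defaultParallelism)    # how many cores Spark believes it can use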

On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler  wrote:

>
> I am launching EC2 clusters using the spark-ec2 scripts.
> My understanding is that this configures spark to use the available
> resources.
> I can see that spark will use the available memory on larger instance types.
> However I have never seen spark running at more than 400% (using 100% on 4
> cores)
> on machines with many more cores.
> Am I misunderstanding the docs? Is it just that high end ec2 instances get
> I/O starved when running spark? It would be strange if that consistently
> produced a 400% hard limit though.
>
> thanks
> Daniel
>


Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Are you dealing with gzipped files by any chance? Does explicitly
repartitioning your RDD to match the number of cores in your cluster help
at all? How about if you don't specify the configs you listed and just go
with defaults all around?

On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler  wrote:

> I launch the cluster using vanilla spark-ec2 scripts.
> I just specify the number of slaves and instance type
>
> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler  wrote:
>
>> I usually run interactively from the spark-shell.
>> My data definitely has more than enough partitions to keep all the
>> workers busy.
>> When I first launch the cluster I first do:
>>
>> +
>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>> spark.serializer                org.apache.spark.serializer.KryoSerializer
>> spark.rdd.compress  true
>> spark.shuffle.consolidateFiles  true
>> spark.akka.frameSize  20
>> EOF
>>
>> copy-dir /root/spark/conf
>> spark/sbin/stop-all.sh
>> sleep 5
>> spark/sbin/start-all.sh
>> +++++++++
>>
>> before starting the spark-shell or running any jobs.
>>
>>
>>
>>
>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>>> your system.
>>>
>>> Could you post a simple code snippet and explain what kind of
>>> parallelism you are seeing for it? And can you report on how many
>>> partitions your RDDs have?
>>>
>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler 
>>> wrote:
>>>
>>>>
>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>> My understanding is that this configures spark to use the available
>>>> resources.
>>>> I can see that spark will use the available memory on larger instance
>>>> types.
>>>> However I have never seen spark running at more than 400% (using 100%
>>>> on 4 cores)
>>>> on machines with many more cores.
>>>> Am I misunderstanding the docs? Is it just that high end ec2 instances
>>>> get I/O starved when running spark? It would be strange if that
>>>> consistently produced a 400% hard limit though.
>>>>
>>>> thanks
>>>> Daniel
>>>>
>>>
>>>
>>
>


Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
The biggest danger with gzipped files is this:

>>> raw = sc.textFile("/path/to/file.gz", 8)
>>> raw.getNumPartitions()
1

You think you’re telling Spark to parallelize the reads on the input, but
Spark cannot parallelize reads against gzipped files. So 1 gzipped file
gets assigned to 1 partition.
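The usual workaround, at the cost of one up-front shuffle, is to repartition
right after the read (the partition count below is just a placeholder;
roughly 2-4x your total cores is a reasonable starting point):

raw = sc.textFile("/path/to/file.gz")   # comes back as a single partition
spread = raw.repartition(32)            # spreads the data so later stages can use every core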

It might be a nice user hint if Spark warned when parallelism is disabled
by the input format.

Nick
​

On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler  wrote:

> Hi Nicholas,
>
> Gzipping is an impressive guess! Yes, they are.
> My data sets are too large to make repartitioning viable, but I could try
> it on a subset.
> I generally have many more partitions than cores.
> This was happenning before I started setting those configs.
>
> thanks
> Daniel
>
>
> On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Are you dealing with gzipped files by any chance? Does explicitly
>> repartitioning your RDD to match the number of cores in your cluster help
>> at all? How about if you don't specify the configs you listed and just go
>> with defaults all around?
>>
>> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler  wrote:
>>
>>> I launch the cluster using vanilla spark-ec2 scripts.
>>> I just specify the number of slaves and instance type
>>>
>>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler 
>>> wrote:
>>>
>>>> I usually run interactively from the spark-shell.
>>>> My data definitely has more than enough partitions to keep all the
>>>> workers busy.
>>>> When I first launch the cluster I first do:
>>>>
>>>> +
>>>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>>>> spark.serializer                org.apache.spark.serializer.KryoSerializer
>>>> spark.rdd.compress  true
>>>> spark.shuffle.consolidateFiles  true
>>>> spark.akka.frameSize  20
>>>> EOF
>>>>
>>>> copy-dir /root/spark/conf
>>>> spark/sbin/stop-all.sh
>>>> sleep 5
>>>> spark/sbin/start-all.sh
>>>> +
>>>>
>>>> before starting the spark-shell or running any jobs.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Perhaps your RDD is not partitioned enough to utilize all the cores in
>>>>> your system.
>>>>>
>>>>> Could you post a simple code snippet and explain what kind of
>>>>> parallelism you are seeing for it? And can you report on how many
>>>>> partitions your RDDs have?
>>>>>
>>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>>> My understanding is that this configures spark to use the available
>>>>>> resources.
>>>>>> I can see that spark will use the available memory on larger instance
>>>>>> types.
>>>>>> However I have never seen spark running at more than 400% (using 100%
>>>>>> on 4 cores)
>>>>>> on machines with many more cores.
>>>>>> Am I misunderstanding the docs? Is it just that high end ec2
>>>>>> instances get I/O starved when running spark? It would be strange if that
>>>>>> consistently produced a 400% hard limit though.
>>>>>>
>>>>>> thanks
>>>>>> Daniel
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
What version of Spark are you running? Some recent changes
<https://spark.apache.org/releases/spark-release-1-1-0.html> to how PySpark
works relative to Scala Spark may explain things.

PySpark should not be that much slower, not by a stretch.

On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab  wrote:

> I'm no expert, but looked into how the python bits work a while back (was
> trying to assess what it would take to add F# support). It seems python
> hosts a jvm inside of it, and talks to "scala spark" in that jvm. The
> python server bit "translates" the python calls to those in the jvm. The
> python spark context is like an adapter to the jvm spark context. If you're
> seeing performance discrepancies, this might be the reason why. If the code
> can be organised to require fewer interactions with the adapter, that may
> improve things. Take this with a pinch of salt...I might be way off on this
> :)
>
> Cheers,
> Ashic.
>
> > From: mps@gmail.com
> > Subject: Python vs Scala performance
> > Date: Wed, 22 Oct 2014 12:00:41 +0200
> > To: user@spark.apache.org
>
> >
> > Hi there,
> >
> > we have a small Spark cluster running and are processing around 40 GB of
> Gzip-compressed JSON data per day. I have written a couple of word
> count-like Scala jobs that essentially pull in all the data, do some joins,
> group bys and aggregations. A job takes around 40 minutes to complete.
> >
> > Now one of the data scientists on the team wants to write some jobs
> using Python. To learn Spark, he rewrote one of my Scala jobs in Python.
> From the API-side, everything looks more or less identical. However his
> jobs take between 5-8 hours to complete! We can also see that the execution
> plan is quite different, I’m seeing writes to the output much later than in
> Scala.
> >
> > Is Python I/O really that slow?
> >
> >
> > Thanks
> > - Marius
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
Total guess without knowing anything about your code: Do either of these
two notes from the 1.1.0 release notes
<http://spark.apache.org/releases/spark-release-1-1-0.html> affect things
at all?


   - PySpark now performs external spilling during aggregations. Old
   behavior can be restored by setting spark.shuffle.spill to false.
   - PySpark uses a new heuristic for determining the parallelism of
   shuffle operations. Old behavior can be restored by setting
   spark.default.parallelism to the number of cores in the cluster.
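If you want to rule those out, the old behavior can be restored with
something like this (the parallelism value is a placeholder for your
cluster's core count), or via the equivalent entries in spark-defaults.conf:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.shuffle.spill", "false")      # restore pre-1.1 aggregation behavior
        .set("spark.default.parallelism", "16"))  # placeholder: number of cores in the cluster
sc = SparkContext(conf=conf)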

 Nick
​

On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier  wrote:

> We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not
> that...
>
> On 22.10.2014, at 13:02, Nicholas Chammas 
> wrote:
>
> What version of Spark are you running? Some recent changes
> <https://spark.apache.org/releases/spark-release-1-1-0.html> to how
> PySpark works relative to Scala Spark may explain things.
>
> PySpark should not be that much slower, not by a stretch.
>
> On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab  wrote:
>
>> I'm no expert, but looked into how the python bits work a while back (was
>> trying to assess what it would take to add F# support). It seems python
>> hosts a jvm inside of it, and talks to "scala spark" in that jvm. The
>> python server bit "translates" the python calls to those in the jvm. The
>> python spark context is like an adapter to the jvm spark context. If you're
>> seeing performance discrepancies, this might be the reason why. If the code
>> can be organised to require fewer interactions with the adapter, that may
>> improve things. Take this with a pinch of salt...I might be way off on this
>> :)
>>
>> Cheers,
>> Ashic.
>>
>> > From: mps@gmail.com
>> > Subject: Python vs Scala performance
>> > Date: Wed, 22 Oct 2014 12:00:41 +0200
>> > To: user@spark.apache.org
>>
>> >
>> > Hi there,
>> >
>> > we have a small Spark cluster running and are processing around 40 GB
>> of Gzip-compressed JSON data per day. I have written a couple of word
>> count-like Scala jobs that essentially pull in all the data, do some joins,
>> group bys and aggregations. A job takes around 40 minutes to complete.
>> >
>> > Now one of the data scientists on the team wants to write some jobs
>> using Python. To learn Spark, he rewrote one of my Scala jobs in Python.
>> From the API-side, everything looks more or less identical. However his
>> jobs take between 5-8 hours to complete! We can also see that the execution
>> plan is quite different, I’m seeing writes to the output much later than in
>> Scala.
>> >
>> > Is Python I/O really that slow?
>> >
>> >
>> > Thanks
>> > - Marius
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>
>
>


Re: Python vs Scala performance

2014-10-22 Thread Nicholas Chammas
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT 
wrote:

> Wild guess maybe, but do you decode the json records in Python? It could
> be much slower as the default lib is quite slow.
>
Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7
may be enough performance improvement because they merged in the fast JSON
deserializing from simplejson into the standard library. So you may not
need to use an external library like ujson, though that may help too.
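If you want to test whether JSON decoding is the bottleneck, a rough sketch
(raw_records stands in for the RDD of JSON strings; ujson would need to be
installed on every worker):

import json
# import ujson as json   # optional drop-in replacement if it's available cluster-wide

parsed = raw_records.map(json.loads)
parsed.count()   # time this against the equivalent stage in the Scala job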

Nick
​


Re: docker spark 1.1.0 cluster

2014-10-24 Thread Nicholas Chammas
Oh snap--first I've heard of this repo.

Marek,

We are having a discussion related to this on SPARK-3821
<https://issues.apache.org/jira/browse/SPARK-3821> that you may be interested in.

Nick

On Fri, Oct 24, 2014 at 5:50 PM, Marek Wiewiorka 
wrote:

> Hi,
> here you can find some info regarding 1.0:
> https://github.com/amplab/docker-scripts
>
> Marek
>
> 2014-10-24 23:38 GMT+02:00 Josh J :
>
>> Hi,
>>
>> Is there a dockerfiles available which allow to setup a docker spark
>> 1.1.0 cluster?
>>
>> Thanks,
>> Josh
>>
>
>


Re: Spark SQL Exists Clause

2014-10-26 Thread Nicholas Chammas
I believe that's correct. See:
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

On Monday, October 27, 2014, agg212 wrote:

> Hey, I'm trying to run TPC-H Query 4 (shown below), and get the following
> error:
>
> Exception in thread "main" java.lang.RuntimeException: [11.25] failure:
> ``UNION'' expected but `select' found
>
> It seems like Spark SQL doesn't support the exists clause. Is this true?
>
> select
> o_orderpriority,
> count(*) as order_count
> from
> orders
> where
> o_orderdate >= date '1993-07-01'
> and o_orderdate < date '1993-10-01'
> and exists (
> select
> *
> from
> lineitem
> where
> l_orderkey = o_orderkey
> and l_commitdate < l_receiptdate
> )
> group by
> o_orderpriority
> order by
> o_orderpriority;
>
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Exists-Clause-tp17307.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>


Re: install sbt

2014-10-28 Thread Nicholas Chammas
If you're just calling sbt from within the spark/sbt folder, it should
download and install automatically.

Nick


On Tuesday, October 28, 2014, Ted Yu wrote:

> Have you read this ?
> http://lancegatlin.org/tech/centos-6-install-sbt
>
> On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto <
> rpagli...@appcomsci.com
> > wrote:
>
>> Is there a repo or some kind of instruction about how to install sbt for
>> centos?
>>
>>
>>
>> Thanks,
>>
>>
>>
>
>


Re: Too many files open with Spark 1.1 and CDH 5.1

2014-10-31 Thread Nicholas Chammas
As Sean suggested, try out the new sort-based shuffle in 1.1 if you know
you're triggering large shuffles. That should help a lot.
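A minimal sketch of turning it on (the same setting can go in
conf/spark-defaults.conf if you're launching spark-shell or Spark SQL jobs):

from pyspark import SparkConf, SparkContext

# Spark 1.1 still defaults to the hash-based shuffle; this switches to the sort-based one.
conf = SparkConf().set("spark.shuffle.manager", "SORT")
sc = SparkContext(conf=conf)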

On Friday, October 31, 2014, Bill Q wrote:

> Hi Sean,
> Thanks for the reply. I think both driver and worker have the problem. You
> are right that the ulimit fixed the driver side too many files open error.
>
> And there is a very big shuffle. My maybe naive thought is to migrate the
> HQL scripts directly from Hive to Spark SQL and make them  work. It seems
> that it won't be that easy. Is that correct? And it seems that I had done
> that with Shark and it worked pretty well in the old days.
>
> Any suggestions if we are planning to migrate a large code base from
> Hive to Spark SQL with minimum code rewriting?
>
> Many thanks.
>
>
> Cao
>
> On Friday, October 31, 2014, Sean Owen  > wrote:
>
>> It's almost surely the workers, not the driver (shell) that have too
>> many files open. You can change their ulimit. But it's probably better
>> to see why it happened -- a very big shuffle? -- and repartition or
>> design differently to avoid it. The new sort-based shuffle might help
>> in this regard.
>>
>> On Fri, Oct 31, 2014 at 3:25 PM, Bill Q  wrote:
>> > Hi,
>> > I am trying to make Spark SQL 1.1 to work to replace part of our ETL
>> > processes that are currently done by Hive 0.12.
>> >
>> > A common problem that I have encountered is the "Too many files open"
>> error.
>> > Once that happened, the query just failed. I started the spark-shell by
>> > using "ulimit -n 4096 & spark-shell". And it still pops the same error.
>> >
>> > Any solutions?
>> >
>> > Many thanks.
>> >
>> >
>> > Bill
>> >
>> >
>> >
>> > --
>> > Many thanks.
>> >
>> >
>> > Bill
>> >
>>
>
>
> --
> Many thanks.
>
>
> Bill
>
>


Re: SQL COUNT DISTINCT

2014-10-31 Thread Nicholas Chammas
The only thing in your code that cannot be parallelized is the collect()
because -- by definition -- it collects all the results to the driver node.
This has nothing to do with the DISTINCT in your query.

What do you want to do with the results after you collect them? How many
results do you have in the output of collect?

Perhaps it makes more sense to continue operating on the RDDs you have or
saving them using one of the RDD methods, because that preserves the
cluster's ability to parallelize work.

Nick

On Friday, October 31, 2014, Bojan Kostic wrote:

> While I was testing Spark SQL I noticed that COUNT DISTINCT works really slowly.
> The map partitions phase finished fast, but the collect phase is slow.
> It only runs on a single executor.
> Should it run this way?
>
> And here is the simple code which i use for testing:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
> parquetFile.registerTempTable("parquetFile")
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
> count.map(t => t(0)).collect().foreach(println)
>
> I guess that's because the distinct process must be on a single node. But I
> wonder
> whether I can add some parallelism to the collect process.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>


Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html

cn-north-1 is not a supported region for EC2, as far as I can tell. There
may be other AWS services that can use that region, but spark-ec2 relies on
EC2.

Nick

On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao  wrote:

> Hi,
>Amazon aws started to provide service for China mainland, the region
> name is cn-north-1. But the script spark provides: spark_ec2.py will query
> ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and
> there's no ami information for cn-north-1 region .
>Can anybody update the ami information and update the reo:
> https://github.com/mesos/spark-ec2.git ?
>
>Thanks.
>
> --
> haitao.yao
>
>
>
>


Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
Oh, I can see that region via boto as well. Perhaps the doc is indeed out
of date.

Do you mind opening a JIRA issue
<https://issues.apache.org/jira/secure/Dashboard.jspa> to track this
request? I can do it if you've never opened a JIRA issue before.

Nick

On Tue, Nov 4, 2014 at 9:03 PM, haitao .yao  wrote:

> I'm afraid not. We have been using EC2 instances in cn-north-1 region for
> a while. And the latest version of boto has added the region: cn-north-1
> Here's the  screenshot:
> >>> from boto import ec2
> >>> ec2.regions()
> [RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1,
> RegionInfo:eu-west-1, RegionInfo:ap-southeast-1, RegionInfo:ap-southeast-2,
> RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1,
> RegionInfo:eu-central-1, RegionInfo:sa-east-1]
> >>>
>
> I do think the doc is out of dated.
>
>
>
> 2014-11-05 9:45 GMT+08:00 Nicholas Chammas :
>
>>
>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
>>
>> cn-north-1 is not a supported region for EC2, as far as I can tell. There
>> may be other AWS services that can use that region, but spark-ec2 relies on
>> EC2.
>>
>> Nick
>>
>> On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao  wrote:
>>
>>> Hi,
>>>Amazon aws started to provide service for China mainland, the region
>>> name is cn-north-1. But the script spark provides: spark_ec2.py will query
>>> ami id from https://github.com/mesos/spark-ec2/tree/v4/ami-list and
>>> there's no ami information for cn-north-1 region.
>>>Can anybody update the ami information and update the repo:
>>> https://github.com/mesos/spark-ec2.git ?
>>>
>>>Thanks.
>>>
>>> --
>>> haitao.yao
>>>
>>>
>>>
>>>
>>
>
>
> --
> haitao.yao
>
>
>
>


Re: Still struggling with building documentation

2014-11-07 Thread Nicholas Chammas
I believe the web docs need to be built separately according to the
instructions here.

Did you give those a shot?

It's annoying to have a separate thing with new dependencies in order to
build the web docs, but that's how it is at the moment.

Nick

On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta 
wrote:

> I finally came to realize that there is a special maven target to build
> the scaladocs, although arguably a very unintuitive one: mvn verify. So now
> I have scaladocs for each package, but not for the whole spark project.
> Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
> build/docs/api directory referenced in api.html is missing. How do I build
> it?
>
> Alex Baretta
>


Re: supported sql functions

2014-11-09 Thread Nicholas Chammas
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

On Sunday, November 9, 2014, Srinivas Chamarthi wrote:

> can anyone point me to a documentation on supported sql functions ? I am
> trying to do a contians operation on sql array type. But I don't know how
> to type the  sql.
>
> // like hive function array_contains
> select * from business where array_contains(type, "insurance")
>
>
>
> appreciate any help.
>
>


Re: Efficient way to split an input data set into different output files

2014-11-19 Thread Nicholas Chammas
I don't have a solution for you, but it sounds like you might want to
follow this issue:

SPARK-3533 <https://issues.apache.org/jira/browse/SPARK-3533> - Add
saveAsTextFileByKey() method to RDDs

On Wed Nov 19 2014 at 6:41:11 AM Tom Seddon  wrote:

> I'm trying to set up a PySpark ETL job that takes in JSON log files and
> spits out fact table files for upload to Redshift.  Is there an efficient
> way to send different event types to different outputs without having to
> just read the same cached RDD twice?  I have my first RDD which is just a
> json parsed version of the input data, and I need to create a flattened
> page views dataset off this based on eventType = 'INITIAL', and then a page
> events dataset from the same RDD based on eventType  = 'ADDITIONAL'.
> Ideally I'd like the output files for both these tables to be written at
> the same time, so I'm picturing a function with one input RDD in and two
> RDDs out, or a function utilising two CSV writers.  I'm using mapPartitions
> at the moment to write to files like this:
>
> def write_records(records):
>     output = StringIO.StringIO()
>     writer = vlad.CsvUnicodeWriter(output, dialect='excel')
>     for record in records:
>         writer.writerow(record)
>     return [output.getvalue()]
>
> and I use this in the call to write the file as follows (pageviews and
> events get created off the same json parsed RDD by filtering on INITIAL or
> ADDITIONAL respectively):
>
>
> pageviews.mapPartitions(write_records).saveAsTextFile('s3n://output/pageviews/')
> events.mapPartitions(write_records).saveAsTextFile('s3n://output/events/')
>
> Is there a way to change this so that both are written in the same process?
>


Re: Spark Streaming with Python

2014-11-26 Thread Nicholas Chammas
What version of Spark are you running? A Python API for Spark Streaming is
only available via GitHub at the moment and has not been released in any
version of Spark.

On Tue, Nov 25, 2014 at 10:23 AM, Venkat, Ankam <
ankam.ven...@centurylink.com> wrote:

>  Any idea how to resolve this?
>
>
>
> Regards,
>
> Venkat
>
>
>
> *From:* Venkat, Ankam
> *Sent:* Sunday, November 23, 2014 12:05 PM
> *To:* 'user@spark.apache.org'
> *Subject:* Spark Streaming with Python
>
>
>
> I am trying to run network_wordcount.py example mentioned at
>
>
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
>
>
>
> on CDH5.2 Quickstart VM.   Getting below error.
>
>
>
> Traceback (most recent call last):
>
>   File "/usr/lib/spark/examples/lib/network_wordcount.py", line 4, in
> 
>
> from pyspark.streaming import StreamingContext
>
> ImportError: No module named streaming.
>
>
>
> How to resolve this?
>
>
>
> Regards,
>
> Venkat
>
> This communication is the property of CenturyLink and may contain
> confidential or privileged information. Unauthorized use of this
> communication is strictly prohibited and may be unlawful. If you have
> received this communication in error, please immediately notify the sender
> by reply e-mail and destroy all copies of the communication and any
> attachments.
>


Re: Problem creating EC2 cluster using spark-ec2

2014-12-02 Thread Nicholas Chammas
Interesting. Do you have any problems when launching in us-east-1? What is
the full output of spark-ec2 when launching a cluster? (Post it to a gist
if it’s too big for email.)
​

On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis 
wrote:

> I've been trying to create a Spark cluster on EC2 using the
> documentation at https://spark.apache.org/docs/latest/ec2-scripts.html
> (with Spark 1.1.1).
>
> Running the script successfully creates some EC2 instances, HDFS etc.,
> but appears to fail to copy the actual files needed to run Spark
> across.
>
> I ran the following commands:
>
> $ cd ~/src/spark-1.1.1/ec2
> $ ./spark-ec2 --key-pair=* --identity-file=* --slaves=1
> --region=eu-west-1 --zone=eu-west-1a --instance-type=m3.medium
> --no-ganglia launch foocluster
>
> I see the following in the script's output:
>
> (instance and HDFS set up happens here)
> ...
> Persistent HDFS installed, won't start by default...
> ~/spark-ec2 ~/spark-ec2
> Setting up spark-standalone
> RSYNC'ing /root/spark/conf to slaves...
> *.eu-west-1.compute.amazonaws.com
> RSYNC'ing /root/spark-ec2 to slaves...
> *.eu-west-1.compute.amazonaws.com
> ./spark-standalone/setup.sh: line 22: /root/spark/sbin/stop-all.sh: No
> such file or directory
> ./spark-standalone/setup.sh: line 27:
> /root/spark/sbin/start-master.sh: No such file or directory
> ./spark-standalone/setup.sh: line 33:
> /root/spark/sbin/start-slaves.sh: No such file or directory
> Setting up tachyon
> RSYNC'ing /root/tachyon to slaves...
> ...
> (Tachyon setup happens here without any problem)
>
> I can ssh to the master (using the ./spark-ec2 login), and looking in
> /root/, it contains:
>
> $ ls /root
> ephemeral-hdfs  hadoop-native  mapreduce  persistent-hdfs  scala
> shark  spark  spark-ec2  tachyon
>
> If I look in /root/spark (where the sbin directory should be found),
> it only contains a single 'conf' directory:
>
> $ ls /root/spark
> conf
>
> Any idea why spark-ec2 might have failed to copy these files across?
>
> Thanks,
> Dave
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Where can you get nightly builds?

2014-12-06 Thread Nicholas Chammas
To expand on Ted's response, there are currently no nightly builds
published for users to use. You can watch SPARK-1517 (which Ted linked to)
to be updated when that happens.

On Sat Dec 06 2014 at 10:19:10 AM Ted Yu  wrote:

> See https://amplab.cs.berkeley.edu/jenkins/view/Spark/
>
> See also https://issues.apache.org/jira/browse/SPARK-1517
>
> Cheers
>
> On Sat, Dec 6, 2014 at 6:41 AM, Simone Franzini 
> wrote:
>
>> I recently read in the mailing list that there are now nightly builds
>> available. However, I can't find them anywhere. Is this really done? If so,
>> where can I get them?
>>
>> Thanks,
>> Simone Franzini, PhD
>>
>> http://www.linkedin.com/in/simonefranzini
>>
>
>


Re: Spark-AMI version compatibility table

2014-02-21 Thread Nicholas Chammas
The table formatting shows up weird on the user list web page. You can see that
same Spark-AMI compatibility table here on Google Docs.


On Fri, Feb 21, 2014 at 11:43 PM, nicholas.chammas <
nicholas.cham...@gmail.com> wrote:

> Howdy folks,
>
> I'm working through the Spark on EMR tutorial here.
> The attraction of running Spark on EMR is that it is probably the fastest
> and easiest way to get Spark running and doing something useful.
>
> I had a lot of trouble finding the right combination of Spark install
> script and EMR AMI that would give me a working cluster and Spark shell.
>
> The tutorial points to
> a 0.8.1 version of the bootstrap script and doesn't specify an AMI version.
> If you use Python/boto to complete the tutorial, this means EMR will default
> to a 1.0 AMI. This
> doesn't work and leads to errors about a missing core-site.xml file, among
> other things.
>
> Here are some other combinations I tried (up to the point of seeing if the
> Spark shell starts up successfully):
>
> Bootstrap script | AMI version | Result
>
> Spark shell doesn't start:
> s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh | 1.0 | bootstrap times out; missing core-site.xml
> s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh | 2.0 | bootstrap times out; missing EmrMetrics*.jar
> s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh | 2.1 | bootstrap times out
> s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh | 2.2 | bootstrap times out
> s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh | 2.3 | bootstrap times out
> s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh | 3.0 | bootstrap times out
> s3://elasticmapreduce/samples/spark/install-spark-shark.sh | 1.0 | Spark shell fails to initialize; failure to "load native Mesos library"
> s3://elasticmapreduce/samples/spark/install-spark-shark.sh | 2.4 | Spark shell hangs on initialization
> s3://elasticmapreduce/samples/spark/install-spark-shark.sh | 3.0 | bootstrap fails; missing dpkg
>
> Spark shell starts:
> s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh | 2.4 | success; log4j warnings
> s3://elasticmapreduce/samples/spark/install-spark-shark.sh | 2.0 | success
> s3://elasticmapreduce/samples/spark/install-spark-shark.sh | 2.1 | success
> s3://elasticmapreduce/samples/spark/install-spark-shark.sh | 2.2 | success
> s3://elasticmapreduce/samples/spark/install-spark-shark.sh | 2.3 | success
>
> Do y'all "own" these EMR bootstrap scripts, or are they provided by
> Amazon? It would be helpful if
>
>
>    1. the install script explicitly checked for a compatible AMI version,
>       and/or
>    2. there was an official compatibility table up somewhere, preferably
>       linked to from that EMR tutorial (which has high visibility on Google)
>
> I'm new both to Spark and to AWS in general. Forgive me if I'm barking up
> the wrong tree here.
>
> Nick
>
>
> --
> View this message in context: Spark-AMI version compatibility table
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: programmatic way to tell Spark version

2014-02-22 Thread Nicholas Chammas
No use case at the moment.

What prompted the question: I was going to ask a different question on this
list and wanted to note my version of Spark. I assumed there would be a
getVersion method on SparkContext or something like that, but I couldn't
find one in the docs. I also couldn't find an environment variable with the
version. After futzing around a bit I realized it was printed out (quite
conspicuously) in the shell startup banner.


On Sat, Feb 22, 2014 at 7:15 PM, Patrick Wendell  wrote:

> AFIAK - We don't have any way to do this right now. Maybe we could add
> a getVersion method to SparkContext that would tell you. Just
> wondering - what is the use case here?
>
> - Patrick
>
> On Sat, Feb 22, 2014 at 4:04 PM, nicholas.chammas
>  wrote:
> > Is there a programmatic way to tell what version of Spark I'm running?
> >
> > I know I can look at the banner when the Spark shell starts up, but I'm
> > curious to know if there's another way.
> >
> > Nick
> >
> >
> > 
> > View this message in context: programmatic way to tell Spark version
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark Quick Start - call to open README.md needs explicit fs prefix

2014-02-23 Thread Nicholas Chammas
Makes sense. Thank you.


On Sun, Feb 23, 2014 at 9:57 PM, Matei Zaharia wrote:

> Good catch; the Spark cluster on EC2 is configured to use HDFS as its
> default filesystem, so it can't find this file. The quick start was written
> to run on a single machine with an out-of-the-box install. If you'd like to
> upload this file to the HDFS cluster on EC2, use the following command:
>
> ~/ephemeral-hdfs/bin/hadoop fs -put README.md README.md
>
> Matei
>
> On Feb 23, 2014, at 6:33 PM, nicholas.chammas 
> wrote:
>
> I just deployed Spark 0.9.0 to EC2 using the guide here.
> I then turned to the Quick Start guide here and
> walked through it using the Python shell.
>
> When I do this:
>
> >>> textFile = sc.textFile("README.md")
> >>> textFile.count()
>
>
> I get a long error output right after the count() that includes this:
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> hdfs://
> ec2-my-node-address.compute-1.amazonaws.com:9000/user/root/README.md
>
> So I guess Spark assumed that the file was in HDFS.
>
> To get the file open and count to work, I had to do this:
>
> >>> textFile = sc.textFile("file:///root/spark/README.md")
> >>> textFile.count()
>
>
> I get the same results if I use the Scala shell.
>
> Does the quick start guide need to updated, or did I miss something?
>
> Nick
>
>
> --
> View this message in context: Spark Quick Start - call to open README.md
> needs explicit fs prefix
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation 
<https://github.com/PyMySQL/PyMySQL/blob/f13f054abcc18b39855a760a84be0a517f0da658/pymysql/protocol.py>
 of the MySQL client-server protocol. It does not use JDBC.


> On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan 
>  wrote:
> 
> Thanks for the advice Nicholas. 
> 
> As mentioned in the original email, I have tried JDBC + SSH Tunnel using 
> pymysql and sshtunnel and it worked fine. The problem happens only with Spark.
> 
> Thanks,
> Venkat
> 
> 
> 
> On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> This is not a question for the dev list. Moving dev to bcc.
>> 
>> One thing I would try is to connect to this database using JDBC + SSH 
>> tunnel, but without Spark. That way you can focus on getting the JDBC 
>> connection to work without Spark complicating the picture for you.
>> 
>> 
>>> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan 
>>> mailto:venkatesa...@noonacademy.com>> wrote:
>>> 
>>> Hi Team,
>>> 
>>> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is 
>>> same as the one in this Stackoverflow question 
>>> <https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
>>>  but there are no answers there.
>>> 
>>> This is what I am trying:
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>>         (ssh_host, ssh_port),
>>>         ssh_username=ssh_user,
>>>         ssh_pkey=ssh_key_file,
>>>         remote_bind_address=(sql_hostname, sql_port),
>>>         local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
>>>     tunnel.local_bind_port
>>>     b1_semester_df = spark.read \
>>>         .format("jdbc") \
>>>         .option("url", b2b_mysql_url.replace("<>", str(tunnel.local_bind_port))) \
>>>         .option("query", b1_semester_sql) \
>>>         .option("database", 'b2b') \
>>>         .option("password", b2b_mysql_password) \
>>>         .option("driver", "com.mysql.cj.jdbc.Driver") \
>>>         .load()
>>>     b1_semester_df.count()
>>> 
>>> Here, the b1_semester_df is loaded but when I try count on the same Df it 
>>> fails saying this
>>> 
>>> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
>>> aborting job
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
>>> print(self._jdf.showString(n, 20, vertical))
>>>   File 
>>> "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
>>> 1257, in __call__
>>>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
>>> return f(*a, **kw)
>>>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
>>> line 328, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling 
>>> o284.showString.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>>> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
>>> 2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3): 
>>> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link 
>>> failure
>>> 
>>> However, the same is working fine with pandas df. I have tried this below 
>>> and it worked.
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>>         (ssh_host, ssh_port),
>>>         ssh_username=ssh_user,
>>>         ssh_pkey=ssh_key_file,
>>>         remote_bind_address=(sql_hostname, sql_port)) as tunnel:
>>>     conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>>>                            passwd=sql_password, db=sql_main_database,
>>>                            port=tunnel.local_bind_port)
>>>     df = pd.read_sql_query(b1_semester_sql, conn)
>>>     spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
>>> 
>>> So wanted to check what I am missing with my Spark usage. Please help.
>>> 
>>> Thanks,
>>> Venkat
>>> 
>> 



Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.

Also, this statement

> We are not validating against table or column existence.

is not correct. When you call spark.sql(…), Spark will lookup the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
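For example (a rough sketch against a recent Spark; the table name is
deliberately made up):

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("SELECT * FROM definitely_not_a_real_table")
except AnalysisException as e:
    print(e)   # TABLE_OR_VIEW_NOT_FOUND: the table reference could not be resolved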

Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a 
validation-only operation.

This question of validating SQL is already discussed on Stack Overflow 
. You may find some useful tips 
there.

Nick


> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh  
> wrote:
> 
>   
> Yes, you can validate the syntax of your PySpark SQL queries without 
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the query. 
> Something like below
>   __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>   /_/
> 
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040 
> Spark context available as 'sc' (master = local[*], app id = 
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> 
> == SQL ==
> SELECT * FROM  WHERE  = some value
> --^^^
> 
> Here we only check for syntax errors and not the actual existence of query 
> semantics. We are not validating against table or column existence.
> 
> This method is useful when you want to catch obvious syntax errors before 
> submitting your PySpark job to a cluster, especially when you don't have 
> access to the actual data.
> In summary:
> - This method validates syntax but will not catch semantic errors.
> - If you need more comprehensive validation, consider using a testing framework
>   and a small dataset.
> - For complex queries, using a linter or code analysis tool can help identify
>   potential issues.
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 24 Dec 2023 at 07:57, ram manickam  > wrote:
>> Hello,
>> Is there a way to validate pyspark sql to validate only syntax errors?. I 
>> cannot connect do actual data set to perform this validation.  Any help 
>> would be appreciated.
>> 
>> 
>> Thanks
>> Ram



[jira] [Created] (HADOOP-17562) Provide mechanism for explicitly specifying the compression codec for input files

2021-03-03 Thread Nicholas Chammas (Jira)
Nicholas Chammas created HADOOP-17562:
-

 Summary: Provide mechanism for explicitly specifying the 
compression codec for input files
 Key: HADOOP-17562
 URL: https://issues.apache.org/jira/browse/HADOOP-17562
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Nicholas Chammas


I come to you via SPARK-29280.

I am looking for the file _input_ equivalents of the following settings:
{code:java}
mapreduce.output.fileoutputformat.compress
mapreduce.map.output.compress{code}
Right now, I understand that Hadoop infers the codec to use when reading a file 
from the file's extension.

However, in some cases the files may have the incorrect extension or no 
extension. There are links to some examples from SPARK-29280.

Ideally, you should be able to explicitly specify the codec to use to read 
those files. I don't believe that's possible today. Instead, the current 
workaround appears to be to [create a custom codec 
class|https://stackoverflow.com/a/17152167/877069] and override the 
getDefaultExtension method to specify the extension to expect.

Does it make sense to offer an explicit way to select the compression codec for 
file input, mirroring how things work for file output?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-16930) Add com.amazonaws.auth.profile.ProfileCredentialsProvider to hadoop-aws docs

2020-03-20 Thread Nicholas Chammas (Jira)
Nicholas Chammas created HADOOP-16930:
-

 Summary: Add com.amazonaws.auth.profile.ProfileCredentialsProvider 
to hadoop-aws docs
 Key: HADOOP-16930
 URL: https://issues.apache.org/jira/browse/HADOOP-16930
 Project: Hadoop Common
  Issue Type: Improvement
  Components: documentation, fs/s3
Reporter: Nicholas Chammas


There is a very, very useful S3A authentication method that is not currently 
documented: {{com.amazonaws.auth.profile.ProfileCredentialsProvider}}

This provider lets you source your AWS credentials from a shared credentials 
file, typically stored under {{~/.aws/credentials}}, using a [named 
profile|https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html].
 All you need is to set the {{AWS_PROFILE}} environment variable, and the 
provider will get the appropriate credentials for you.

I discovered this from my coworkers, but cannot find it in the docs for 
hadoop-aws. I'd expect to see it at least mentioned in [this 
section|https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods].
 It should probably be added to the docs for every minor release that supports 
it, which I'd guess includes 2.8 on up.

(This provider should probably also be added to the default list of credential 
provider classes, but we can address that in another ticket. I can say that at 
least in 2.9.2, it's not in the default list.)

(This is not to be confused with 
{{com.amazonaws.auth.InstanceProfileCredentialsProvider}}, which serves a 
completely different purpose.)
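For reference, this is roughly how the provider can be wired up from Spark today, assuming {{AWS_PROFILE}} is exported in the environment (the bucket and path below are made up):
{code:python}
from pyspark.sql import SparkSession

# Point S3A at the profile-based credentials provider; AWS_PROFILE selects the profile.
spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.profile.ProfileCredentialsProvider")
         .getOrCreate())

df = spark.read.text("s3a://some-bucket/some-prefix/")
{code}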



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



[jira] [Created] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-25 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created HADOOP-15559:
-

 Summary: Clarity on Spark compatibility with hadoop-aws
 Key: HADOOP-15559
 URL: https://issues.apache.org/jira/browse/HADOOP-15559
 Project: Hadoop Common
  Issue Type: Improvement
  Components: documentation, fs/s3
Reporter: Nicholas Chammas


I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
command-line tool for launching Apache Spark clusters on AWS. One of the things 
I try to do for my users is make it straightforward to use Spark with 
{{s3a://}}. I do this by recommending that users start Spark with the 
{{hadoop-aws}} package.

For example:
{code:java}
pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
{code}
I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
work with what versions of Spark.

Spark releases are [built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, I've 
been told that I should be able to use newer versions of Hadoop and Hadoop 
libraries with Spark, so for example, running Spark built against Hadoop 2.7 
alongside HDFS 2.8 should work, and there is [no need to build Spark explicitly 
against Hadoop 
2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].

I'm having trouble translating this mental model into recommendations for how 
to pair Spark with {{hadoop-aws}}.

For example, Spark 2.3.1 built against Hadoop 2.7 works with 
{{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
yields the following error when I try to access files via {{s3a://}}.
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
: java.lang.IllegalAccessError: tried to access method 
org.apache.hadoop.metrics2.lib.MutableCounterLong.(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
 from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748){code}
So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
release of Hadoop that Spark is built against. However, neither [this 
page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
pair the correct version of {{hadoop-aws}} with Spark.

Would it be appropriate to add some guidance somewhere on what versions of 
{{hadoop-aws}} work with what versions and builds of Spark? It would help 
eliminate this kind of guesswork and slow spelunking.
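
For reference, the matched pairing described above looks like this in practice 
(an illustrative sketch based on my own testing with the versions mentioned 
above, not an official compatibility statement):
{code:java}
# Spark 2.3.1, prebuilt against Hadoop 2.7, paired with hadoop-aws from the same 2.7.x line
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.6"
{code}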



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



[jira] [Resolved] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved HADOOP-15559.
---
Resolution: Fixed

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of 
> {{hadoop-aws}} work with what versions and builds of Spark? It would hel

[jira] [Created] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-25 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created HADOOP-15559:
-

 Summary: Clarity on Spark compatibility with hadoop-aws
 Key: HADOOP-15559
 URL: https://issues.apache.org/jira/browse/HADOOP-15559
 Project: Hadoop Common
  Issue Type: Improvement
  Components: documentation, fs/s3
Reporter: Nicholas Chammas


I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
command-line tool for launching Apache Spark clusters on AWS. One of the things 
I try to do for my users is make it straightforward to use Spark with 
{{s3a://}}. I do this by recommending that users start Spark with the 
{{hadoop-aws}} package.

For example:
{code:java}
pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
{code}
I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
work with what versions of Spark.

Spark releases are [built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, I've 
been told that I should be able to use newer versions of Hadoop and Hadoop 
libraries with Spark, so for example, running Spark built against Hadoop 2.7 
alongside HDFS 2.8 should work, and there is [no need to build Spark explicitly 
against Hadoop 
2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].

I'm having trouble translating this mental model into recommendations for how 
to pair Spark with {{hadoop-aws}}.

For example, Spark 2.3.1 built against Hadoop 2.7 works with 
{{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
yields the following error when I try to access files via {{s3a://}}.
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
: java.lang.IllegalAccessError: tried to access method 
org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
 from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at 
org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748){code}
So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
release of Hadoop that Spark is built against. However, neither [this 
page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
pair the correct version of {{hadoop-aws}} with Spark.

Would it be appropriate to add some guidance somewhere on what versions of 
{{hadoop-aws}} work with what versions and builds of Spark? It would help 
eliminate this kind of guesswork and slow spelunking.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-26 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524469#comment-16524469
 ] 

Nicholas Chammas commented on HADOOP-15559:
---

Hi [~ste...@apache.org] and thank you for the thorough response and references.
 # Is [the s3a troubleshooting 
guide|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md]
 published anywhere? Or is the GitHub URL the canonical URL? I feel like [S3 
Support in Apache Hadoop|https://wiki.apache.org/hadoop/AmazonS3] is the most 
visible bit of documentation about s3a. It would make sense to link to the 
troubleshooting guide from there.
 # In my case, I am not adding the AWS SDK individually. By using {{pyspark 
--packages}} (or {{spark-submit --packages}}) with hadoop-aws, I understand 
that Spark automatically pulls transitive dependencies for me. So my focus has 
been to just get the mapping of Spark version to hadoop-aws version correct.

Additionally, I am trying really hard to stick to the default release builds of 
Spark, as opposed to building my own versions of Spark to use with 
[Flintrock|https://github.com/nchammas/flintrock]. Being able to spin Spark 
clusters up on EC2 by downloading Spark directly from the Apache mirror network 
means one less piece of infrastructure I have to maintain myself. So I'm trying 
not to get into the business of building Spark, though I am aware of 
{{-Phadoop-cloud}}.

Thankfully, it looks like [Spark 2.3.1 built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/] works with {{--packages 
"org.apache.hadoop:hadoop-aws:2.7.6"}}, and I suppose according to your comment 
in SPARK-22919 that is basically the version of hadoop-aws I need to use with 
these releases as long as Spark is built against Hadoop 2.7.

Does that sound about right to you?
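
For anyone else trying to work out this pairing: one quick way to confirm which 
Hadoop line a Spark release build ships with is to look at the bundled Hadoop 
jars (a rough sketch; the directory name depends on the exact download you 
unpack):
{code:java}
# Spark release builds bundle their Hadoop jars under jars/
ls spark-2.3.1-bin-hadoop2.7/jars/ | grep '^hadoop-common'
# prints something like hadoop-common-2.7.x.jar, so pick hadoop-aws from the 2.7.x line
{code}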

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:

[jira] [Comment Edited] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-26 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524469#comment-16524469
 ] 

Nicholas Chammas edited comment on HADOOP-15559 at 6/27/18 2:27 AM:


Hi [~ste...@apache.org] and thank you for the thorough response and references.

1. Is [the s3a troubleshooting 
guide|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md]
 published anywhere? Or is the GitHub URL the canonical URL? I feel like [S3 
Support in Apache Hadoop|https://wiki.apache.org/hadoop/AmazonS3] is the most 
visible bit of documentation about s3a. It would make sense to link to the 
troubleshooting guide from there. 

2. In my case, I am not adding the AWS SDK individually. By using {{pyspark 
--packages}} (or {{spark-submit --packages}}) with hadoop-aws, I understand 
that Spark automatically pulls transitive dependencies for me. So my focus has 
been to just get the mapping of Spark version to hadoop-aws version correct.

Additionally, I am trying really hard to stick to the default release builds of 
Spark, as opposed to building my own versions of Spark to use with 
[Flintrock|https://github.com/nchammas/flintrock]. Being able to spin Spark 
clusters up on EC2 by downloading Spark directly from the Apache mirror network 
means one less piece of infrastructure I have to maintain myself. So I'm trying 
not to get into the business of building Spark, though I am aware of 
{{-Phadoop-cloud}}.

Thankfully, it looks like [Spark 2.3.1 built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/] works with {{--packages 
"org.apache.hadoop:hadoop-aws:2.7.6"}}, and I suppose according to your comment 
in SPARK-22919 that is basically the version of hadoop-aws I need to use with 
these releases as long as Spark is built against Hadoop 2.7.

Does that sound about right to you?


was (Author: nchammas):
Hi [~ste...@apache.org] and thank you for the thorough response and references.
 # Is [the s3a troubleshooting 
guide|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md]
 published anywhere? Or is the GitHub URL the canonical URL? I feel like [S3 
Support in Apache Hadoop|https://wiki.apache.org/hadoop/AmazonS3] is the most 
visible bit of documentation about s3a. It would make sense to link to the 
troubleshooting guide from there.
 # In my case, I am not adding the AWS SDK individually. By using {{pyspark 
--packages}} (or {{spark-submit --packages}}) with hadoop-aws, I understand 
that Spark automatically pulls transitive dependencies for me. So my focus has 
been to just get the mapping of Spark version to hadoop-aws version correct.

Additionally, I am trying really hard to stick to the default release builds of 
Spark, as opposed to building my own versions of Spark to use with 
[Flintrock|https://github.com/nchammas/flintrock]. Being able to spin Spark 
clusters up on EC2 by downloading Spark directly from the Apache mirror network 
means one less piece of infrastructure I have to maintain myself. So I'm trying 
not to get into the business of building Spark, though I am aware of 
{{-Phadoop-cloud}}.

Thankfully, it looks like [Spark 2.3.1 built against Hadoop 
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/] works with {{--packages 
"org.apache.hadoop:hadoop-aws:2.7.6"}}, and I suppose according to your comment 
in SPARK-22919 that is basically the version of hadoop-aws I need to use with 
these releases as long as Spark is built against Hadoop 2.7.

Does that sound about right to you?

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use new

[jira] [Commented] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525697#comment-16525697
 ] 

Nicholas Chammas commented on HADOOP-15559:
---

Looks good to me. I will consider raising the issue of building with 
{{-Phadoop-cloud}} on the Spark dev list.

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
&g

[jira] [Resolved] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved HADOOP-15559.
---
Resolution: Fixed

> Clarity on Spark compatibility with hadoop-aws
> --
>
> Key: HADOOP-15559
> URL: https://issues.apache.org/jira/browse/HADOOP-15559
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of 
> {{hadoop-aws}} work with what versions and builds of Spark? It would help 

[jira] [Created] (HADOOP-17562) Provide mechanism for explicitly specifying the compression codec for input files

2021-03-03 Thread Nicholas Chammas (Jira)
Nicholas Chammas created HADOOP-17562:
-

 Summary: Provide mechanism for explicitly specifying the 
compression codec for input files
 Key: HADOOP-17562
 URL: https://issues.apache.org/jira/browse/HADOOP-17562
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Nicholas Chammas


I come to you via SPARK-29280.

I am looking for the file _input_ equivalents of the following settings:
{code:java}
mapreduce.output.fileoutputformat.compress
mapreduce.map.output.compress{code}
Right now, I understand that Hadoop infers the codec to use when reading a file 
from the file's extension.

However, in some cases the files may have an incorrect extension or no 
extension at all. SPARK-29280 links to some examples.

Ideally, you should be able to explicitly specify the codec to use to read 
those files. I don't believe that's possible today. Instead, the current 
workaround appears to be to [create a custom codec 
class|https://stackoverflow.com/a/17152167/877069] and override the 
getDefaultExtension method to specify the extension to expect.

Does it make sense to offer an explicit way to select the compression codec for 
file input, mirroring how things work for file output?
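
For context, the linked workaround boils down to something like the following 
(a rough sketch; the class name and the {{.log}} extension are hypothetical, 
and the class still has to be registered via {{io.compression.codecs}}):
{code:java}
import org.apache.hadoop.io.compress.GzipCodec;

// Hypothetical codec that tells Hadoop to treat ".log" files as gzip-compressed,
// working around the extension-based codec inference described above.
public class GzipAsLogCodec extends GzipCodec {
    @Override
    public String getDefaultExtension() {
        return ".log"; // the (misleading) extension the input files actually carry
    }
}
{code}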



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org



[jira] [Created] (HADOOP-16930) Add com.amazonaws.auth.profile.ProfileCredentialsProvider to hadoop-aws docs

2020-03-20 Thread Nicholas Chammas (Jira)
Nicholas Chammas created HADOOP-16930:
-

 Summary: Add com.amazonaws.auth.profile.ProfileCredentialsProvider 
to hadoop-aws docs
 Key: HADOOP-16930
 URL: https://issues.apache.org/jira/browse/HADOOP-16930
 Project: Hadoop Common
  Issue Type: Improvement
  Components: documentation, fs/s3
Reporter: Nicholas Chammas


There is a very, very useful S3A authentication method that is not currently 
documented: {{com.amazonaws.auth.profile.ProfileCredentialsProvider}}

This provider lets you source your AWS credentials from a shared credentials 
file, typically stored under {{~/.aws/credentials}}, using a [named 
profile|https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html].
 All you need is to set the {{AWS_PROFILE}} environment variable, and the 
provider will get the appropriate credentials for you.

I discovered this from my coworkers, but cannot find it in the docs for 
hadoop-aws. I'd expect to see it at least mentioned in [this 
section|https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods].
 It should probably be added to the docs for every minor release that supports 
it, which I'd guess includes 2.8 on up.

(This provider should probably also be added to the default list of credential 
provider classes, but we can address that in another ticket. I can say that at 
least in 2.9.2, it's not in the default list.)

(This is not to be confused with 
{{com.amazonaws.auth.InstanceProfileCredentialsProvider}}, which serves a 
completely different purpose.)
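
To illustrate, this is roughly how the provider gets wired up today (a sketch; 
{{my-profile}} is a placeholder for a profile defined in {{~/.aws/credentials}}, 
and the hadoop-aws version shown is just an example):
{code:java}
export AWS_PROFILE=my-profile

pyspark \
  --packages "org.apache.hadoop:hadoop-aws:2.9.2" \
  --conf "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider"
{code}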



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16930) Add com.amazonaws.auth.profile.ProfileCredentialsProvider to hadoop-aws docs

2020-03-20 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063734#comment-17063734
 ] 

Nicholas Chammas commented on HADOOP-16930:
---

cc [~ste...@apache.org] - I'd be happy to work on the doc update if I've 
understood the issue correctly.

> Add com.amazonaws.auth.profile.ProfileCredentialsProvider to hadoop-aws docs
> 
>
> Key: HADOOP-16930
> URL: https://issues.apache.org/jira/browse/HADOOP-16930
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: documentation, fs/s3
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There is a very, very useful S3A authentication method that is not currently 
> documented: {{com.amazonaws.auth.profile.ProfileCredentialsProvider}}
> This provider lets you source your AWS credentials from a shared credentials 
> file, typically stored under {{~/.aws/credentials}}, using a [named 
> profile|https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html].
>  All you need is to set the {{AWS_PROFILE}} environment variable, and the 
> provider will get the appropriate credentials for you.
> I discovered this from my coworkers, but cannot find it in the docs for 
> hadoop-aws. I'd expect to see it at least mentioned in [this 
> section|https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods].
>  It should probably be added to the docs for every minor release that 
> supports it, which I'd guess includes 2.8 on up.
> (This provider should probably also be added to the default list of 
> credential provider classes, but we can address that in another ticket. I can 
> say that at least in 2.9.2, it's not in the default list.)
> (This is not to be confused with 
> {{com.amazonaws.auth.InstanceProfileCredentialsProvider}}, which serves a 
> completely different purpose.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (RAT-323) Harmonize UIs

2024-01-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806930#comment-17806930
 ] 

Nicholas Chammas commented on RAT-323:
--

Big +1 for enabling the CLI to use SCM ignores as excludes.

The Apache Spark project uses RAT via the CLI, and I am currently trying to 
clean up the configured excludes there because it's a [total 
mess|https://github.com/apache/spark/blob/c0ff0f579daa21dcc6004058537d275a0dd2920f/dev/.rat-excludes].
 This is partly because RAT is not using the project's existing .gitignore 
files, and partly because people expect .rat-excludes to work the same way as 
.gitignore.

> Harmonize UIs
> -
>
> Key: RAT-323
> URL: https://issues.apache.org/jira/browse/RAT-323
> Project: Apache Rat
>  Issue Type: Improvement
>  Components: cli
>Affects Versions: 0.16
>Reporter: Claude Warren
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The UIs (CLI, ANT and Maven) were all developed separately and have different 
> options.
> There is an overlap in some functionality and the functionality of some UIs 
> is not found in others.
> This task is to do two things:
>  # collect all the UI options, and ensure that they are all supported in the 
> ReportConfiguration class. 
>  # modify the UIs so that the names of the options are the same (or as 
> similar as possible) across the three UIs.  Renamed methods are to be 
> deprecated in favour of new methods.
>  
> Example:
> apache-rat-plugin has 3 options: parseSCMIgnoresAsExcludes, 
> useEclipseDefaultExcludes, useIdeaDefaultExcludes that change the file 
> filter.  These are options that would be useful in all UIs and should be 
> moved to the ReportConfiguration so that any UI can set them.
> By harmonization I mean that options like the above are extracted from the 
> specific UIs where they are implemented and moved to the ReportConfiguration 
> so that the implementations are in one place and can be shared across all UIs.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (RAT-352) Enable use of wildcard expressions in exclude file

2024-01-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created RAT-352:


 Summary: Enable use of wildcard expressions in exclude file
 Key: RAT-352
 URL: https://issues.apache.org/jira/browse/RAT-352
 Project: Apache Rat
  Issue Type: Improvement
  Components: cli
Reporter: Nicholas Chammas


Due to the widespread use of git, I would find it much more intuitive if 
.rat-excludes worked like .gitignore. I think most people on the Spark project 
would agree (though, fair disclosure, I haven't polled them).

Would it make sense to add a CLI option instructing RAT to interpret entries in 
the exclude file as wildcard expressions (as opposed to regular expressions) 
that work more or less like .gitignore?

This feature request is somewhat related to RAT-265.
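
To illustrate the difference with a hypothetical entry, the same exclusion 
would be written as follows under the two interpretations:
{code}
regex (how .rat-excludes entries are read today):  .*\.min\.js
wildcard (proposed, .gitignore-style):             *.min.js
{code}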



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RAT-352) Enable use of wildcard expressions in exclude file

2024-01-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807311#comment-17807311
 ] 

Nicholas Chammas commented on RAT-352:
--

> would it make sense to provide a CLI option that reads a .gitignore instead 
> of a .ratexclude file to allow for your feature request?

The Spark project needs separate listings in .gitignore vs. .rat-excludes. As 
long as the new option simply changes how the patterns are interpreted (from 
regex to wildcard), we can update our existing .rat-excludes to work with it.

The goal (for me at least) is to be able to look at .gitignore and 
.rat-excludes and interpret the entries in there the same way. I think it's 
more intuitive and easier to manage.

> Enable use of wildcard expressions in exclude file
> --
>
> Key: RAT-352
> URL: https://issues.apache.org/jira/browse/RAT-352
> Project: Apache Rat
>  Issue Type: Improvement
>  Components: cli
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Due to the widespread use of git, I would find it much more intuitive if 
> .rat-excludes worked like .gitignore. I think most people on the Spark 
> project would agree (though, fair disclosure, I haven't polled them).
> Would it make sense to add a CLI option instructing RAT to interpret entries 
> in the exclude file as wildcard expressions (as opposed to regular 
> expressions) that work more or less like .gitignore?
> This feature request is somewhat related to RAT-265.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ORC-152) Saving empty Spark DataFrame via ORC does not preserve schema

2017-02-28 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created ORC-152:


 Summary: Saving empty Spark DataFrame via ORC does not preserve 
schema
 Key: ORC-152
 URL: https://issues.apache.org/jira/browse/ORC-152
 Project: Orc
  Issue Type: Bug
Reporter: Nicholas Chammas
Priority: Minor


Details are on SPARK-15474.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276842#comment-17276842
 ] 

Nicholas Chammas commented on PARQUET-41:
-

Where is the user documentation for all the bloom filter-related functionality 
that will be released as part of parquet-mr 1.12? I'm thinking of user settings 
like {{parquet.filter.bloom.enabled}} and {{parquet.bloom.filter.*}}, along 
with anything else a user might care about.

For example, if a Spark user wants to use or configure bloom filters on their 
Parquet data, what documentation should they reference?

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276862#comment-17276862
 ] 

Nicholas Chammas commented on PARQUET-41:
-

Thanks for the link [~yumwang]. That 
[README|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#readme] 
is what I was looking for.

Are these docs published on the [documentation 
site|http://parquet.apache.org/documentation/latest/] anywhere, or is the 
README file on GitHub the canonical reference?

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AVRO-3923) Add Avro 1.11.3 release blog

2024-01-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809199#comment-17809199
 ] 

Nicholas Chammas commented on AVRO-3923:


Silly question but: Is this URL supposed to 404?

[https://avro.apache.org/docs/1.11.3/specification/]

Where are the docs for 1.11.3?

> Add Avro 1.11.3 release blog
> 
>
> Key: AVRO-3923
> URL: https://issues.apache.org/jira/browse/AVRO-3923
> Project: Apache Avro
>  Issue Type: Improvement
>  Components: website
>Affects Versions: 1.11.3
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ORC-152) Saving empty Spark DataFrame via ORC does not preserve schema

2017-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888475#comment-15888475
 ] 

Nicholas Chammas commented on ORC-152:
--

The problem appears to be that when you write an empty DataFrame with ORC, the 
schema is lost. This doesn't happen with Parquet for example.

Perhaps this example is clearer: 
https://github.com/apache/spark/pull/13257#issuecomment-221132286

Note the schema of the DataFrame read from disk.

> Saving empty Spark DataFrame via ORC does not preserve schema
> -
>
> Key: ORC-152
> URL: https://issues.apache.org/jira/browse/ORC-152
> Project: Orc
>  Issue Type: Bug
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Details are on SPARK-15474.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (ORC-152) Saving empty Spark DataFrame via ORC does not preserve schema

2017-03-01 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/ORC-152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved ORC-152.
--
Resolution: Invalid

> Saving empty Spark DataFrame via ORC does not preserve schema
> -
>
> Key: ORC-152
> URL: https://issues.apache.org/jira/browse/ORC-152
> Project: Orc
>  Issue Type: Bug
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Details are on SPARK-15474.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ORC-152) Saving empty Spark DataFrame via ORC does not preserve schema

2017-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890628#comment-15890628
 ] 

Nicholas Chammas commented on ORC-152:
--

Thanks for the additional information. I'll take this back to the Spark project.

> Saving empty Spark DataFrame via ORC does not preserve schema
> -
>
> Key: ORC-152
> URL: https://issues.apache.org/jira/browse/ORC-152
> Project: Orc
>  Issue Type: Bug
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> Details are on SPARK-15474.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ORC-152) Saving empty Spark DataFrame via ORC does not preserve schema

2017-12-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289678#comment-16289678
 ] 

Nicholas Chammas commented on ORC-152:
--

A link to the matching Spark issue is in the description above.

> Saving empty Spark DataFrame via ORC does not preserve schema
> -
>
> Key: ORC-152
> URL: https://issues.apache.org/jira/browse/ORC-152
> Project: ORC
>  Issue Type: Bug
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> Details are on SPARK-15474.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (SPARK-18495) Web UI should document meaning of green dot in DAG visualization

2016-11-17 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18495:


 Summary: Web UI should document meaning of green dot in DAG 
visualization
 Key: SPARK-18495
 URL: https://issues.apache.org/jira/browse/SPARK-18495
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.0.2
Reporter: Nicholas Chammas
Priority: Trivial


A green dot in the DAG visualization apparently means that the referenced RDD 
is cached. This is not documented anywhere except in [this blog 
post|https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html].

It would be good if the Web UI itself documented this somehow (perhaps in the 
tooltip?) so that the user can naturally learn what it means while using the 
Web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18495) Web UI should document meaning of green dot in DAG visualization

2016-11-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674980#comment-15674980
 ] 

Nicholas Chammas commented on SPARK-18495:
--

cc [~andrewor14]

> Web UI should document meaning of green dot in DAG visualization
> 
>
> Key: SPARK-18495
> URL: https://issues.apache.org/jira/browse/SPARK-18495
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> A green dot in the DAG visualization apparently means that the referenced RDD 
> is cached. This is not documented anywhere except in [this blog 
> post|https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html].
> It would be good if the Web UI itself documented this somehow (perhaps in the 
> tooltip?) so that the user can naturally learn what it means while using the 
> Web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2016-11-25 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18589:


 Summary: persist() resolves "java.lang.RuntimeException: Invalid 
PythonUDF (...), requires attributes from more than one child"
 Key: SPARK-18589
 URL: https://issues.apache.org/jira/browse/SPARK-18589
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.2, 2.1.0
 Environment: Python 3.5, Java 8
Reporter: Nicholas Chammas
Priority: Minor


Smells like another optimizer bug that's similar to SPARK-17100 and 
SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
{{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.

I don't have a minimal repro for this yet, but the error I'm seeing is:

{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
: java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires attributes 
from more than one child.
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
at scala.collection.immutable.Stream.foreach(Stream.scala:594)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2555)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.ref

[jira] [Commented] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2016-11-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696717#comment-15696717
 ] 

Nicholas Chammas commented on SPARK-18589:
--

cc [~davies] [~hvanhovell]

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scal

[jira] [Updated] (SPARK-16589) Chained cartesian produces incorrect number of records

2016-11-30 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-16589:
-
Labels: correctness  (was: )

> Chained cartesian produces incorrect number of records
> --
>
> Key: SPARK-16589
> URL: https://issues.apache.org/jira/browse/SPARK-16589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>  Labels: correctness
>
> Chaining cartesian calls in PySpark results in fewer records than expected. 
> It can be reproduced as follows:
> {code}
> rdd = sc.parallelize(range(10), 1)
> rdd.cartesian(rdd).cartesian(rdd).count()
> ## 355
> rdd.cartesian(rdd).cartesian(rdd).distinct().count()
> ## 251
> {code}
> It looks like it is related to serialization. If we reserialize after the 
> initial cartesian:
> {code}
> rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 
> 1)).cartesian(rdd).count()
> ## 1000
> {code}
> or insert identity map:
> {code}
> rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count()
> ## 1000
> {code}
> it yields correct results.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713636#comment-15713636
 ] 

Nicholas Chammas commented on SPARK-13587:
--

[~tsp]:

{quote}
Previously, I have had reasonable success with zipping the contents of my conda 
environment in the gateway/driver node and submitting the zip file as an 
argument to --archives in the spark-submit command line. This approach works 
perfectly because it uses the existing spark infrastructure to distribute 
dependencies through to the workers. You actually don't even need anaconda 
installed on the workers since the zip can package the entire python 
installation within it. The downside is that conda zip files can bloat up 
quickly in a production Spark application.
{quote}

Can you elaborate on how you did this? I'm willing to jump through some hoops 
to create a hackish way of distributing dependencies while this JIRA task gets 
worked out.

What I'm trying is:
# Create a virtual environment and activate it.
# Pip install my requirements into that environment, as one would in a regular 
Python project.
# Zip up the venv/ folder and ship it with my application using {{--py-files}} 
(rough sketch of these steps below).
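
In concrete terms, here is a rough sketch of those steps (the file and script 
names are just placeholders for illustration):

{code}
# On the driver/gateway node. requirements.txt, venv.zip and my_app.py are
# placeholder names for this sketch, not anything Spark expects.

# 1. Create a virtual environment and activate it.
python3 -m venv venv
source venv/bin/activate

# 2. Install the project's dependencies into it.
pip install -r requirements.txt

# 3. Zip up the venv/ folder and ship it with the application.
zip -r venv.zip venv/
spark-submit --py-files venv.zip my_app.py
{code}

The appeal is the same as with the {{--archives}} approach quoted above: 
package the dependencies once on the driver side and let Spark's existing file 
distribution get them out to the workers.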

I'm struggling to get the workers to pick up Python dependencies from the 
packaged venv over what's in the system site-packages. All I want is to be able 
to ship out the dependencies with the application from a virtual environment 
all at once (i.e. without having to enumerate each dependency).

Has anyone been able to do this today? It would be good to document it as a 
workaround for people until this issue is resolved.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


