Re: How to list all registered tables in a sql context?

2014-09-07 Thread Jianshi Huang
Thanks Tobias,

I also found this: https://issues.apache.org/jira/browse/SPARK-3299

Looks like it's being worked on.

Jianshi

On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer  wrote:

> Hi,
>
> On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang 
> wrote:
>
>> Err... there's no such feature?
>>
>
> The problem is that the SQLContext's `catalog` member is protected, so you
> can't access it from outside. If you subclass SQLContext, and make sure
> that `catalog` is always a `SimpleCatalog`, you can check `catalog.tables`
> (which is a HashMap).
>
> Tobias
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: error: type mismatch while Union

2014-09-07 Thread Dhimant
Thank you, Aaron, for pointing out the problem. This only happens when I run this
code in spark-shell, but not when I submit the job.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13677.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-07 Thread Xiangrui Meng
There is an undocumented configuration to put user jars in front of the
Spark jar. But I'm not very certain that it works as expected (and
this is why it is undocumented). Please try turning on
spark.yarn.user.classpath.first. -Xiangrui
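
For reference, a minimal sketch of turning that flag on programmatically (it can also go into conf/spark-defaults.conf; the app name below is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Ask YARN to place the user's jars ahead of the Spark assembly on the classpath.
// Undocumented and best-effort, per the caveat above.
val conf = new SparkConf()
  .setAppName("user-classpath-first-example")
  .set("spark.yarn.user.classpath.first", "true")
val sc = new SparkContext(conf)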

On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen  wrote:
> I ran into the same issue. What I did was use maven shade plugin to shade my
> version of httpcomponents libraries into another package.
>
>
> On Fri, Sep 5, 2014 at 4:33 PM, Penny Espinoza
>  wrote:
>>
>> Hey - I’m struggling with some dependency issues with
>> org.apache.httpcomponents httpcore and httpclient when using spark-submit
>> with YARN running Spark 1.0.2 on a Hadoop 2.2 cluster.  I’ve seen several
>> posts about this issue, but no resolution.
>>
>> The error message is this:
>>
>>
>> Caused by: java.lang.NoSuchMethodError:
>> org.apache.http.impl.conn.DefaultClientConnectionOperator.(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
>> at
>> org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
>> at
>> org.apache.http.impl.conn.PoolingClientConnectionManager.(PoolingClientConnectionManager.java:114)
>> at
>> org.apache.http.impl.conn.PoolingClientConnectionManager.(PoolingClientConnectionManager.java:99)
>> at
>> org.apache.http.impl.conn.PoolingClientConnectionManager.(PoolingClientConnectionManager.java:85)
>> at
>> org.apache.http.impl.conn.PoolingClientConnectionManager.(PoolingClientConnectionManager.java:93)
>> at
>> com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
>> at
>> com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
>> at
>> com.amazonaws.http.AmazonHttpClient.(AmazonHttpClient.java:155)
>> at
>> com.amazonaws.AmazonWebServiceClient.(AmazonWebServiceClient.java:118)
>> at
>> com.amazonaws.AmazonWebServiceClient.(AmazonWebServiceClient.java:102)
>> at
>> com.amazonaws.services.s3.AmazonS3Client.(AmazonS3Client.java:332)
>> at
>> com.oncue.rna.realtime.streaming.config.package$.transferManager(package.scala:76)
>> at
>> com.oncue.rna.realtime.streaming.models.S3SchemaRegistry.(SchemaRegistry.scala:27)
>> at
>> com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry$lzycompute(SchemaRegistry.scala:46)
>> at
>> com.oncue.rna.realtime.streaming.models.S3SchemaRegistry$.schemaRegistry(SchemaRegistry.scala:44)
>> at
>> com.oncue.rna.realtime.streaming.coders.KafkaAvroDecoder.(KafkaAvroDecoder.scala:20)
>> ... 17 more
>>
>> The apache httpcomponents libraries include the method above as of version
>> 4.2.  The Spark 1.0.2 binaries seem to include version 4.1.
>>
>> I can get this to work in my driver program by adding exclusions to force
>> use of 4.1, but then I get the error in tasks even when using the --jars
>> option of the spark-submit command.  How can I get both the driver program
>> and the individual tasks in my spark-streaming job to use the same version
>> of this library so my job will run all the way through?
>>
>> thanks
>> p
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread Xiangrui Meng
You can try LinearRegression with sparse input. It converges to the least
squares solution if the linear system is over-determined, while the
convergence rate depends on the condition number. Applying standard
scaling is a popular heuristic to reduce the condition number.
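
A minimal MLlib sketch of that suggestion (the input path, line format and problem size are hypothetical, sc is an existing SparkContext, and feature scaling is left out):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Hypothetical input: each line is "b i1:v1 i2:v2 ..." describing one equation
// sum_i a_i * x_i = b with sparse coefficients.
val equations = sc.textFile("hdfs:///data/system.txt").map { line =>
  val parts   = line.split(' ')
  val b       = parts(0).toDouble
  val tokens  = parts.tail.map(_.split(':'))
  val indices = tokens.map(_(0).toInt)
  val values  = tokens.map(_(1).toDouble)
  LabeledPoint(b, Vectors.sparse(1000000, indices, values)) // 1e6 unknowns, made up
}
val model = LinearRegressionWithSGD.train(equations, 100)   // 100 iterations
val x = model.weights                                       // approximate solution vector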

If you are interested in sparse direct methods as in SuiteSparse, I'm
not aware of any effort to implement them on Spark.

-Xiangrui

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread durin
Doing a quick Google search, it appears to me that there are a number of people
who have implemented algorithms for solving systems of (sparse) linear
equations on Hadoop MapReduce.

However, I can find no such thing for Spark. 

Does anyone have information on whether there are attempts to create such an
algorithm for Spark?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Solving-Systems-of-Linear-Equations-Using-Spark-tp13674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Patrick Wendell
I would say that the first three are all used pretty heavily. Mesos
was the first one supported (long ago), standalone is the
simplest and most popular today, and YARN is newer but growing a lot
in activity.

SIMR is not used as much... it was designed mostly for environments
where users had access to Hadoop and couldn't easily install Spark.
Now most Hadoop vendors bundle Spark anyway, so it's not needed.

On Sun, Sep 7, 2014 at 6:29 PM, Otis Gospodnetic
 wrote:
> Hi,
>
> I'm trying to determine which Spark deployment models are the most popular -
> Standalone, YARN, Mesos, or SIMR.  Anyone knows?
>
> I thought I'm use search-hadoop.com to help me figure this out and this is
> what I found:
>
>
> 1) Standalone
> http://search-hadoop.com/?q=standalone&fc_project=Spark&fc_type=mail+_hash_+user
> (seems the most popular?)
>
> 2) YARN
>  http://search-hadoop.com/?q=yarn&fc_project=Spark&fc_type=mail+_hash_+user
> (almost as popular as standalone?)
>
> 3) Mesos
> http://search-hadoop.com/?q=mesos&fc_project=Spark&fc_type=mail+_hash_+user
> (less popular than yarn or standalone)
>
> 4) SIMR
> http://search-hadoop.com/?q=simr&fc_project=Spark&fc_type=mail+_hash_+user
> (no mentions?)
>
> This is obviously not very accurate but is the order right?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Otis Gospodnetic
Hi,

I'm trying to determine which Spark deployment models are the most popular
- Standalone, YARN, Mesos, or SIMR. Does anyone know?

I thought I'd use search-hadoop.com to help me figure this out, and this is
what I found:


1) Standalone
http://search-hadoop.com/?q=standalone&fc_project=Spark&fc_type=mail+_hash_+user
(seems the most popular?)

2) YARN
 http://search-hadoop.com/?q=yarn&fc_project=Spark&fc_type=mail+_hash_+user
(almost as popular as standalone?)

3) Mesos
http://search-hadoop.com/?q=mesos&fc_project=Spark&fc_type=mail+_hash_+user
(less popular than yarn or standalone)

4) SIMR
http://search-hadoop.com/?q=simr&fc_project=Spark&fc_type=mail+_hash_+user
(no mentions?)

This is obviously not very accurate but is the order right?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: How to list all registered tables in a sql context?

2014-09-07 Thread Tobias Pfeiffer
Hi,

On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang 
wrote:

> Err... there's no such feature?
>

The problem is that the SQLContext's `catalog` member is protected, so you
can't access it from outside. If you subclass SQLContext, and make sure
that `catalog` is always a `SimpleCatalog`, you can check `catalog.tables`
(which is a HashMap).
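
A rough sketch of that subclassing approach (this leans on Spark SQL internals from the 1.0/1.1 line, so the exact types and packages may shift between releases):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.analysis.SimpleCatalog

class InspectableSQLContext(sc: SparkContext) extends SQLContext(sc) {
  // `catalog` is protected[sql], so a subclass can reach it.
  def tableNames: Seq[String] = catalog match {
    case simple: SimpleCatalog => simple.tables.keys.toSeq // tables is a HashMap[String, LogicalPlan]
    case _                     => Seq.empty                // some other Catalog implementation is in use
  }
}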

Tobias


Re: Recursion

2014-09-07 Thread Tobias Pfeiffer
Hi,

On Fri, Sep 5, 2014 at 6:16 PM, Deep Pradhan 
wrote:
>
> Does Spark support recursive calls?
>

Can you give an example of which kind of recursion you would like to use?

Tobias


Spark groupByKey partition out of memory

2014-09-07 Thread julyfire
When a MappedRDD goes through the groupByKey transformation, tuples with the
same key that are distributed across different worker nodes are collected onto
one worker node, i.e.,
(K, V1), (K, V2), ..., (K, Vn) -> (K, Seq(V1, V2, ..., Vn)).

I want to know whether the value Seq(V1, V2, ..., Vn) of a tuple in the
grouped RDD can reside on different nodes or has to be on one node, if I
set the number of partitions when using groupByKey. If the value Seq(V1,
V2, ..., Vn) can only reside in the memory of just one machine, an
out-of-memory risk exists when the size of Seq(V1, V2, ..., Vn) is larger
than the JVM memory limit of that machine. If this happens, how should
we deal with it?
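
For illustration, a tiny sketch of the situation being described (sc is an existing SparkContext and the data is made up; as of Spark 1.0 groupByKey returns an Iterable rather than a Seq):

import org.apache.spark.SparkContext._ // pair-RDD functions (Spark 1.x)

val pairs   = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3), ("j", 4)))
// Even with an explicit partition count, all values for a given key are
// materialized together in the single partition that owns that key.
val grouped = pairs.groupByKey(8) // RDD[(String, Iterable[Int])]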



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-groupByKey-partition-out-of-memory-tp13669.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Soumitra Kumar
I have the following code:

stream foreachRDD { rdd =>
  if (rdd.take(1).size == 1) {
    rdd foreachPartition { iterator =>
      initDbConnection()
      iterator foreach { record =>
        // write record to db
      }
      closeDbConnection()
    }
  }
}

On Sun, Sep 7, 2014 at 1:26 PM, Sean Owen  wrote:

> ... I'd call out that last bit as actually tricky: "close off the driver"
>
> See this message for the right-est way to do that, along with the
> right way to open DB connections remotely instead of trying to
> serialize them:
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E
>
> On Sun, Sep 7, 2014 at 4:19 PM, Mayur Rustagi 
> wrote:
> > Standard pattern is to initialize the mysql jdbc driver in your
> mappartition
> > call , update database & then close off the driver.
> > Couple of gotchas
> > 1. New driver initiated for all your partitions
> > 2. If the effect(inserts & updates) is not idempotent, so if your server
> > crashes, Spark will replay updates to mysql & may cause data corruption.
> >
> >
> > Regards
> > Mayur
> >
> > Mayur Rustagi
> > Ph: +1 (760) 203 3257
> > http://www.sigmoidanalytics.com
> > @mayur_rustagi
> >
> >
> > On Sun, Sep 7, 2014 at 11:54 AM, jchen  wrote:
> >>
> >> Hi,
> >>
> >> Has someone tried using Spark Streaming with MySQL (or any other
> >> database/data store)? I can write to MySQL at the beginning of the
> driver
> >> application. However, when I am trying to write the result of every
> >> streaming processing window to MySQL, it fails with the following error:
> >>
> >> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> >> not
> >> serializable: java.io.NotSerializableException:
> >> com.mysql.jdbc.JDBC4PreparedStatement
> >>
> >> I think it is because the statement object should be serializable, in
> >> order
> >> to be executed on the worker node. Has someone tried the similar cases?
> >> Example code will be very helpful. My intension is to execute
> >> INSERT/UPDATE/DELETE/SELECT statements for each sliding window.
> >>
> >> Thanks,
> >> JC
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-database-access-e-g-MySQL-tp13644.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL check if query is completed (pyspark)

2014-09-07 Thread Michael Armbrust
Sometimes the underlying Hive code will also print exceptions during
successful execution (for example CREATE TABLE IF NOT EXISTS). If there is
actually a problem, Spark SQL should throw an exception.

What is the command you are running and what is the error you are seeing?


On Sat, Sep 6, 2014 at 2:11 PM, Davies Liu  wrote:

> The SQLContext.sql() will return an SchemaRDD, you need to call collect()
> to pull the data in.
>
> On Sat, Sep 6, 2014 at 6:02 AM, jamborta  wrote:
> > Hi,
> >
> > I am using Spark SQL to run some administrative queries and joins (e.g.
> > create table, insert overwrite, etc), where the query does not return any
> > data. I noticed if the query fails it prints some error message on the
> > console, but does not actually throw an exception (this is spark 1.0.2).
> >
> > Is there any way to get these errors from the returned object?
> >
> > thanks,
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Sean Owen
... I'd call out that last bit as actually tricky: "close off the driver"

See this message for the right-est way to do that, along with the
right way to open DB connections remotely instead of trying to
serialize them:

http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E

On Sun, Sep 7, 2014 at 4:19 PM, Mayur Rustagi  wrote:
> Standard pattern is to initialize the mysql jdbc driver in your mappartition
> call , update database & then close off the driver.
> Couple of gotchas
> 1. New driver initiated for all your partitions
> 2. If the effect(inserts & updates) is not idempotent, so if your server
> crashes, Spark will replay updates to mysql & may cause data corruption.
>
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
>
> On Sun, Sep 7, 2014 at 11:54 AM, jchen  wrote:
>>
>> Hi,
>>
>> Has someone tried using Spark Streaming with MySQL (or any other
>> database/data store)? I can write to MySQL at the beginning of the driver
>> application. However, when I am trying to write the result of every
>> streaming processing window to MySQL, it fails with the following error:
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> not
>> serializable: java.io.NotSerializableException:
>> com.mysql.jdbc.JDBC4PreparedStatement
>>
>> I think it is because the statement object should be serializable, in
>> order
>> to be executed on the worker node. Has someone tried the similar cases?
>> Example code will be very helpful. My intension is to execute
>> INSERT/UPDATE/DELETE/SELECT statements for each sliding window.
>>
>> Thanks,
>> JC
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-database-access-e-g-MySQL-tp13644.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Sean Owen
Also keep in mind there is a non-trivial amount of traffic between the
driver and cluster. It's not something I would do by default, running
the driver so remotely. With enough ports open it should work though.

On Sun, Sep 7, 2014 at 7:05 PM, Ognen Duzlevski
 wrote:
> Horacio,
>
> Thanks, I have not tried that, however, I am not after security right now -
> I am just wondering why something so obvious won't work ;)
>
> Ognen
>
>
> On 9/7/2014 12:38 PM, Horacio G. de Oro wrote:
>>
>> Have you tryied with ssh? It will be much secure (only 1 port open),
>> and you'll be able to run spark-shell over the networ. I'm using that
>> way in my project (https://github.com/data-tsunami/smoke) with good
>> results.
>>
>> I can't make a try now, but something like this should work:
>>
>> ssh -tt ec2-user@YOUR-EC2-IP /path/to/spark-shell SPARK-SHELL-OPTIONS
>>
>> With this approach you are way more secure (without installing a VPN),
>> you don't need spark/hadoop installed on your PC. You won't have acces
>> to local files, but you haven't mentioned that as a requirement :-)
>>
>> Hope this help you.
>>
>> Horacio
>> --
>>
>>Web: http://www.data-tsunami.com
>> Email: hgde...@gmail.com
>>Cel: +54 9 3572 525359
>>   LinkedIn: https://www.linkedin.com/in/hgdeoro
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Low Level Kafka Consumer for Spark

2014-09-07 Thread Dibyendu Bhattacharya
Hi Tathagata,

I have managed to implement the logic in the Kafka-Spark consumer to
recover from driver failure. This is just an interim fix until the actual
fix is done on the Spark side.

The logic is something like this.

1. When the individual Receiver starts for each topic partition, it
writes the Kafka messages along with certain metadata to the Block Store. This
metadata contains the message offset, partition id, topic name
and consumer id. You can see this logic in the PartitionManager.java next()
method.

2. In the driver code (Consumer.java), I am creating the union of all
these individual D-Streams and processing the data using a forEachRDD call.
In the driver code, I receive the RDDs which contain the Kafka
messages along with the metadata details, and periodically I commit the
"processed" offset of the Kafka messages to ZK.

3. When the driver stops and restarts, the Receiver starts again, and
this time in PartitionManager.java I check what the actual
"committed" offset for the partition is, and what the actual "processed"
offset of the same partition is. This logic is in the PartitionManager
constructor.

If this is a Receiver restart and the "processed" offset is less than the
"committed" offset, I start fetching again from the "processed" offset.
This may lead to duplicate records, but our system can handle duplicates.

I have tested with multiple driver kills/stops and found no data loss in
the Kafka consumer.

In the driver code I have not done any "checkpointing" yet; I will test that
tomorrow.


One interesting thing I found: if I do a "repartition" of the original stream, I
can still see the issue of data loss with this logic. What I believe is that
during re-partitioning Spark might be changing the order of the RDDs from the
way they were generated from the Kafka stream. So in the re-partition case,
even though I am committing the processed offset, it is not in order, and I
still see the issue. I am not sure if this understanding is correct, but I am
not able to find any other explanation.

But if I do not use repartition, this solution works fine.

I can make this configurable, so that when the actual fix is available,
this feature in the consumer can be turned off, as it is an overhead for the
consumer. Let me know what you think.

Regards,
Dibyendu




On Fri, Sep 5, 2014 at 11:14 PM, Tathagata Das 
wrote:

> Some thoughts on this thread to clarify the doubts.
>
> 1. Driver recovery: The current (1.1 to be released) does not recover the
> raw data that has been received but not processes. This is because when the
> driver dies, the executors die and so does the raw data that was stored in
> it. Only for HDFS, the data is not lost by driver recovery as the data is
> already present reliably in HDFS. This is something we want to fix by Spark
> 1.2 (3 month from now). Regarding recovery by replaying the data from
> Kafka, it is possible but tricky. Our goal is to provide strong guarantee,
> exactly-once semantics in all transformations. To guarantee this for all
> kinds of streaming computations stateful and not-stateful computations, it
> is requires that the data be replayed through Kafka in exactly same order,
> and the underlying blocks of data in Spark be regenerated in the exact way
> as it would have if there was no driver failure. This is quite tricky to
> implement, requires manipulation of zookeeper offsets, etc, that is hard to
> do with the high level consumer that KafkaUtil uses. Dibyendu's low level
> Kafka receiver may enable such approaches in the future. For now we
> definitely plan to solve the first problem very very soon.
>
> 3. Repartitioning: I am trying to understand the repartition issue. One
> common mistake I have seen is that developers repartition a stream but not
> use the repartitioned stream.
>
> WRONG:
> inputDstream.repartition(100)
> inputDstream.map(...).count().print()
>
> RIGHT:
> val repartitionedDStream = inputDStream.repartitoin(100)
> repartitionedDStream.map(...).count().print()
>
> Not sure if this helps solve the problem that you all the facing. I am
> going to add this to the stremaing programming guide to make sure this
> common mistake is avoided.
>
> TD
>
>
>
>
> On Wed, Sep 3, 2014 at 10:38 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi,
>>
>> Sorry for little delay . As discussed in this thread, I have modified the
>> Kafka-Spark-Consumer ( https://github.com/dibbhatt/kafka-spark-consumer)
>> code to have dedicated Receiver for every Topic Partition. You can see the
>> example howto create Union of these receivers
>> in consumer.kafka.client.Consumer.java .
>>
>> Thanks to Chris for suggesting this change.
>>
>> Regards,
>> Dibyendu
>>
>>
>> On Mon, Sep 1, 2014 at 2:55 AM, RodrigoB 
>> wrote:
>>
>>> Just a comment on the recovery part.
>>>
>>> Is it correct to say that currently Spark Streaming recovery design does
>>> not
>>> consider re-computations (upon metadata lineage recovery) that depend on
>>> blocks of data of the rec

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using
~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini  wrote:

> Hi,
>
> I would like to copy log files from s3 to the cluster's
> ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> running on the cluster - I'm getting the exception below.
>
> Is there a way to activate it, or is there a spark alternative to distcp?
>
> Thanks,
> Tomer
>
> mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> Invalid "mapreduce.jobtracker.address" configuration value for
> LocalJobRunner : "XXX:9001"
>
> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
>
> java.io.IOException: Cannot initialize Cluster. Please check your
> configuration for mapreduce.framework.name and the correspond server
> addresses.
>
> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
>
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
>
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
>
> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
>
> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
>
> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>
> at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Nicholas Chammas
I think you need to run start-all.sh or something similar on the EC2
cluster. MR is installed but is not running by default on EC2 clusters spun
up by spark-ec2.

On Sun, Sep 7, 2014 at 12:33 PM, Tomer Benyamini 
wrote:

> I've installed a spark standalone cluster on ec2 as defined here -
> https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if
> mr1/2 is part of this installation.
>
>
> On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin  wrote:
> > Distcp requires a mr1(or mr2) cluster to start. Do you have a mapreduce
> > cluster on your hdfs?
> > And from the error message, it seems that you didn't specify your
> jobtracker
> > address.
> >
> > --
> > Ye Xianjin
> > Sent with Sparrow
> >
> > On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote:
> >
> > Hi,
> >
> > I would like to copy log files from s3 to the cluster's
> > ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> > running on the cluster - I'm getting the exception below.
> >
> > Is there a way to activate it, or is there a spark alternative to distcp?
> >
> > Thanks,
> > Tomer
> >
> > mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> > org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> > Invalid "mapreduce.jobtracker.address" configuration value for
> > LocalJobRunner : "XXX:9001"
> >
> > ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
> >
> > java.io.IOException: Cannot initialize Cluster. Please check your
> > configuration for mapreduce.framework.name and the correspond server
> > addresses.
> >
> > at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
> >
> > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
> >
> > at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
> >
> > at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
> >
> > at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
> >
> > at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
> >
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >
> > at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Ognen Duzlevski

Horacio,

Thanks, I have not tried that, however, I am not after security right 
now - I am just wondering why something so obvious won't work ;)


Ognen

On 9/7/2014 12:38 PM, Horacio G. de Oro wrote:

Have you tryied with ssh? It will be much secure (only 1 port open),
and you'll be able to run spark-shell over the networ. I'm using that
way in my project (https://github.com/data-tsunami/smoke) with good
results.

I can't make a try now, but something like this should work:

ssh -tt ec2-user@YOUR-EC2-IP /path/to/spark-shell SPARK-SHELL-OPTIONS

With this approach you are way more secure (without installing a VPN),
you don't need spark/hadoop installed on your PC. You won't have acces
to local files, but you haven't mentioned that as a requirement :-)

Hope this help you.

Horacio
--

   Web: http://www.data-tsunami.com
Email: hgde...@gmail.com
   Cel: +54 9 3572 525359
  LinkedIn: https://www.linkedin.com/in/hgdeoro



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Horacio G. de Oro
Have you tried ssh? It will be much more secure (only 1 port open),
and you'll be able to run spark-shell over the network. I'm using that
approach in my project (https://github.com/data-tsunami/smoke) with good
results.

I can't try it now, but something like this should work:

ssh -tt ec2-user@YOUR-EC2-IP /path/to/spark-shell SPARK-SHELL-OPTIONS

With this approach you are much more secure (without installing a VPN),
and you don't need Spark/Hadoop installed on your PC. You won't have access
to local files, but you haven't mentioned that as a requirement :-)

Hope this helps.

Horacio
--

  Web: http://www.data-tsunami.com
   Email: hgde...@gmail.com
  Cel: +54 9 3572 525359
 LinkedIn: https://www.linkedin.com/in/hgdeoro

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Ognen Duzlevski

Have you actually tested this?

I have two instances; one is a standalone master and the other one just
has Spark installed, same version of Spark (1.0.0).


The security group on the master allows all (0-65535) TCP and UDP
traffic from the other machine, and the other machine allows all TCP/UDP
traffic from the master. Yet my spark-shell --master spark://master-ip:7077
still fails to connect.


What am I missing?

Thanks!
Ognen

On 9/5/2014 5:34 PM, qihong wrote:

Since you are using your home computer, so it's probably not reachable by EC2
from internet.

You can try to set "spark.driver.host" to your WAN ip, "spark.driver.port"
to a fixed port in SparkConf, and open that port in your home network (port
forwarding to the computer you are using). see if that helps.



--
View this message in 
context:http://apache-spark-user-list.1001560.n3.nabble.com/Running-spark-shell-or-queries-over-the-network-not-from-master-tp13543p13595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail:user-unsubscr...@spark.apache.org
For additional commands, e-mail:user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
I've installed a spark standalone cluster on ec2 as defined here -
https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if
mr1/2 is part of this installation.


On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin  wrote:
> Distcp requires a mr1(or mr2) cluster to start. Do you have a mapreduce
> cluster on your hdfs?
> And from the error message, it seems that you didn't specify your jobtracker
> address.
>
> --
> Ye Xianjin
> Sent with Sparrow
>
> On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote:
>
> Hi,
>
> I would like to copy log files from s3 to the cluster's
> ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> running on the cluster - I'm getting the exception below.
>
> Is there a way to activate it, or is there a spark alternative to distcp?
>
> Thanks,
> Tomer
>
> mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> Invalid "mapreduce.jobtracker.address" configuration value for
> LocalJobRunner : "XXX:9001"
>
> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
>
> java.io.IOException: Cannot initialize Cluster. Please check your
> configuration for mapreduce.framework.name and the correspond server
> addresses.
>
> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
>
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
>
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
>
> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
>
> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
>
> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>
> at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Ye Xianjin
Distcp requires an MR1 (or MR2) cluster to start. Do you have a MapReduce cluster
on your HDFS?
And from the error message, it seems that you didn't specify your jobtracker
address.


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote:

> Hi,
> 
> I would like to copy log files from s3 to the cluster's
> ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> running on the cluster - I'm getting the exception below.
> 
> Is there a way to activate it, or is there a spark alternative to distcp?
> 
> Thanks,
> Tomer
> 
> mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
> org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
> Invalid "mapreduce.jobtracker.address" configuration value for
> LocalJobRunner : "XXX:9001"
> 
> ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
> 
> java.io.IOException: Cannot initialize Cluster. Please check your
> configuration for mapreduce.framework.name (http://mapreduce.framework.name) 
> and the correspond server
> addresses.
> 
> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
> 
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)
> 
> at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)
> 
> at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
> 
> at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
> 
> at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
> 
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 
> at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: how to choose right DStream batch interval

2014-09-07 Thread Mayur Rustagi
Spark will simply have a backlog of tasks; it'll manage to process them
nonetheless, though if it keeps falling behind, you may run out of memory
or see unreasonable latency. For momentary spikes, Spark Streaming will
manage.
If you are looking to do 100% processing, you'll mostly have to go with the
5-second interval; the alternative is to process the data in two pipelines
(0.5 s & 5 s) in two Spark Streaming jobs & overwrite the results of one with
the other.
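
For reference, a minimal sketch of pinning the conservative 5-second interval (the names below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("batch-interval-example") // hypothetical
// One batch every 5 seconds; a batch that occasionally takes longer simply
// delays the following batches instead of being dropped.
val ssc = new StreamingContext(sparkConf, Seconds(5))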

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Sat, Sep 6, 2014 at 12:39 AM, qihong  wrote:

> repost since original msg was marked with "This post has NOT been accepted
> by
> the mailing list yet."
>
> I have some questions regarding DStream batch interval:
>
> 1. if it only take 0.5 second to process the batch 99% of time, but 1% of
> batches need 5 seconds to process (due to some random factor or failures),
> then what's the right batch interval? 5 seconds (the worst case)?
>
> 2. What will happen to DStream processing if 1 batch took longer than batch
> interval? Can Spark recover from that?
>
> Thanks,
> Qihong
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-choose-right-DStream-batch-interval-tp13578p13579.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Array and RDDs

2014-09-07 Thread Mayur Rustagi
Your question is a bit confusing.
I assume you have an RDD containing nodes & some metadata (child nodes,
maybe) & you are trying to attach another piece of metadata to it (the byte
array). If it's just the same byte array for all nodes, you can generate an
RDD with the count of nodes & zip the two RDDs together; you can also create
a (node, bytearray) combo & join the two RDDs together.
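
A small sketch of the join variant (sc is an existing SparkContext; the node ids and children are made up):

import org.apache.spark.SparkContext._ // pair-RDD functions (Spark 1.x)

// nodes: RDD[(Long, Seq[Long])], node id -> child node ids
val nodes  = sc.parallelize(Seq((1L, Seq(2L, 3L)), (2L, Seq.empty[Long]), (3L, Seq.empty[Long])))
val flags  = nodes.keys.map(id => (id, 0.toByte))  // one byte per node id
val tagged = nodes.join(flags)                     // RDD[(Long, (Seq[Long], Byte))]
// join matches by key, so the non-deterministic ordering you observed does not matter;
// zip, by contrast, requires both RDDs to have identical partitioning and ordering.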


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Sat, Sep 6, 2014 at 10:51 AM, Deep Pradhan 
wrote:

> Hi,
> I have an input file which consists of 
> I have created and RDD consisting of key-value pair where key is the node
> id and the values are the children of that node.
> Now I want to associate a byte with each node. For that I have created a
> byte array.
> Every time I print out the key-value pair in the RDD the key-value pairs
> do not come in the same order. Because of this I am finding it difficult to
> assign the byte values with each node.
> Can anyone help me out in this matter?
>
> I basically have the following code:
> val bitarray = Array.fill[Byte](number)(0)
>
> And I want to assiciate each byte in the array to a node.
> How should I do that?
>
> Thank You
>


Re: Q: About scenarios where driver execution flow may block...

2014-09-07 Thread Mayur Rustagi
Statements are executed only when you try to cause some effect
(produce data, or collect data on the driver). At the time of execution Spark
does all the dependency resolution & truncates paths that don't go anywhere,
as well as optimizing execution pipelines. So you really don't have to worry
about these.

The important thing is that if you are doing certain actions in your functions
that are non-explicitly dependent on others, then you may start seeing errors.
For example, you may write a file to HDFS during a map operation & expect
to read it in another map operation; according to Spark, a map operation is not
expected to alter anything apart from the RDD it is created upon, hence
Spark may not realize this dependency & may try to parallelize the two
operations, causing errors. Bottom line: as long as you make all your
dependencies explicit in RDDs, Spark will take care of the magic.
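
As a small illustration of the first point (a Scala fragment reusing rdd01 from the snippet above; outside the shell, reduceByKey also needs import org.apache.spark.SparkContext._ on Spark 1.x):

// Transformations only build the lineage graph; nothing runs and nothing blocks here.
val counts = rdd01.map(x => (x, 1)).reduceByKey(_ + _)
// An action forces evaluation; the driver blocks here until the cluster-wide job finishes.
val sample = counts.take(20)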

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Sun, Sep 7, 2014 at 12:14 AM, didata  wrote:

>  Hello friends:
>
> I have a theory question about call blocking in a Spark driver.
>
> Consider this (admittedly contrived =:)) snippet to illustrate this
> question...
>
> >>> x = rdd01.reduceByKey()  # or maybe some other 'shuffle-requiring
> action'.
>
> >>> b = sc.broadcast(x. take(20)) # Or any statement that requires the
> previous statement to complete, cluster-wide.
>
> >>> y = rdd02.someAction(f(b))
>
> Would the first or second statement above block because the second (or
> third) statement needs to wait for the previous one to complete,
> cluster-wide?
>
> Maybe this isn't the best example (typed on a phone), but generally I'm
> trying to understand the scenario(s) where a rdd call in the driver may
> block because the graph indicates that the next statement is dependent on
> the completion of the current one, cluster-wide (noy just lazy evaluated).
>
> Thank you. :)
>
> Sincerely yours,
> Team Dimension Data
>


Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Mayur Rustagi
The standard pattern is to initialize the MySQL JDBC driver in your
mapPartitions call, update the database & then close off the driver.
A couple of gotchas:
1. A new driver is initialized for each of your partitions.
2. If the effect (inserts & updates) is not idempotent and your server
crashes, Spark will replay the updates to MySQL & may cause data corruption.
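
A rough sketch of that pattern, assuming a DStream of (String, String) pairs (the DStream name, connection URL, table and columns are all hypothetical, and the MySQL JDBC driver jar must be on the executor classpath):

import java.sql.DriverManager

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Open the connection on the worker, inside the partition; never ship
    // Connection/PreparedStatement objects from the driver (they are not serializable).
    val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "pass")
    val stmt = conn.prepareStatement("INSERT INTO events (k, v) VALUES (?, ?)")
    try {
      records.foreach { case (k, v) =>
        stmt.setString(1, k)
        stmt.setString(2, v)
        stmt.executeUpdate() // not idempotent: a replayed batch will insert duplicates
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}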


Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Sun, Sep 7, 2014 at 11:54 AM, jchen  wrote:

> Hi,
>
> Has someone tried using Spark Streaming with MySQL (or any other
> database/data store)? I can write to MySQL at the beginning of the driver
> application. However, when I am trying to write the result of every
> streaming processing window to MySQL, it fails with the following error:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not
> serializable: java.io.NotSerializableException:
> com.mysql.jdbc.JDBC4PreparedStatement
>
> I think it is because the statement object should be serializable, in order
> to be executed on the worker node. Has someone tried the similar cases?
> Example code will be very helpful. My intension is to execute
> INSERT/UPDATE/DELETE/SELECT statements for each sliding window.
>
> Thanks,
> JC
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-database-access-e-g-MySQL-tp13644.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
Hi,

I would like to copy log files from s3 to the cluster's
ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
running on the cluster - I'm getting the exception below.

Is there a way to activate it, or is there a spark alternative to distcp?

Thanks,
Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
Invalid "mapreduce.jobtracker.address" configuration value for
LocalJobRunner : "XXX:9001"

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

java.io.IOException: Cannot initialize Cluster. Please check your
configuration for mapreduce.framework.name and the correspond server
addresses.

at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:83)

at org.apache.hadoop.mapreduce.Cluster.(Cluster.java:76)

at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Thanks! I found the HDFS UI via this port - http://[master-ip]:50070/.
It shows a 1-node HDFS though, although I have 4 slaves in my cluster.
Any idea why?

On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski
 wrote:
>
> On 9/7/2014 7:27 AM, Tomer Benyamini wrote:
>>
>> 2. What should I do to increase the quota? Should I bring down the
>> existing slaves and upgrade to ones with more storage? Is there a way
>> to add disks to existing slaves? I'm using the default m1.large slaves
>> set up using the spark-ec2 script.
>
> Take a look at: http://www.ec2instances.info/
>
> There you will find the available EC2 instances with their associated costs
> and how much ephemeral space they come with. Once you pick an instance you
> get only so much ephemeral space. You can always add drives but they will be
> EBS and not physically attached to the instance.
>
> Ognen
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Fwd: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably in a routing loop.

2014-09-07 Thread Ognen Duzlevski


I keep getting the reply below every time I send a message to the Spark user
list. Can this person be taken off the list by the powers that be?

Thanks!
Ognen

 Forwarded Message 
Subject: 	DELIVERY FAILURE: Error transferring to 
QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably 
in a routing loop.

Date:   Sun, 7 Sep 2014 08:29:23 -0500
From:   postmas...@sgcib.com
To: ognen.duzlev...@gmail.com



Your message

  Subject: Re: Adding quota to the ephemeral hdfs on a standalone spark cluster 
on ec2

was not delivered to:

  pierre.lanvin-...@sgcib.com

because:

  Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. 
 Message probably in a routing loop.



Reporting-MTA: dns;gapar018.fr.world.socgen

Final-Recipient: rfc822;pierre.lanvin-ext@sgcib.com
Action: failed
Status: 5.0.0
Remote-MTA: smtp;QCMBSJ601.HERMES.SI.SOCGEN
Diagnostic-Code: X-Notes; Error transferring to QCMBSJ601.HERMES.SI.SO
 CGEN; Maximum hop count exceeded.  Message probably in a routing loop.

--- Begin Message ---


On 9/7/2014 7:27 AM, Tomer Benyamini wrote:

2. What should I do to increase the quota? Should I bring down the
existing slaves and upgrade to ones with more storage? Is there a way
to add disks to existing slaves? I'm using the default m1.large slaves
set up using the spark-ec2 script.

Take a look at: http://www.ec2instances.info/

There you will find the available EC2 instances with their associated 
costs and how much ephemeral space they come with. Once you pick an 
instance you get only so much ephemeral space. You can always add drives 
but they will be EBS and not physically attached to the instance.


Ognen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


--- End Message ---

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Ognen Duzlevski


On 9/7/2014 7:27 AM, Tomer Benyamini wrote:

2. What should I do to increase the quota? Should I bring down the
existing slaves and upgrade to ones with more storage? Is there a way
to add disks to existing slaves? I'm using the default m1.large slaves
set up using the spark-ec2 script.

Take a look at: http://www.ec2instances.info/

There you will find the available EC2 instances with their associated 
costs and how much ephemeral space they come with. Once you pick an 
instance you get only so much ephemeral space. You can always add drives 
but they will be EBS and not physically attached to the instance.


Ognen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Hi,

I would like to make sure I'm not exceeding the quota on the local
cluster's hdfs. I have a couple of questions:

1. How do I know the quota? Here's the output of hadoop fs -count -q
which essentially does not tell me a lot

root@ip-172-31-7-49 ~]$ hadoop fs -count -q /
  2147483647  2147482006none inf
 4 163725412205559 /

2. What should I do to increase the quota? Should I bring down the
existing slaves and upgrade to ones with more storage? Is there a way
to add disks to existing slaves? I'm using the default m1.large slaves
set up using the spark-ec2 script.

Thanks,
Tomer

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.0.2 Can GroupByTest example be run in Eclipse without change

2014-09-07 Thread Shing Hing Man
After looking at the source code of SparkConf.scala, I found the following
solution. Just set the following Java system property:

-Dspark.master=local
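
For reference, SparkConf() reads spark.* Java system properties by default, so the property can also be set programmatically before the context is created (a minimal sketch; unlike the VM argument, this does require touching the code):

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to passing -Dspark.master=local as a VM argument in the Eclipse run configuration.
System.setProperty("spark.master", "local")
val sparkConf = new SparkConf().setAppName("GroupBy Test") // picks up spark.master from system properties
val sc = new SparkContext(sparkConf)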

Shing



On Monday, 1 September 2014, 22:09, Shing Hing Man  
wrote:
 


Hi, 

I have noticed that the GroupByTest example in
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/GroupByTest.scala
has been changed to  be run using spark-submit. 
Previously,  I set "local" as the first command line parameter, and this enable 
me to run GroupByTest in Eclipse. 
val sc = new SparkContext(args(0), "GroupBy Test",
System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq)


In the latest GroupByTest code, I can not pass in "local" as the first comand 
line parameter : 
val sparkConf = new SparkConf().setAppName("GroupBy Test")
var numMappers = if (args.length > 0) args(0).toInt else 2
var numKVPairs = if (args.length > 1) args(1).toInt else 1000
var valSize = if (args.length > 2) args(2).toInt else 1000
var numReducers = if (args.length > 3) args(3).toInt else numMappers
val sc = new SparkContext(sparkConf)


Is there a way to specify  "master=local" (maybe in an environment variable), 
so that I can run the latest 
version of GroupByTest in Eclipse without changing the code. 

Thanks in advance for your assistance !

Shing 

Crawler and Scraper with different priorities

2014-09-07 Thread Sandeep Singh
Hi all,

I am implementing a crawler and scraper. It should be able to process a
request for crawling & scraping within a few seconds of submitting the
job (around 1 mil/sec); for the rest, it can take some time (scheduled evenly
over the day). What is the best way to implement this?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-and-Scraper-with-different-priorities-tp13645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org