Re: Relation between DStream and RDDs

2014-03-19 Thread Tathagata Das
That is a good question. If I understand correctly, you need multiple RDDs
from a DStream in *every batch*. Can you elaborate on why you need
multiple RDDs in every batch?

TD
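
For illustration, a minimal Scala sketch of the behaviour being discussed (not code from the thread; the path and batch interval mirror the Java snippet in the quoted question, the master setting is an assumption): textFileStream produces a single RDD per batch interval, and foreachRDD runs once per batch on that RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: one RDD per batch interval; foreachRDD fires once per batch
val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-rdd-demo")
val ssc = new StreamingContext(conf, Seconds(60 * 60)) // 1-hour batches, as in the question
val lines = ssc.textFileStream("/Users/path/to/Input")

lines.foreachRDD { rdd =>
  // For textFileStream, each batch is exactly one RDD covering the newly arrived files
  println("Batch RDD with " + rdd.partitions.length + " partitions and " + rdd.count() + " lines")
}

ssc.start()
ssc.awaitTermination()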


On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani
wrote:

> Hi,
>
> As I understand, a DStream consists of 1 or more RDDs. And foreachRDD will
> run a given func on each and every RDD inside a DStream.
>
> I created a simple program which reads log files from a folder every hour:
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new
> Duration(60 * 60 * 1000)); //1 hour
> JavaDStream obj = stcObj.textFileStream("/Users/path/to/Input");
>
> When the interval is reached, Spark reads all the files and creates one
> and only one RDD (as I verified from a sysout inside foreachRDD).
>
> The streaming docs in a number of places indicate that many
> operations (e.g. flatMap) on a DStream are applied individually to each RDD,
> and that the resulting DStream consists of the mapped RDDs, the same number
> as in the input DStream.
> ref:
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#dstreams
>
> If that is the case, how can I generate a scenario wherein I have
> multiple RDDs inside a DStream in my example?
>
> Regards,
> Sanjay
>


Relation between DStream and RDDs

2014-03-19 Thread Sanjay Awatramani
Hi,

As I understand, a DStream consists of 1 or more RDDs. And foreachRDD will run 
a given func on each and every RDD inside a DStream.

I created a simple program which reads log files from a folder every hour:
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 
* 60 * 1000)); //1 hour
JavaDStream obj = stcObj.textFileStream("/Users/path/to/Input");

When the interval is reached, Spark reads all the files and creates one and 
only one RDD (as I verified from a sysout inside foreachRDD).

The streaming docs in a number of places indicate that many operations 
(e.g. flatMap) on a DStream are applied individually to each RDD, and that the 
resulting DStream consists of the mapped RDDs, the same number as in the input DStream.
ref: 
https://spark.apache.org/docs/latest/streaming-programming-guide.html#dstreams

If that is the case, how can I generate a scenario wherein I have multiple 
RDDs inside a DStream in my example?

Regards,
Sanjay

PySpark worker fails with IOError Broken Pipe

2014-03-19 Thread Nicholas Chammas
So I have the pyspark shell open and after some idle time I sometimes get
this:

>>> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/root/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/root/spark/python/pyspark/serializers.py", line 182, in
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/root/spark/python/pyspark/serializers.py", line 118, in
> dump_stream
> self._write_with_length(obj, stream)
>   File "/root/spark/python/pyspark/serializers.py", line 130, in
> _write_with_length
> stream.write(serialized)
> IOError: [Errno 32] Broken pipe
> Traceback (most recent call last):
>   File "/root/spark/python/pyspark/daemon.py", line 117, in launch_worker
> worker(listen_sock)
>   File "/root/spark/python/pyspark/daemon.py", line 107, in worker
> outfile.flush()
> IOError: [Errno 32] Broken pipe


The shell is still alive and I can continue to do work.

Is this anything to worry about or fix?

Nick





Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread 林武康
Thank you, everybody. Nice to know 😊

-----Original Message-----
From: "Nicholas Chammas" 
Sent: 2014/3/20 10:23
To: "user" 
Subject: Re: What's the lifecycle of an rdd? Can I control it?

Related question: 


If I keep creating new RDDs and cache()-ing them, does Spark automatically 
unpersist the least recently used RDD when it runs out of memory? Or is an 
explicit unpersist the only way to get rid of an RDD (barring the PR Tathagata 
mentioned)?


Also, does unpersist()-ing an RDD immediately free up space, or just allow that 
space to be reclaimed when needed?



On Wed, Mar 19, 2014 at 7:01 PM, Tathagata Das  
wrote:

Just a heads up, there is an active pull request that will automatically 
unpersist RDDs that are no longer in reference/scope from the application. 


TD



On Wed, Mar 19, 2014 at 6:58 PM, hequn cheng  wrote:

persist and unpersist.
unpersist:Mark the RDD as non-persistent, and remove all blocks for it from 
memory and disk



2014-03-19 16:40 GMT+08:00 林武康 :


Hi, can anyone tell me about the lifecycle of an rdd? I searched through the 
official website and still can't figure it out. Can I use an rdd in some stages 
and destroy it in order to release memory, because no stages ahead will use 
this rdd any more? Is it possible?

Thanks!

Sincerely 
Lin wukang

Re: Shark does not give any results with SELECT count(*) command

2014-03-19 Thread qingyang li
I have found the cause. My problem was that the format of the slaves file was
not correct, so the tasks were only run on the master.

Explaining it here to help anyone else who encounters a similar problem.


2014-03-20 9:57 GMT+08:00 qingyang li :

> Hi, I installed Spark 0.9.0 and Shark 0.9 on 3 nodes. When I run select *
> from src, I can get a result, but when I run select count(*) from src or
> select * from src limit 1, there is no result output.
>
> I have found a similar problem on Google Groups:
>
> https://groups.google.com/forum/#!searchin/spark-users/Shark$20does$20not$20give$20any$20results$20with$20SELECT$20command/spark-users/oKMBPBWim0U/_hbDCi4m-xUJ
> but there is no solution there.
>
> Has anyone encountered such a problem?
>


Re: java.net.SocketException on reduceByKey() in pyspark

2014-03-19 Thread Uri Laserson
I have the exact same error running on a bare metal cluster with CentOS6
and Python 2.6.6.  Any other thoughts on the problem here?  I only get the
error on operations that require communication, like reduceByKey or groupBy.


On Sun, Mar 2, 2014 at 1:29 PM, Nicholas Chammas  wrote:

> Alright, so this issue is related to the upgrade to Python 2.7, which
> relates it to the other Python 2.7 issue I reported in this thread.
>
> I modified my code not to rely on Python 2.7, spun up a new cluster and
> did *not* upgrade its version of Python from 2.6.8. The code ran fine.
>
> I'd open a JIRA issue about this, but I cannot provide a simple repro that
> anyone can walk through.
>
> Nick
>
>
> On Fri, Feb 28, 2014 at 11:44 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Even a count() on the result of the flatMap() fails with the same error.
>> Somehow the formatting of the error output got messed up in my previous email,
>> so here's a relevant snippet of the output again.
>>
>> 14/03/01 04:39:01 INFO scheduler.DAGScheduler: Failed to run count at
>> :1
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/root/spark/python/pyspark/rdd.py", line 542, in count
>> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>>   File "/root/spark/python/pyspark/rdd.py", line 533, in sum
>> return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
>>   File "/root/spark/python/pyspark/rdd.py", line 499, in reduce
>> vals = self.mapPartitions(func).collect()
>>   File "/root/spark/python/pyspark/rdd.py", line 463, in collect
>> bytesInJava = self._jrdd.collect().iterator()
>>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> line 537, in __call__
>>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line
>> 300, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling o396.collect.
>> : org.apache.spark.SparkException: Job aborted: Task 29.0:0 failed 4
>> times (most recent failure: Exception failure: java.net.SocketException:
>> Connection reset)
>>  at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>>  at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>  at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
>>  at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>>  at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
>>  at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>  at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> Any pointers to where I should look, or things to try?
>>
>> Nick
>>
>>
>>
>> On Fri, Feb 28, 2014 at 6:33 PM, nicholas.chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I've done a whole bunch of things to this RDD, and now when I try to
>>> sortByKey(), this is what I get:
>>>
>>> >>> flattened_po.flatMap(lambda x: 
>>> >>> map_to_database_types(x)).sortByKey()14/02/28
>>> 23:18:41 INFO spark.SparkContext: Starting job: sortByKey at 
>>> :114/02/28
>>> 23:18:41 INFO scheduler.DAGScheduler: Got job 22 (sortByKey at :1)
>>> with 1 output partitions (allowLocal=false)14/02/28 23:18:41 INFO
>>> scheduler.DAGScheduler: Final stage: Stage 23 (sortByKey at 
>>> :1)14/02/28
>>> 23:18:41 INFO scheduler.DAGScheduler: Parents of final stage: List()14/02/28
>>> 23:18:41 INFO scheduler.DAGScheduler: Missing parents: List()14/02/28
>>> 23:18:41 INFO scheduler.DAGScheduler: Submitting Stage 23 (PythonRDD[41] at
>>> sortByKey at :1), which has no missing parents14/02/28 23:18:41
>>> INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 2

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Nicholas Chammas
Okie doke, good to know.


On Wed, Mar 19, 2014 at 7:35 PM, Matei Zaharia wrote:

> Yes, Spark automatically removes old RDDs from the cache when you make new
> ones. Unpersist forces it to remove them right away. In both cases though,
> note that Java doesn’t garbage-collect the objects released until later.
>
> Matei
>
> On Mar 19, 2014, at 7:22 PM, Nicholas Chammas 
> wrote:
>
> Related question:
>
> If I keep creating new RDDs and cache()-ing them, does Spark automatically
> unpersist the least recently used RDD when it runs out of memory? Or is an
> explicit unpersist the only way to get rid of an RDD (barring the PR
> Tathagata mentioned)?
>
> Also, does unpersist()-ing an RDD immediately free up space, or just allow
> that space to be reclaimed when needed?
>
>
> On Wed, Mar 19, 2014 at 7:01 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Just a heads up, there is an active
>> *pull request* that will
>> automatically unpersist RDDs that are not in reference/scope from the
>> application any more.
>>
>> TD
>>
>>
>> On Wed, Mar 19, 2014 at 6:58 PM, hequn cheng wrote:
>>
>>> persist and unpersist.
>>> unpersist:Mark the RDD as non-persistent, and remove all blocks for it
>>> from memory and disk
>>>
>>>
>>> 2014-03-19 16:40 GMT+08:00 林武康 :
>>>
>>>   Hi, can any one tell me about the lifecycle of an rdd? I search
 through the official website and still can't figure it out. Can I use an
 rdd in some stages and destroy it in order to release memory because that
 no stages ahead will use this rdd any more. Is it possible?

 Thanks!

 Sincerely
 Lin wukang

>>>
>>>
>>
>
>


Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Matei Zaharia
Yes, Spark automatically removes old RDDs from the cache when you make new 
ones. Unpersist forces it to remove them right away. In both cases though, note 
that Java doesn’t garbage-collect the objects released until later.

Matei
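
For illustration, a minimal Scala sketch of the lifecycle described above (not code from the thread; the data, master and app name are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cache an RDD, reuse it across jobs, then release it explicitly
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-lifecycle"))

val data = sc.parallelize(1 to 1000000).map(_ * 2).cache() // marked for caching
data.count()        // first action materializes and caches the partitions
data.reduce(_ + _)  // served from the cache

data.unpersist()    // removes the cached blocks right away
// Without unpersist, old cached RDDs are dropped in LRU fashion when space is needed,
// though the freed Java objects are only reclaimed by a later GC.

sc.stop()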

On Mar 19, 2014, at 7:22 PM, Nicholas Chammas  
wrote:

> Related question: 
> 
> If I keep creating new RDDs and cache()-ing them, does Spark automatically 
> unpersist the least recently used RDD when it runs out of memory? Or is an 
> explicit unpersist the only way to get rid of an RDD (barring the PR 
> Tathagata mentioned)?
> 
> Also, does unpersist()-ing an RDD immediately free up space, or just allow 
> that space to be reclaimed when needed?
> 
> 
> On Wed, Mar 19, 2014 at 7:01 PM, Tathagata Das  
> wrote:
> Just a heads up, there is an active pull request that will automatically 
> unpersist RDDs that are not in reference/scope from the application any more. 
> 
> TD
> 
> 
> On Wed, Mar 19, 2014 at 6:58 PM, hequn cheng  wrote:
> persist and unpersist.
> unpersist:Mark the RDD as non-persistent, and remove all blocks for it from 
> memory and disk
> 
> 
> 2014-03-19 16:40 GMT+08:00 林武康 :
> 
> Hi, can any one tell me about the lifecycle of an rdd? I search through the 
> official website and still can't figure it out. Can I use an rdd in some 
> stages and destroy it in order to release memory because that no stages ahead 
> will use this rdd any more. Is it possible?
> 
> Thanks!
> 
> Sincerely 
> Lin wukang
> 
> 
> 



Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Nicholas Chammas
Related question:

If I keep creating new RDDs and cache()-ing them, does Spark automatically
unpersist the least recently used RDD when it runs out of memory? Or is an
explicit unpersist the only way to get rid of an RDD (barring the PR
Tathagata mentioned)?

Also, does unpersist()-ing an RDD immediately free up space, or just allow
that space to be reclaimed when needed?


On Wed, Mar 19, 2014 at 7:01 PM, Tathagata Das
wrote:

> Just a heads up, there is an active
> *pull request* that will
> automatically unpersist RDDs that are not in reference/scope from the
> application any more.
>
> TD
>
>
> On Wed, Mar 19, 2014 at 6:58 PM, hequn cheng  wrote:
>
>> persist and unpersist.
>> unpersist:Mark the RDD as non-persistent, and remove all blocks for it
>> from memory and disk
>>
>>
>> 2014-03-19 16:40 GMT+08:00 林武康 :
>>
>>   Hi, can any one tell me about the lifecycle of an rdd? I search
>>> through the official website and still can't figure it out. Can I use an
>>> rdd in some stages and destroy it in order to release memory because that
>>> no stages ahead will use this rdd any more. Is it possible?
>>>
>>> Thanks!
>>>
>>> Sincerely
>>> Lin wukang
>>>
>>
>>
>


Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Tathagata Das
Just a heads up, there is an active
*pull request* that will
automatically unpersist RDDs that are not in reference/scope from the
application any more.

TD


On Wed, Mar 19, 2014 at 6:58 PM, hequn cheng  wrote:

> persist and unpersist.
> unpersist:Mark the RDD as non-persistent, and remove all blocks for it
> from memory and disk
>
>
> 2014-03-19 16:40 GMT+08:00 林武康 :
>
>   Hi, can anyone tell me about the lifecycle of an rdd? I searched through
>> the official website and still can't figure it out. Can I use an rdd in
>> some stages and destroy it in order to release memory, because no stages
>> ahead will use this rdd any more? Is it possible?
>>
>> Thanks!
>>
>> Sincerely
>> Lin wukang
>>
>
>


Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread hequn cheng
persist and unpersist.
unpersist: Mark the RDD as non-persistent, and remove all blocks for it from
memory and disk.


2014-03-19 16:40 GMT+08:00 林武康 :

>  Hi, can anyone tell me about the lifecycle of an rdd? I searched through
> the official website and still can't figure it out. Can I use an rdd in
> some stages and destroy it in order to release memory, because no stages
> ahead will use this rdd any more? Is it possible?
>
> Thanks!
>
> Sincerely
> Lin wukang
>


Shark does not give any results with SELECT count(*) command

2014-03-19 Thread qingyang li
Hi, I installed Spark 0.9.0 and Shark 0.9 on 3 nodes. When I run select * from
src, I can get a result, but when I run select count(*) from src or select *
from src limit 1, there is no result output.

I have found a similar problem on Google Groups:
https://groups.google.com/forum/#!searchin/spark-users/Shark$20does$20not$20give$20any$20results$20with$20SELECT$20command/spark-users/oKMBPBWim0U/_hbDCi4m-xUJ
but there is no solution there.

Has anyone encountered such a problem?


Re: Machine Learning on streaming data

2014-03-19 Thread Tathagata Das
Yes, of course you can conceptually apply machine learning algorithms on
Spark Streaming. However, the current MLlib does not yet have direct support
for Spark Streaming's DStream. Since DStreams are essentially a
sequence of RDDs, you can apply MLlib algorithms on those RDDs. Take a look
at the DStream.transform() and DStream.foreachRDD() operations, which give
you access to the RDDs in a DStream. You can apply MLlib functions on them.

Some people have attempted to make a tighter integration between MLLib and
Spark Streaming. Jeremy (cc'ed) can say more about his adventures.

TD
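
For illustration, a rough Scala sketch of the foreachRDD approach (not code from the thread; it assumes the 0.9-era MLlib KMeans API, which takes an RDD[Array[Double]], and assumes each line of the stream is a comma-separated feature vector):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.streaming.dstream.DStream

// Sketch: apply an MLlib algorithm to the RDDs inside a DStream
def clusterEachBatch(lines: DStream[String]): Unit = {
  lines.foreachRDD { rdd =>
    val vectors = rdd.map(_.split(",").map(_.toDouble)) // parse features
    if (vectors.count() > 0) {
      val model = KMeans.train(vectors, 2, 10)          // k = 2, 10 iterations
      model.clusterCenters.foreach(c => println(c.mkString(",")))
    }
  }
}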


On Sun, Mar 16, 2014 at 5:56 PM, Nasir Khan wrote:

> Hi, I'm working on a project in which I have to take streaming URLs, filter
> them, and classify them as benign or suspicious. Now, Machine Learning and
> Streaming are two separate things in Apache Spark (AFAIK). My question is:
> can we apply online machine learning algorithms on streams?
>
> I am at a beginner level, so kindly explain in a bit of detail, and if someone
> can direct me to some good material, that would be great.
>
> Thanks
> Nasir Khan.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Machine-Learning-on-streaming-data-tp2732.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


in SF until Friday

2014-03-19 Thread Nicholas Chammas
I'm in San Francisco until Friday for a conference (visiting from Boston).

If any of y'all are up for a drink or something, I'd love to meet you in
person and say hi.

Nick





saveAsTextFile() failing for large datasets

2014-03-19 Thread Soila Pertet Kavulya
I am testing the performance of Spark to see how it behaves when the
dataset size exceeds the amount of memory available. I am running
wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB
RAM per node). I limited spark.executor.memory to 64g, so I have 256g
of memory available in the cluster. Wordcount  fails due to connection
errors during saveAsTextFile() when the input size is 1TB. I have
tried experimenting with different timeouts, and akka frame sizes but
the job is still failing. Are there any changes that I should make to
get the job to run successfully?

Here is my most recent config.

SPARK_WORKER_CORES=32
SPARK_WORKER_MEMORY=64g
SPARK_WORKER_INSTANCES=1
SPARK_DAEMON_MEMORY=1g

SPARK_JAVA_OPTS="-Dspark.executor.memory=64g
-Dspark.default.parallelism=128 -Dspark.deploy.spreadOut=true
-Dspark.storage.memoryFraction=0.5
-Dspark.shuffle.consolidateFiles=true -Dspark.akka.frameSize=200
-Dspark.akka.timeout=300
-Dspark.storage.blockManagerSlaveTimeoutMs=30"


Error logs:
14/03/17 13:07:52 WARN ExternalAppendOnlyMap: Spilling in-memory map
of 584 MB to disk (1 time so far)

14/03/17 13:07:52 WARN ExternalAppendOnlyMap: Spilling in-memory map
of 510 MB to disk (1 time so far)

14/03/17 13:08:03 INFO ConnectionManager: Removing ReceivingConnection
to ConnectionManagerId(node02,56673)

14/03/17 13:08:03 INFO ConnectionManager: Removing SendingConnection
to ConnectionManagerId(node02,56673)

14/03/17 13:08:03 INFO ConnectionManager: Removing SendingConnection
to ConnectionManagerId(node02,56673)

14/03/17 13:08:03 INFO ConnectionManager: Notifying
org.apache.spark.network.ConnectionManager$MessageStatus@2c762242

14/03/17 13:08:03 INFO ConnectionManager: Notifying
org.apache.spark.network.ConnectionManager$MessageStatus@7fc331db

14/03/17 13:08:03 INFO ConnectionManager: Notifying
org.apache.spark.network.ConnectionManager$MessageStatus@220eecfa

14/03/17 13:08:03 INFO ConnectionManager: Notifying
org.apache.spark.network.ConnectionManager$MessageStatus@286cca3

14/03/17 13:08:03 ERROR
BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s)
from ConnectionManagerId(node02,56673)

14/03/17 13:08:03 ERROR
BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s)
from ConnectionManagerId(node02,56673)

14/03/17 13:08:03 ERROR
BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s)
from ConnectionManagerId(node02,56673)

14/03/17 13:08:03 INFO ConnectionManager: Notifying
org.apache.spark.network.ConnectionManager$MessageStatus@2d1d6b0a

14/03/17 13:08:03 ERROR
BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s)
from ConnectionManagerId(node02,56673)


Re: spark-streaming

2014-03-19 Thread Tathagata Das
Hey Nathan,

We made that private in order to reduce the visible public API, to have
greater control in the future. Can you tell me more about the timing
information that you want to get?

TD


On Fri, Mar 14, 2014 at 8:57 PM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> I'm trying to update some spark streaming code from 0.8.1 to 0.9.0.
>
> Among other things, I've found the function clearMetadata, whose comment
> says:
>
> "...Subclasses of DStream may override this to clear their own
> metadata along with the generated RDDs"
>
> yet which is declared private[streaming].
>
> How are subclasses expected to override this if it's private? If they
> aren't, how and when should they now clear any extraneous data they have?
>
> Similarly, I now see no way to get the timing information - how is a
> custom dstream supposed to do this now?
>
> Thanks,
> -Nathan
>
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com
>


Re: spark 0.8 examples in local mode

2014-03-19 Thread maxpar
Just figured it out. I need to add a "file://" prefix to the URI (e.g.
file:///path/to/lr_data.txt). I guess it was not needed in previous Hadoop
versions.





Re: example of non-line oriented input data?

2014-03-19 Thread Jeremy Freeman
Another vote on this, support for simple SequenceFiles and/or Avro would be 
terrific, as using plain text can be very space-inefficient, especially for 
numerical data.

-- Jeremy

On Mar 19, 2014, at 5:24 PM, Nicholas Chammas  
wrote:

> I'd second the request for Avro support in Python first, followed by Parquet.
> 
> 
> On Wed, Mar 19, 2014 at 2:14 PM, Evgeny Shishkin  wrote:
> 
> On 19 Mar 2014, at 19:54, Diana Carroll  wrote:
> 
>> Actually, thinking more on this question, Matei: I'd definitely say support 
>> for Avro.  There's a lot of interest in this!!
>> 
> 
> Agree, and parquet as default Cloudera Impala format.
> 
> 
> 
> 
>> On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia  
>> wrote:
>> BTW one other thing — in your experience, Diana, which non-text InputFormats 
>> would be most useful to support in Python first? Would it be Parquet or 
>> Avro, simple SequenceFiles with the Hadoop Writable types, or something 
>> else? I think a per-file text input format that does the stuff we did here 
>> would also be good.
>> 
>> Matei
>> 
>> 
>> On Mar 18, 2014, at 3:27 PM, Matei Zaharia  wrote:
>> 
>>> Hi Diana,
>>> 
>>> This seems to work without the iter() in front if you just return 
>>> treeiterator. What happened when you didn’t include that? Treeiterator 
>>> should return an iterator.
>>> 
>>> Anyway, this is a good example of mapPartitions. It’s one where you want to 
>>> view the whole file as one object (one XML here), so you couldn’t implement 
>>> this using a flatMap, but you still want to return multiple values. The 
>>> MLlib example you saw needs Python 2.7 because unfortunately that is a 
>>> requirement for our Python MLlib support (see 
>>> http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries).
>>>  We’d like to relax this later but we’re using some newer features of NumPy 
>>> and Python. The rest of PySpark works on 2.6.
>>> 
>>> In terms of the size in memory, here both the string s and the XML tree 
>>> constructed from it need to fit in, so you can’t work on very large 
>>> individual XML files. You may be able to use a streaming XML parser instead 
>>> to extract elements from the data in a streaming fashion, without ever 
>>> materializing the whole tree. 
>>> http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreader
>>>  is one example.
>>> 
>>> Matei
>>> 
>>> On Mar 18, 2014, at 7:49 AM, Diana Carroll  wrote:
>>> 
 Well, if anyone is still following this, I've gotten the following code 
 working which in theory should allow me to parse whole XML files: (the 
 problem was that I can't return the tree iterator directly.  I have to 
 call iter().  Why?)
 
 import xml.etree.ElementTree as ET
 
 # two source files, format  >>> name="...">..
 mydata=sc.textFile("file:/home/training/countries*.xml") 
 
 def parsefile(iterator):
 s = ''
 for i in iterator: s = s + str(i)
 tree = ET.fromstring(s)
 treeiterator = tree.getiterator("country")
 # why to I have to convert an iterator to an iterator?  not sure but 
 required
 return iter(treeiterator)
 
 mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element: 
 element.attrib).collect()
 
 The output is what I expect:
 [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]
 
 BUT I'm a bit concerned about the construction of the string "s".  How big 
 can my file be before converting it to a string becomes problematic?
 
 
 
 On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll  
 wrote:
 Thanks, Matei.
 
 In the context of this discussion, it would seem mapParitions is 
 essential, because it's the only way I'm going to be able to process each 
 file as a whole, in our example of a large number of small XML files which 
 need to be parsed as a whole file because records are not required to be 
 on a single line.
 
 The theory makes sense but I'm still utterly lost as to how to implement 
 it.  Unfortunately there's only a single example of the use of 
 mapPartitions in any of the Python example programs, which is the log 
 regression example, which I can't run because it requires Python 2.7 and 
 I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6 
 is unsupported...is it?)
 
 I'd really really love to see a real life example of a Python use of 
 mapPartitions.  I do appreciate the very simple examples you provided, but 
 (perhaps because of my novice status on Python) I can't figure out how to 
 translate those to a real world situation in which I'm building RDDs from 
 files, not inline collections like [(1,2),(2,3)].
 
 Also, you say that the function called in mapPartitions can return a 
 collection OR an iterator.  I tried returning an iterator by calling 
 ElementTree getiterator function, but

Re: example of non-line oriented input data?

2014-03-19 Thread Nicholas Chammas
I'd second the request for Avro support in Python first, followed by
Parquet.


On Wed, Mar 19, 2014 at 2:14 PM, Evgeny Shishkin wrote:

>
> On 19 Mar 2014, at 19:54, Diana Carroll  wrote:
>
> Actually, thinking more on this question, Matei: I'd definitely say
> support for Avro.  There's a lot of interest in this!!
>
>
> Agree, and parquet as default Cloudera Impala format.
>
>
>
>
> On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia wrote:
>
>> BTW one other thing -- in your experience, Diana, which non-text
>> InputFormats would be most useful to support in Python first? Would it be
>> Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or
>> something else? I think a per-file text input format that does the stuff we
>> did here would also be good.
>>
>> Matei
>>
>>
>> On Mar 18, 2014, at 3:27 PM, Matei Zaharia 
>> wrote:
>>
>> Hi Diana,
>>
>> This seems to work without the iter() in front if you just return
>> treeiterator. What happened when you didn't include that? Treeiterator
>> should return an iterator.
>>
>> Anyway, this is a good example of mapPartitions. It's one where you want
>> to view the whole file as one object (one XML here), so you couldn't
>> implement this using a flatMap, but you still want to return multiple
>> values. The MLlib example you saw needs Python 2.7 because unfortunately
>> that is a requirement for our Python MLlib support (see
>> http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries).
>> We'd like to relax this later but we're using some newer features of NumPy
>> and Python. The rest of PySpark works on 2.6.
>>
>> In terms of the size in memory, here both the string s and the XML tree
>> constructed from it need to fit in, so you can't work on very large
>> individual XML files. You may be able to use a streaming XML parser instead
>> to extract elements from the data in a streaming fashion, without ever
>> materializing the whole tree.
>> http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreaderis
>>  one example.
>>
>> Matei
>>
>> On Mar 18, 2014, at 7:49 AM, Diana Carroll  wrote:
>>
>> Well, if anyone is still following this, I've gotten the following code
>> working which in theory should allow me to parse whole XML files: (the
>> problem was that I can't return the tree iterator directly.  I have to call
>> iter().  Why?)
>>
>> import xml.etree.ElementTree as ET
>>
>> # two source files, format  > name="...">..
>> mydata=sc.textFile("file:/home/training/countries*.xml")
>>
>> def parsefile(iterator):
>>     s = ''
>>     for i in iterator: s = s + str(i)
>>     tree = ET.fromstring(s)
>>     treeiterator = tree.getiterator("country")
>>     # why do I have to convert an iterator to an iterator? not sure, but required
>>     return iter(treeiterator)
>>
>> mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element:
>> element.attrib).collect()
>>
>> The output is what I expect:
>> [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]
>>
>> BUT I'm a bit concerned about the construction of the string "s".  How
>> big can my file be before converting it to a string becomes problematic?
>>
>>
>>
>> On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll wrote:
>>
>>> Thanks, Matei.
>>>
>>> In the context of this discussion, it would seem mapParitions is
>>> essential, because it's the only way I'm going to be able to process each
>>> file as a whole, in our example of a large number of small XML files which
>>> need to be parsed as a whole file because records are not required to be on
>>> a single line.
>>>
>>> The theory makes sense but I'm still utterly lost as to how to implement
>>> it.  Unfortunately there's only a single example of the use of
>>> mapPartitions in any of the Python example programs, which is the log
>>> regression example, which I can't run because it requires Python 2.7 and
>>> I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6
>>> is unsupported...is it?)
>>>
>>> I'd really really love to see a real life example of a Python use of
>>> mapPartitions.  I do appreciate the very simple examples you provided, but
>>> (perhaps because of my novice status on Python) I can't figure out how to
>>> translate those to a real world situation in which I'm building RDDs from
>>> files, not inline collections like [(1,2),(2,3)].
>>>
>>> Also, you say that the function called in mapPartitions can return a
>>> collection OR an iterator.  I tried returning an iterator by calling
>>> ElementTree getiterator function, but still got the error telling me my
>>> object was not an iterator.
>>>
>>> If anyone has a real life example of mapPartitions returning a Python
>>> iterator, that would be fabulous.
>>>
>>> Diana
>>>
>>>
>>> On Mon, Mar 17, 2014 at 6:17 PM, Matei Zaharia 
>>> wrote:
>>>
 Oh, I see, the problem is that the function you pass to mapPartitions
 must itself return an iterator or a collection. This is used so that you
 can return 

Re: example of non-line oriented input data?

2014-03-19 Thread Evgeny Shishkin

On 19 Mar 2014, at 19:54, Diana Carroll  wrote:

> Actually, thinking more on this question, Matei: I'd definitely say support 
> for Avro.  There's a lot of interest in this!!
> 

Agree, and parquet as default Cloudera Impala format.



> On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia  
> wrote:
> BTW one other thing — in your experience, Diana, which non-text InputFormats 
> would be most useful to support in Python first? Would it be Parquet or Avro, 
> simple SequenceFiles with the Hadoop Writable types, or something else? I 
> think a per-file text input format that does the stuff we did here would also 
> be good.
> 
> Matei
> 
> 
> On Mar 18, 2014, at 3:27 PM, Matei Zaharia  wrote:
> 
>> Hi Diana,
>> 
>> This seems to work without the iter() in front if you just return 
>> treeiterator. What happened when you didn’t include that? Treeiterator 
>> should return an iterator.
>> 
>> Anyway, this is a good example of mapPartitions. It’s one where you want to 
>> view the whole file as one object (one XML here), so you couldn’t implement 
>> this using a flatMap, but you still want to return multiple values. The 
>> MLlib example you saw needs Python 2.7 because unfortunately that is a 
>> requirement for our Python MLlib support (see 
>> http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries).
>>  We’d like to relax this later but we’re using some newer features of NumPy 
>> and Python. The rest of PySpark works on 2.6.
>> 
>> In terms of the size in memory, here both the string s and the XML tree 
>> constructed from it need to fit in, so you can’t work on very large 
>> individual XML files. You may be able to use a streaming XML parser instead 
>> to extract elements from the data in a streaming fashion, without ever 
>> materializing the whole tree. 
>> http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreader
>>  is one example.
>> 
>> Matei
>> 
>> On Mar 18, 2014, at 7:49 AM, Diana Carroll  wrote:
>> 
>>> Well, if anyone is still following this, I've gotten the following code 
>>> working which in theory should allow me to parse whole XML files: (the 
>>> problem was that I can't return the tree iterator directly.  I have to call 
>>> iter().  Why?)
>>> 
>>> import xml.etree.ElementTree as ET
>>> 
>>> # two source files, format  >> name="...">..
>>> mydata=sc.textFile("file:/home/training/countries*.xml") 
>>> 
>>> def parsefile(iterator):
>>> s = ''
>>> for i in iterator: s = s + str(i)
>>> tree = ET.fromstring(s)
>>> treeiterator = tree.getiterator("country")
>>> # why to I have to convert an iterator to an iterator?  not sure but 
>>> required
>>> return iter(treeiterator)
>>> 
>>> mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element: 
>>> element.attrib).collect()
>>> 
>>> The output is what I expect:
>>> [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]
>>> 
>>> BUT I'm a bit concerned about the construction of the string "s".  How big 
>>> can my file be before converting it to a string becomes problematic?
>>> 
>>> 
>>> 
>>> On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll  
>>> wrote:
>>> Thanks, Matei.
>>> 
>>> In the context of this discussion, it would seem mapParitions is essential, 
>>> because it's the only way I'm going to be able to process each file as a 
>>> whole, in our example of a large number of small XML files which need to be 
>>> parsed as a whole file because records are not required to be on a single 
>>> line.
>>> 
>>> The theory makes sense but I'm still utterly lost as to how to implement 
>>> it.  Unfortunately there's only a single example of the use of 
>>> mapPartitions in any of the Python example programs, which is the log 
>>> regression example, which I can't run because it requires Python 2.7 and 
>>> I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6 
>>> is unsupported...is it?)
>>> 
>>> I'd really really love to see a real life example of a Python use of 
>>> mapPartitions.  I do appreciate the very simple examples you provided, but 
>>> (perhaps because of my novice status on Python) I can't figure out how to 
>>> translate those to a real world situation in which I'm building RDDs from 
>>> files, not inline collections like [(1,2),(2,3)].
>>> 
>>> Also, you say that the function called in mapPartitions can return a 
>>> collection OR an iterator.  I tried returning an iterator by calling 
>>> ElementTree getiterator function, but still got the error telling me my 
>>> object was not an iterator. 
>>> 
>>> If anyone has a real life example of mapPartitions returning a Python 
>>> iterator, that would be fabulous.
>>> 
>>> Diana
>>> 
>>> 
>>> On Mon, Mar 17, 2014 at 6:17 PM, Matei Zaharia  
>>> wrote:
>>> Oh, I see, the problem is that the function you pass to mapPartitions must 
>>> itself return an iterator or a collection. This is used so that you can 
>>> return multiple output records for each input record. You can imp

how to sort within DStream or merge consecutive RDDs

2014-03-19 Thread Adrian Mocanu
Hi,
I would like to know if it is possible to sort within a DStream. I know it's 
possible to sort within an RDD, and I know it's impossible to sort across the 
entire DStream, but I would be satisfied with sorting across 2 RDDs:

1) merge 2 consecutive RDDs
2) reduce by key + sort the merged data
3) take the first half of the data in the sorted RDD
4) merge the second half of the RDD with the next consecutive RDD
5) jump to step 2) and repeat

Here's an attempt to perform the steps 1-5. I window the stream to WindowSize 
20 then I take 10 seconds of these 20 (probably the 10 seconds don't start 
exactly at the beginning of the 20sec window but ignore that for this exercise) 
by using another window of size 10. The first 10 seconds is what I want but I 
want to skip every other 10 sec window.

Tuple format: (timestamp, count)

val bigWindow= Mystream.window(20,10)
bigWindow.reduceByKey( (t1,t2) => (t1._2 + t2._2) )
 .transform(rdd=>rdd.sortByKey(true))
 .window(10,10)//make 10sec window out of the 20 sec window
 .filterBy(t =>  (t._1 <= bigWindow.midWindowValue._1 )  ) //take value 
from bigWindow (the 20 sec window) from time index 0 to 20/2 denoted in my 
filter by bigWindow.midWindowValue

a) Any ideas how to rewrite this query or how to get the element in the middle 
of a time window (or some arbitrary location)?
b) If you know how I can iterate and merge + split RDDs I'd like to give that a 
try as well instead of using 2 time windows.
c) Suggestions how to do the overall DStream reducing and sorting


-Adrian
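
For the per-window reduce-and-sort part of this, one option is to sort inside transform (a sketch only, and it does not answer the merge/split question; it assumes a DStream[(Long, Int)] of (timestamp, count) pairs named stream):

import org.apache.spark.SparkContext._                // pair RDD functions (sortByKey)
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._  // pair DStream functions (reduceByKey)
import org.apache.spark.streaming.dstream.DStream

// Sketch: sum counts per timestamp and sort each 20-second window, sliding every 10 seconds
def sortedWindows(stream: DStream[(Long, Int)]): DStream[(Long, Int)] =
  stream
    .window(Seconds(20), Seconds(10))
    .reduceByKey(_ + _)
    .transform(rdd => rdd.sortByKey(ascending = true))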



spark 0.8 examples in local mode

2014-03-19 Thread maxpar
Hi folks,

I have just upgraded to Spark 0.8.1 and tried some examples like:
./run-example org.apache.spark.examples.SparkHdfsLR local lr_data.txt 3

It turns out that Spark keeps trying to read the file from HDFS rather than
the local FS:
Client: Retrying connect to server: Node1/192.168.0.101:9000. Already
tried 0 time(s)

Is there anything I am missing in the configuration, or in the way I specified
the path?

Thanks,

Max





workers die with AssociationError

2014-03-19 Thread Eric Kimbrel
I am running spark with a cloudera cluster, spark version
0.9.0-cdh5.0.0-beta-2

While nothing else is running on the cluster, I am having frequent worker
failures with errors like:

AssociationError [akka.tcp://sparkWorker@worker5:7078] ->
[akka.tcp://sparkExecutor@worker5:37487]: Error [Association failed with
[akka.tcp://sparkExecutor@worker5:37487]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@worker5:37487]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: worker5/172.21.10.128:37487
]

These errors are not occurring while a Spark job is running; in fact, jobs
are able to run to completion without errors. Some time afterwards the
workers die.







Re: Running spark examples/scala scripts

2014-03-19 Thread Mayur Rustagi
You have to pick the right client version for your Hadoop, so basically it's
going to be your Hadoop version. A map of Hadoop versions to CDH and
Hortonworks releases is given on the Spark website.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 
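
On the standalone-app part of the question quoted below, a minimal build.sbt along these lines is a reasonable starting point (a sketch only; the Spark, Scala and hadoop-client versions are assumptions to be matched to your cluster):

// build.sbt -- minimal standalone Spark app (adjust versions to your cluster)
name := "spark-standalone-example"

version := "0.1"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
  "org.apache.hadoop" % "hadoop-client" % "1.0.1" // match your HDFS version
)

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"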



On Wed, Mar 19, 2014 at 2:55 AM, Pariksheet Barapatre
wrote:

> :-) Thanks for suggestion.
>
> I was actually asking how to run spark scripts as a standalone App. I am
> able to run Java code and Python code as standalone app.
>
>  One more doubt: the documentation says that to read an HDFS file, we need to
> add the dependency
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-client</artifactId>
>   <version>1.0.1</version>
> </dependency>
>
> How do I know the HDFS version? I just guessed 1.0.1 and it worked.
>
>
> Next task is to run scala code with sbt.
>
> Cheers
> Pari
>
>
> On 18 March 2014 22:33, Mayur Rustagi  wrote:
>
>> print out the last line & run it outside on the shell :)
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>>  @mayur_rustagi 
>>
>>
>>
>> On Tue, Mar 18, 2014 at 2:37 AM, Pariksheet Barapatre <
>> pbarapa...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I am trying to run the examples shipped with Spark, i.e. those in the
>>> examples directory.
>>>
>>> [cloudera@aster2 examples]$ ls
>>> bagel   ExceptionHandlingTest.scala  HdfsTest2.scala
>>>LocalKMeans.scala  MultiBroadcastTest.scala   SparkHdfsLR.scala
>>>  SparkPi.scala
>>> BroadcastTest.scala graphx   HdfsTest.scala
>>> LocalLR.scala  SimpleSkewedGroupByTest.scala  SparkKMeans.scala
>>>  SparkTC.scala
>>> CassandraTest.scala GroupByTest.scalaLocalALS.scala
>>> LocalPi.scala  SkewedGroupByTest.scalaSparkLR.scala
>>> DriverSubmissionTest.scala  HBaseTest.scala
>>>  LocalFileLR.scala  LogQuery.scala SparkALS.scala
>>> SparkPageRank.scala
>>>
>>>
>>> I am able to run these examples using the run-example script, but how do I
>>> run these examples without using the run-example script?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Pari
>>>
>>
>>
>
>
> --
> Cheers,
> Pari
>


Re: Connect Exception Error in spark interactive shell...

2014-03-19 Thread Mayur Rustagi
The data may be spilled off to disk, hence HDFS is a necessity for Spark.
You can run Spark on a single machine and not use HDFS, but in distributed
mode HDFS will be required.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Mar 19, 2014 at 4:10 AM, Sai Prasanna wrote:

> Mayur,
>
> While reading a local file which is not in HDFS through the Spark shell, does
> HDFS need to be up and running?
>
>
> On Tue, Mar 18, 2014 at 9:46 PM, Mayur Rustagi wrote:
>
>> Your hdfs is down. Probably forgot to format namenode.
>> check if namenode is running
>>ps -aef|grep Namenode
>> if not & data in hdfs is not critical
>> hadoop namenode -format
>> & restart hdfs
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Tue, Mar 18, 2014 at 5:59 AM, Sai Prasanna wrote:
>>
>>> Hi ALL !!
>>>
>>> In the interactive spark shell i get the following error.
>>> I just followed the steps of the video "First steps with spark - spark
>>> screen cast #1" by andy konwinski...
>>>
>>> Any thoughts ???
>>>
>>> scala> val textfile = sc.textFile("README.md")
>>> textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
>>> :12
>>>
>>> scala> textfile.count
>>> java.lang.RuntimeException: java.net.ConnectException: Call to master/
>>> 192.168.1.11:9000 failed on connection exception:
>>> java.net.ConnectException: Connection refused
>>>  at
>>> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:546)
>>> at
>>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
>>>  at
>>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
>>>  at
>>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
>>> at
>>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
>>>  at
>>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
>>> at scala.Option.map(Option.scala:133)
>>>  at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:112)
>>> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:134)
>>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
>>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
>>>  at scala.Option.getOrElse(Option.scala:108)
>>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
>>>  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26)
>>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
>>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
>>> at scala.Option.getOrElse(Option.scala:108)
>>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:886)
>>>  at org.apache.spark.rdd.RDD.count(RDD.scala:698)
>>> at (:15)
>>>  at (:20)
>>> at (:22)
>>>  at (:24)
>>> at (:26)
>>>  at .(:30)
>>> at .()
>>>  at .(:11)
>>> at .()
>>>  at $export()
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>  at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>  at java.lang.reflect.Method.invoke(Method.java:606)
>>> at
>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
>>>  at
>>> org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:897)
>>> at
>>> scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
>>>  at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: java.net.ConnectException: Call to 
>>> master/192.168.1.11:9000failed on connection exception: 
>>> java.net.ConnectException: Connection
>>> refused
>>>  at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
>>> at org.apache.hadoop.ipc.Client.call(Client.java:1075)
>>>  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>> at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>>>  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>>> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>>>  at
>>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>>> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:238)
>>>  at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:203)
>>> at
>>> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>>>  at
>>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>>>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>  at org.apache.

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Chanwit Kaewkasi
Hi Koert,

There's some NAND flash built into each node. We mount the NAND flash as
a local directory for Spark to spill data to.
A DZone article, also written by me, tells more about the cluster.
We really appreciate the design of Spark's RDDs by the Spark team.
It turned out to be perfect for ARM clusters.

http://www.dzone.com/articles/big-data-processing-arm-0

Another great thing is that our cluster can operate at the room
temperature (25C / 77F) too.

The board is the Cubieboard; here it is:
https://en.wikipedia.org/wiki/Cubieboard#Specification

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Wed, Mar 19, 2014 at 9:43 PM, Koert Kuipers  wrote:
> I don't know anything about ARM clusters, but it looks great. What are the
> specs? Do the nodes have no local disk at all?
>
>
> On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi 
> wrote:
>>
>> Hi all,
>>
>> We are a small team doing research on low-power (and low-cost) ARM
>> clusters. We built a 20-node ARM cluster that is able to run Hadoop.
>> But as all of you know, Hadoop performs on-disk operations,
>> so it's not suitable for a constrained machine powered by ARM.
>>
>> We then switched to Spark and had to say wow!!
>>
>> Spark / HDFS enables us to crunch the Wikipedia articles (from 2012), 34GB
>> in size, in 1h50m. We have identified the bottleneck and it's our
>> 100M network.
>>
>> Here's the cluster:
>> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png
>>
>> And this is what we got from Spark's shell:
>> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png
>>
>> I think it's the first ARM cluster that can process a non-trivial size
>> of Big Data.
>> (Please correct me if I'm wrong)
>> I really want to thank the Spark team that makes this possible !!
>>
>> Best regards,
>>
>> -chanwit
>>
>> --
>> Chanwit Kaewkasi
>> linkedin.com/in/chanwit
>
>


closure scope & Serialization

2014-03-19 Thread Domen Grabec
Hey,

I have 3 classes where 2 are in circular dependency like this:

package org.example
import org.apache.spark.SparkContext

class A(bLazy: => Option[B]) extends java.io.Serializable{
  lazy val b: Option[B] = bLazy
}

class B(aLazy: => Option[A]) extends java.io.Serializable{
  lazy val a: Option[A] = aLazy
}

class DoNotSerialize{

  def createAnObject(): A = {
lazy val a: A = new A(Some(b)) // Requires class DoNotSerialize to
implement java.io.Serializable
//lazy val a: A = new A(None) // Works OK
lazy val b: B = new B(Some(a))
a
  }
}

object main extends App{
 val dont = new DoNotSerialize()
 val sc = new SparkContext("local[8]", "test session")
 val scRdd = sc.parallelize(Seq(dont.createAnObject()))
 scRdd.count()
}

can someone please help me understand why in the first case DoNotSerialize
class needs to implement Serializable?

Regards, Domen


Re: trying to understand job cancellation

2014-03-19 Thread Koert Kuipers
On a Spark 1.0.0 SNAPSHOT this seems to work; at least so far I have seen no
issues.
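
For reference, a minimal sketch of the job-group API being discussed (not code from the thread; the group id, RDD and master are arbitrary):

import java.util.UUID
import org.apache.spark.SparkContext

// Sketch: tag jobs with a group id so they can be cancelled independently
val sc = new SparkContext("local[2]", "cancellation-demo")
val groupId = UUID.randomUUID().toString

sc.setJobGroup(groupId, "ad-hoc jobs that may be cancelled")
val data = sc.parallelize(1 to 1000000).cache()
// ... submit jobs (e.g. data.count()) from this thread ...

// later, possibly from another thread:
sc.cancelJobGroup(groupId) // cancels only the jobs tagged with this group id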


On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers  wrote:

> It's a 0.9 snapshot from January running in standalone mode.
>
> Have these fixes been merged into 0.9?
>
>
> On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia wrote:
>
>> Which version of Spark is this in, Koert? There might have been some
>> fixes more recently for it.
>>
>> Matei
>>
>> On Mar 5, 2014, at 5:26 PM, Koert Kuipers  wrote:
>>
>> Sorry I meant to say: seems the issue is shared RDDs between a job that
>> got cancelled and a later job.
>>
>> However even disregarding that I have the other issue that the active
>> task of the cancelled job hangs around forever, not doing anything
>> On Mar 5, 2014 7:29 PM, "Koert Kuipers"  wrote:
>>
>>> yes jobs on RDDs that were not part of the cancelled job work fine.
>>>
>>> so it seems the issue is the cached RDDs that are ahred between the
>>> cancelled job and the jobs after that.
>>>
>>>
>>> On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers  wrote:
>>>
 well, the new jobs use existing RDDs that were also used in the job
 that got killed.

 let me confirm that new jobs that use completely different RDDs do not
 get killed.



 On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi 
 wrote:

> Quite unlikely, as job ids are given in an incremental fashion, so your
> future job ids are not likely to be killed if your group id is not repeated. I
> guess the issue is something else.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers wrote:
>
>> I did that. My next job gets a random new job group id (a uuid).
>> However, that doesn't seem to stop the job from getting sucked into the
>> cancellation.
>>
>>
>> On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi <
>> mayur.rust...@gmail.com> wrote:
>>
>>> You can randomize job groups as well. to secure yourself against
>>> termination.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi 
>>>
>>>
>>>
>>> On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers wrote:
>>>
 got it. seems like i better stay away from this feature for now..


 On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <
 mayur.rust...@gmail.com> wrote:

> One issue is that job cancellation is posted on eventloop. So its
> possible that subsequent jobs submitted to job queue may beat the job
> cancellation event & hence the job cancellation event may end up 
> closing
> them too.
> So there's definitely a race condition you are risking even if not
> running into.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers 
> wrote:
>
>> SparkContext.cancelJobGroup
>>
>>
>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <
>> mayur.rust...@gmail.com> wrote:
>>
>>> How do you cancel the job. Which API do you use?
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>>  @mayur_rustagi 
>>>
>>>
>>>
>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers >> > wrote:
>>>
 i also noticed that jobs (with a new JobGroupId) which I run
 after this and which use the same RDDs get very confused. I see lots of
 cancelled stages and retries that go on forever.
 cancelled stages and retries that go on forever.


 On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <
 ko...@tresata.com> wrote:

> i have a running job that i cancel while keeping the spark
> context alive.
>
> at the time of cancellation the active stage is 14.
>
> i see in logs:
> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to
> cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
> Cancelling stage 10
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
> Cancelling stage 14
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14
> was cancelled
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove
> TaskSet 14.0 from pool x
> 2014/03/04 16:43:19 INFO s

Re: partitioning via groupByKey

2014-03-19 Thread Jaka Jančar
The former: a single new RDD is returned.

Check the PairRDDFunctions docs 
(http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions):

def groupByKey(): RDD[(K, Seq[V])]
Group the values for each key in the RDD into a single sequence.
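
A small Scala illustration of the point (a sketch, not from the thread): the shuffle moves tuples between nodes, but the result is still one RDD, now partitioned by key.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair RDD functions (groupByKey)

// Sketch: groupByKey shuffles data across nodes but yields a single RDD
val sc = new SparkContext("local[2]", "groupByKey-demo")
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

val grouped = pairs.groupByKey() // RDD[(String, Seq[Int])] -- one RDD, hash-partitioned
println(grouped.partitioner)     // Some(HashPartitioner) after the shuffle
println(grouped.collect().toList)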


On Wednesday, March 19, 2014 at 9:32 AM, Adrian Mocanu wrote:

> When you partition via groupByKey, tuples (parts of the RDD) are moved from
> one node to another based on key (hash partitioning).
> Do the tuples remain part of one RDD as before, just moved to different nodes, or
> does this shuffling create, say, several RDDs which will each have parts of the
> original RDD?
>  
> Thanks
> -Adrian
>  
> 
> 
> 




Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
Actually, thinking more on this question, Matei: I'd definitely say support
for Avro.  There's a lot of interest in this!!


On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia wrote:

> BTW one other thing -- in your experience, Diana, which non-text
> InputFormats would be most useful to support in Python first? Would it be
> Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or
> something else? I think a per-file text input format that does the stuff we
> did here would also be good.
>
> Matei
>
>
> On Mar 18, 2014, at 3:27 PM, Matei Zaharia 
> wrote:
>
> Hi Diana,
>
> This seems to work without the iter() in front if you just return
> treeiterator. What happened when you didn't include that? Treeiterator
> should return an iterator.
>
> Anyway, this is a good example of mapPartitions. It's one where you want
> to view the whole file as one object (one XML here), so you couldn't
> implement this using a flatMap, but you still want to return multiple
> values. The MLlib example you saw needs Python 2.7 because unfortunately
> that is a requirement for our Python MLlib support (see
> http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries).
> We'd like to relax this later but we're using some newer features of NumPy
> and Python. The rest of PySpark works on 2.6.
>
> In terms of the size in memory, here both the string s and the XML tree
> constructed from it need to fit in, so you can't work on very large
> individual XML files. You may be able to use a streaming XML parser instead
> to extract elements from the data in a streaming fashion, without every
> materializing the whole tree.
> http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreaderis
>  one example.
>
> Matei
>
> On Mar 18, 2014, at 7:49 AM, Diana Carroll  wrote:
>
> Well, if anyone is still following this, I've gotten the following code
> working which in theory should allow me to parse whole XML files: (the
> problem was that I can't return the tree iterator directly.  I have to call
> iter().  Why?)
>
> import xml.etree.ElementTree as ET
>
> # two source files, format   name="...">..
> mydata=sc.textFile("file:/home/training/countries*.xml")
>
> def parsefile(iterator):
> s = ''
> for i in iterator: s = s + str(i)
> tree = ET.fromstring(s)
> treeiterator = tree.getiterator("country")
> # why to I have to convert an iterator to an iterator?  not sure but
> required
> return iter(treeiterator)
>
> mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element:
> element.attrib).collect()
>
> The output is what I expect:
> [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]
>
> BUT I'm a bit concerned about the construction of the string "s".  How big
> can my file be before converting it to a string becomes problematic?
>
>
>
> On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll wrote:
>
>> Thanks, Matei.
>>
>> In the context of this discussion, it would seem mapParitions is
>> essential, because it's the only way I'm going to be able to process each
>> file as a whole, in our example of a large number of small XML files which
>> need to be parsed as a whole file because records are not required to be on
>> a single line.
>>
>> The theory makes sense but I'm still utterly lost as to how to implement
>> it.  Unfortunately there's only a single example of the use of
>> mapPartitions in any of the Python example programs, which is the log
>> regression example, which I can't run because it requires Python 2.7 and
>> I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6
>> is unsupported...is it?)
>>
>> I'd really really love to see a real life example of a Python use of
>> mapPartitions.  I do appreciate the very simple examples you provided, but
>> (perhaps because of my novice status on Python) I can't figure out how to
>> translate those to a real world situation in which I'm building RDDs from
>> files, not inline collections like [(1,2),(2,3)].
>>
>> Also, you say that the function called in mapPartitions can return a
>> collection OR an iterator.  I tried returning an iterator by calling
>> ElementTree getiterator function, but still got the error telling me my
>> object was not an iterator.
>>
>> If anyone has a real life example of mapPartitions returning a Python
>> iterator, that would be fabulous.
>>
>> Diana
>>
>>
>> On Mon, Mar 17, 2014 at 6:17 PM, Matei Zaharia 
>> wrote:
>>
>>> Oh, I see, the problem is that the function you pass to mapPartitions
>>> must itself return an iterator or a collection. This is used so that you
>>> can return multiple output records for each input record. You can implement
>>> most of the existing map-like operations in Spark, such as map, filter,
>>> flatMap, etc, with mapPartitions, as well as new ones that might do a
>>> sliding window over each partition for example, or accumulate data across
>>> elements (e.g. to compute a sum).
>>>
>>> For example, if you have data = sc.

partitioning via groupByKey

2014-03-19 Thread Adrian Mocanu
When you partition via groupByKey, tuples (parts of the RDD) are moved from one 
node to another based on key (hash partitioning).
Do the tuples remain part of one RDD as before, just moved to different nodes, or 
does this shuffling create, say, several RDDs which each hold parts of the 
original RDD?

Thanks
-Adrian



Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Christopher Nguyen
Chanwit, that is awesome!

Improvements in shuffle operations should help improve life even more for
you. Great to see a data point on ARM.

Sent while mobile. Pls excuse typos etc.
On Mar 18, 2014 7:36 PM, "Chanwit Kaewkasi"  wrote:

> Hi all,
>
> We are a small team doing a research on low-power (and low-cost) ARM
> clusters. We built a 20-node ARM cluster that be able to start Hadoop.
> But as all of you've known, Hadoop is performing on-disk operations,
> so it's not suitable for a constraint machine powered by ARM.
>
> We then switched to Spark and had to say wow!!
>
> Spark / HDFS enables us to crush Wikipedia articles (of year 2012) of
> size 34GB in 1h50m. We have identified the bottleneck and it's our
> 100M network.
>
> Here's the cluster:
> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png
>
> And this is what we got from Spark's shell:
> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png
>
> I think it's the first ARM cluster that can process a non-trivial size
> of Big Data.
> (Please correct me if I'm wrong)
> I really want to thank the Spark team that makes this possible !!
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>


Re: How to distribute external executable (script) with Spark ?

2014-03-19 Thread Mayur Rustagi
I doubt there is something like this out of the box. The easiest thing is to
package it into a jar and send that jar across.
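For what it's worth, a rough sketch of another route for shipping a plain file, using SparkContext.addFile plus SparkFiles.get on the workers; the script name, input path, and per-line invocation below are assumptions, not something from this thread:

import org.apache.spark.SparkFiles
import scala.sys.process._

// Hypothetical blackbox script and input; `sc` is an existing SparkContext.
sc.addFile("/local/path/to/blackbox.sh")          // copied to every worker

val metrics = sc.textFile("hdfs:///data/input").mapPartitions { lines =>
  val script = SparkFiles.get("blackbox.sh")      // local path on this worker
  lines.map(line => Seq("/bin/bash", script, line).!!.trim)
}
metrics.saveAsTextFile("hdfs:///data/metrics")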
Regards

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Mar 19, 2014 at 6:57 AM, Jaonary Rabarisoa wrote:

> Hi all,
>
> I'm trying to build an evaluation platform based on Spark. The idea is to
> run a blackbox executable (build with c/c++ or some scripting language).
> This blackbox takes a set of data as input and outpout some metrics. Since
> I have a huge amount of data, I need to distribute the computation and use
> tools like mapreduce.
>
> The question is, how do I send these blacboxes executable to each node
> automatically so they can be called. I need something similar to addJar but
> for any kind of files.
>
>
> Cheers,
>
>
>


Re: Incrementally add/remove vertices in GraphX

2014-03-19 Thread Alessandro Lulli
Hi All,

Thanks for your answer.

Regarding GraphX streaming:

   - Is there an issue (pull request) to follow to keep track of the update?
   - Where is it possible to find a description and details of what will be
   provided?


Thanks for your help and your time to answer my questions
Alessandro



On Wed, Mar 19, 2014 at 2:43 AM, Ankur Dave  wrote:

> As Matei said, there's currently no support for incrementally adding
> vertices or edges to their respective partitions. Doing this efficiently
> would require extensive modifications to GraphX, so for now, the only
> options are to rebuild the indices on every graph modification, or to use
> the subgraph operator if the modification only involves removing vertices
> and edges.
>
> However, Joey and I are working on GraphX streaming, which is currently in
> the very early stages but eventually will enable this.
>
> Ankur 
>
>
> On Tue, Mar 18, 2014 at 3:30 PM, Matei Zaharia wrote:
>
>> I just meant that you call union() before creating the RDDs that you pass
>> to new Graph(). If you call it after it will produce other RDDs.
>>
>> The Graph() constructor actually shuffles and "indexes" the data to make
>> graph operations efficient, so it's not too easy to add elements after. You
>> could access graph.vertices and graph.edges to build new RDDs, and then
>> call Graph() again to make a new graph. I've CCed Joey and Ankur to see if
>> they have further ideas on how to optimize this. It would be cool to
>> support more efficient union and subtracting of graphs once they've been
>> partitioned by GraphX.
>>
>> Matei
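For illustration, a minimal sketch of the rebuild approach described above; the existing `graph` (here taken to be a Graph[String, Int]), the extra vertex/edge data, and `sc` are all assumptions:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Hypothetical additions to an existing graph: Graph[String, Int]
val extraVertices: RDD[(Long, String)] = sc.parallelize(Seq((100L, "new vertex")))
val extraEdges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 100L, 1)))

// Union the underlying RDDs and build a fresh graph; GraphX re-indexes here.
val updated: Graph[String, Int] =
  Graph(graph.vertices.union(extraVertices), graph.edges.union(extraEdges))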
>>
>> On Mar 14, 2014, at 8:32 AM, alelulli  wrote:
>>
>> > Hi Matei,
>> >
>> > Could you please clarify why i must call union before creating the
>> graph?
>> >
>> > What's the behavior if i call union / subtract after the creation?
>> > Is the added /removed vertexes been processed?
>> >
>> > For example if i'm implementing an iterative algorithm and at the 5th
>> step i
>> > need to add some vertex / edge, can i call union / subtract on the
>> > VertexRDD, EdgeRDD and Triplets?
>> >
>> > Thanks
>> > Alessandro
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Incrementally-add-remove-vertices-in-GraphX-tp2227p2695.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>


Transitive dependency incompatibility

2014-03-19 Thread Jaka Jančar
Hi,

I'm getting the following error:

java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:85)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:93)
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:155)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:334)
at com.celtra.analyzer.TrackingLogRDD.createClient(TrackingLogRDD.scala:131)
at com.celtra.analyzer.TrackingLogRDD.compute(TrackingLogRDD.scala:117)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:215)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

My app uses the AWS SDK, which requires 
org.apache.httpcomponents:httpclient:4.2.

I believe the error is caused by the fact that an older version of the package 
is already present on the classpath:

Spark -> Akka -> sjson -> org.apache.httpcomponents:httpclient:4.1
Spark -> jets3t -> commons-httpclient:commons-httpclient:3.1


What are my options if I need to use a newer version of the library in my app?

Thanks,
Jaka







Is shutting down of SparkContext optional?

2014-03-19 Thread Roman Pastukhov
Hi,

After switching from Spark 0.8.0 to Spark 0.9.0 (and to Scala 2.10), one
application started hanging after the main thread is done (in 'local[2]' mode,
without a cluster).

Adding SparkContext.stop() at the end solves this.
Is this behavior normal, and is shutting down the SparkContext required?
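For illustration, a minimal sketch of the shutdown pattern in question (the object name, app name, and sample job are placeholders):

import org.apache.spark.SparkContext

object StopExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "stop-example")
    try {
      println(sc.parallelize(1 to 10).count())   // ... application logic ...
    } finally {
      sc.stop()   // lets the scheduler/actor threads shut down so the JVM can exit
    }
  }
}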


Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Koert Kuipers
i don't know anything about arm clusters but it looks great. what are
the specs? do the nodes have no local disk at all?


On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi wrote:

> Hi all,
>
> We are a small team doing a research on low-power (and low-cost) ARM
> clusters. We built a 20-node ARM cluster that be able to start Hadoop.
> But as all of you've known, Hadoop is performing on-disk operations,
> so it's not suitable for a constraint machine powered by ARM.
>
> We then switched to Spark and had to say wow!!
>
> Spark / HDFS enables us to crush Wikipedia articles (of year 2012) of
> size 34GB in 1h50m. We have identified the bottleneck and it's our
> 100M network.
>
> Here's the cluster:
> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png
>
> And this is what we got from Spark's shell:
> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png
>
> I think it's the first ARM cluster that can process a non-trivial size
> of Big Data.
> (Please correct me if I'm wrong)
> I really want to thank the Spark team that makes this possible !!
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>


RE: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Xia, Junluan
Very cool!

-Original Message-
From: Chanwit Kaewkasi [mailto:chan...@gmail.com] 
Sent: Wednesday, March 19, 2014 10:36 AM
To: user@spark.apache.org
Subject: Spark enables us to process Big Data on an ARM cluster !!

Hi all,

We are a small team doing research on low-power (and low-cost) ARM clusters. 
We built a 20-node ARM cluster that is able to run Hadoop.
But as all of you know, Hadoop performs on-disk operations, so it's 
not well suited to a constrained machine powered by ARM.

We then switched to Spark and had to say wow!!

Spark / HDFS enables us to crunch the Wikipedia articles (of year 2012), 34 GB in 
size, in 1h50m. We have identified the bottleneck and it's our 100M network.

Here's the cluster:
https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png

And this is what we got from Spark's shell:
https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png

I think it's the first ARM cluster that can process a non-trivial size of Big 
Data.
(Please correct me if I'm wrong)
I really want to thank the Spark team that makes this possible !!

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Thanks, it worked.

Very basic question: I have created a custom input format, e.g. Stock. How do
I refer to this class as a custom InputFormat? I.e., in which Linux folder should
I keep this class? Do I need to add this jar, and if so, how?
I am running the code through spark-shell.

Thanks
Pari
On 19-Mar-2014 7:35 pm, "Shixiong Zhu"  wrote:

> The correct import statement is "import
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat".
>
> Best Regards,
> Shixiong Zhu
>
>
> 2014-03-19 18:46 GMT+08:00 Pariksheet Barapatre :
>
>> Seems like import issue, ran with HadoopFile and it worked. Not getting
>> import statement for textInputFormat class location for new API.
>>
>> Can anybody help?
>>
>> Thanks
>> Pariksheet
>>
>>
>> On 19 March 2014 16:05, Bertrand Dechoux  wrote:
>>
>>> I don't know the Spark issue but the Hadoop context is clear.
>>>
>>> old api -> org.apache.hadoop.mapred
>>> new api -> org.apache.hadoop.mapreduce
>>>
>>> You might only need to change your import.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>>
>>> On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre <
>>> pbarapa...@gmail.com> wrote:
>>>
 Hi,

 Trying to read HDFS file with TextInputFormat.

 scala> import org.apache.hadoop.mapred.TextInputFormat
 scala> import org.apache.hadoop.io.{LongWritable, Text}
 scala> val file2 =
 sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")


 This is giving me the error.

 :14: error: type arguments
 [org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat]
 conform to the bounds of none of the overloaded alternatives of
  value newAPIHadoopFile: [K, V, F <:
 org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String, fClass:
 Class[F], kClass: Class[K], vClass: Class[V], conf:
 org.apache.hadoop.conf.Configuration)org.apache.spark.rdd.RDD[(K, V)] 
 [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path:
 String)(implicit km: scala.reflect.ClassTag[K], implicit vm:
 scala.reflect.ClassTag[V], implicit fm:
 scala.reflect.ClassTag[F])org.apache.spark.rdd.RDD[(K, V)]
val file2 =
 sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")


 What is correct syntax if I want to use TextInputFormat.

 Also, how to use customInputFormat. Very silly question but I am not
 sure how and where to keep jar file containing customInputFormat class.

 Thanks
 Pariksheet



 --
 Cheers,
 Pari

>>>
>>>
>>
>>
>> --
>> Cheers,
>> Pari
>>
>
>


Re: Separating classloader management from SparkContexts

2014-03-19 Thread Punya Biswal
Hi Andrew,

Thanks for pointing me to that example. My understanding of the JobServer
(based on watching a demo of its UI) is that it maintains a set of spark
contexts and allows people to add jars to them, but doesn't allow unloading
or reloading jars within a spark context. The code in JobCache appears to be
a performance enhancement to speed up retrieval of jars that are used
frequently -- the classloader change is purely on the driver side, so that
the driver can serialize the job instance. I'm looking for a classloader
change on the executor-side, so that different jars can be uploaded to the
same SparkContext even if they contain some of the same classes.

Punya

From:  Andrew Ash 
Reply-To:  "user@spark.apache.org" 
Date:  Wednesday, March 19, 2014 at 2:03 AM
To:  "user@spark.apache.org" 
Subject:  Re: Separating classloader management from SparkContexts

Hi Punya, 

This seems like a problem that the recently-announced job-server would
likely have run into at one point.  I haven't tested it yet, but I'd be
interested to see what happens when two jobs in the job server have
conflicting classes.  Does the server correctly segregate each job's classes
from other concurrently-running jobs?

From my reading of the code I think it may not work the way I'd want it to,
though there are a few classloader tricks going on.

https://github.com/ooyala/spark-jobserver/blob/master/job-server/src/spark.jobserver/JobCache.scala


In line 29 there the jar is added to the SparkContext, and in 30 the jar is
added to the job-server's local classloader.

Note all this PR related to classloaders -
https://github.com/apache/spark/pull/119


Andrew



On Tue, Mar 18, 2014 at 9:24 AM, Punya Biswal  wrote:
> Hi Spark people,
> 
> Sorry to bug everyone again about this, but do people have any thoughts on
> whether sub-contexts would be a good way to solve this problem? I'm thinking
> of something like
> 
> class SparkContext {
>   // ... stuff ...
>   def inSubContext[T](fn: SparkContext => T): T
> }
> 
> this way, I could do something like
> 
> val sc = /* get myself a spark context somehow */;
> val rdd = sc.textFile("/stuff.txt")
> sc.inSubContext { sc1 =>
>   sc1.addJar("extras-v1.jar")
>   print(sc1.filter(/* fn that depends on jar */).count)
> }
> sc.inSubContext { sc2 =>
>   sc2.addJar("extras-v2.jar")
>   print(sc2.filter(/* fn that depends on jar */).count)
> }
> 
> ... even if classes in extras-v1.jar and extras-v2.jar have name collisions.
> 
> Punya
> 
> From: Punya Biswal 
> Reply-To: 
> Date: Sunday, March 16, 2014 at 11:09 AM
> To: "user@spark.apache.org" 
> Subject: Separating classloader management from SparkContexts
> 
> Hi all,
> 
> I'm trying to use Spark to support users who are interactively refining the
> code that processes their data. As a concrete example, I might create an
> RDD[String] and then write several versions of a function to map over the RDD
> until I'm satisfied with the transformation. Right now, once I do addJar() to
> add one version of the jar to the SparkContext, there's no way to add a new
> version of the jar unless I rename the classes and functions involved, or lose
> my current work by re-creating the SparkContext. Is there a better way to do
> this?
> 
> One idea that comes to mind is that we could add APIs to create "sub-contexts"
> from within a SparkContext. Jars added to a sub-context would get added to a
> child classloader on the executor, so that different sub-contexts could use
> classes with the same name while still being able to access on-heap objects
> for RDDs. If this makes sense conceptually, I'd like to work on a PR to add
> such functionality to Spark.
> 
> Punya
> 





smime.p7s
Description: S/MIME cryptographic signature


Re: Joining two HDFS files in in Spark

2014-03-19 Thread Shixiong Zhu
Do you want to read the file content in the following statement?

val ny_daily= sc.parallelize(List("hdfs://localhost:8020/user/user/
NYstock/NYSE_daily"))

If so, you should use "textFile", e.g.,

val ny_daily= sc.textFile("hdfs://localhost:8020/user/user/
NYstock/NYSE_daily")

"parallelize" is used to create a RDD from a collection.


Best Regards,
Shixiong Zhu


2014-03-19 20:52 GMT+08:00 Yana Kadiyska :

> Not sure what you mean by "not getting information how to join". If
> you mean that you can't see the result I believe you need to collect
> the result of the join on the driver, as in
>
> val joinedRdd=enKeyValuePair1.join(enKeyValuePair)
> joinedRdd.collect().foreach(println)
>
>
>
> On Wed, Mar 19, 2014 at 4:57 AM, Chhaya Vishwakarma
>  wrote:
> > Hi
> >
> >
> >
> > I want to join two files from HDFS using spark shell.
> >
> > Both the files are tab separated and I want to join on second column
> >
> >
> >
> > Tried code
> >
> > But not giving any output
> >
> >
> >
> > val ny_daily=
> >
> sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))
> >
> >
> >
> > val ny_daily_split = ny_daily.map(line =>line.split('\t'))
> >
> >
> >
> > val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5),
> > line(3).toInt))
> >
> >
> >
> >
> >
> > val ny_dividend=
> >
> sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends"))
> >
> >
> >
> > val ny_dividend_split = ny_dividend.map(line =>line.split('\t'))
> >
> >
> >
> > val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0,
> > 4), line(3).toInt))
> >
> >
> >
> > enKeyValuePair1.join(enKeyValuePair)
> >
> >
> >
> >
> >
> > But I am not getting any information for how to join files on particular
> > column
> >
> > Please suggest
> >
> >
> >
> >
> >
> >
> >
> > Regards,
> >
> > Chhaya Vishwakarma
> >
> >
> >
> >
> > 
> > The contents of this e-mail and any attachment(s) may contain
> confidential
> > or privileged information for the intended recipient(s). Unintended
> > recipients are prohibited from taking action on the basis of information
> in
> > this e-mail and using or disseminating the information, and must notify
> the
> > sender and delete it from their system. L&T Infotech will not accept
> > responsibility or liability for the accuracy or completeness of, or the
> > presence of any virus or disabling code in this e-mail"
>


Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Shixiong Zhu
The correct import statement is "import
org.apache.hadoop.mapreduce.lib.input.TextInputFormat".
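For illustration, a minimal sketch with the new-API import (the HDFS path is the one from the original post):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// This TextInputFormat satisfies the F <: org.apache.hadoop.mapreduce.InputFormat[K,V] bound.
val file2 = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs://192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")

file2.map(pair => pair._2.toString).take(5).foreach(println)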

Best Regards,
Shixiong Zhu


2014-03-19 18:46 GMT+08:00 Pariksheet Barapatre :

> Seems like import issue, ran with HadoopFile and it worked. Not getting
> import statement for textInputFormat class location for new API.
>
> Can anybody help?
>
> Thanks
> Pariksheet
>
>
> On 19 March 2014 16:05, Bertrand Dechoux  wrote:
>
>> I don't know the Spark issue but the Hadoop context is clear.
>>
>> old api -> org.apache.hadoop.mapred
>> new api -> org.apache.hadoop.mapreduce
>>
>> You might only need to change your import.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre <
>> pbarapa...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Trying to read HDFS file with TextInputFormat.
>>>
>>> scala> import org.apache.hadoop.mapred.TextInputFormat
>>> scala> import org.apache.hadoop.io.{LongWritable, Text}
>>> scala> val file2 =
>>> sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
>>> 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")
>>>
>>>
>>> This is giving me the error.
>>>
>>> :14: error: type arguments
>>> [org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat]
>>> conform to the bounds of none of the overloaded alternatives of
>>>  value newAPIHadoopFile: [K, V, F <:
>>> org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String, fClass:
>>> Class[F], kClass: Class[K], vClass: Class[V], conf:
>>> org.apache.hadoop.conf.Configuration)org.apache.spark.rdd.RDD[(K, V)] 
>>> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path:
>>> String)(implicit km: scala.reflect.ClassTag[K], implicit vm:
>>> scala.reflect.ClassTag[V], implicit fm:
>>> scala.reflect.ClassTag[F])org.apache.spark.rdd.RDD[(K, V)]
>>>val file2 =
>>> sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
>>> 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")
>>>
>>>
>>> What is correct syntax if I want to use TextInputFormat.
>>>
>>> Also, how to use customInputFormat. Very silly question but I am not
>>> sure how and where to keep jar file containing customInputFormat class.
>>>
>>> Thanks
>>> Pariksheet
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> Pari
>>>
>>
>>
>
>
> --
> Cheers,
> Pari
>


Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
If I don't call iter(), and just return treeiterator directly, I get an
error message that the object is not of an iterator type.  This is in
Python 2.6...perhaps a bug?

BUT I also realized my code was wrong.  It results in an RDD containing all
the tags in all the files.  What I really want is an RDD where each record
corresponds to a single file.  So if I have a thousand files, I should have
a thousand elements in my RDD, each of which is an ElementTree.  (Which I
can then use to map or flatMap to pull out the data I actually care about.)

So, this works:

def parsefile(iterator):
s = ''
for i in iterator: s = s + str(i)
yield ElementTree.fromstring(s)

I would think the ability to process very large numbers of smallish XML
files is pretty common. The use case I'm playing with right now is using a
knowledge base of HTML documents.  Each document in the KB is a single
file, which in my experience is not an unusual configuration.  I'd like to
be able to suck the whole KB into an RDD and then do analysis such as
"which keywords are most commonly used in the KB" or "is there a
correlation between certain user attributes and the KB articles they
request" and so on.

Unfortunately I'm not sure I'm best to answer your question about non-text
InputFormats to support.  I'm fairly new to Hadoop (about 8 months) and I'm
not in the field.  My background is in app servers, ecommerce and business
process management, so that's my bias.  From that perspective, it would be
really useful to be able to work with XML/HTML and CSV files...but are
those what big data analysts are actually using Spark for?  I dunno.  And,
really, if I were actually in those fields, I'd be getting the data from a
DB using Shark, right?

Diana


On Tue, Mar 18, 2014 at 6:27 PM, Matei Zaharia wrote:

> Hi Diana,
>
> This seems to work without the iter() in front if you just return
> treeiterator. What happened when you didn't include that? Treeiterator
> should return an iterator.
>
> Anyway, this is a good example of mapPartitions. It's one where you want
> to view the whole file as one object (one XML here), so you couldn't
> implement this using a flatMap, but you still want to return multiple
> values. The MLlib example you saw needs Python 2.7 because unfortunately
> that is a requirement for our Python MLlib support (see
> http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries).
> We'd like to relax this later but we're using some newer features of NumPy
> and Python. The rest of PySpark works on 2.6.
>
> In terms of the size in memory, here both the string s and the XML tree
> constructed from it need to fit in, so you can't work on very large
> individual XML files. You may be able to use a streaming XML parser instead
> to extract elements from the data in a streaming fashion, without every
> materializing the whole tree.
> http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreaderis
>  one example.
>
> Matei
>
> On Mar 18, 2014, at 7:49 AM, Diana Carroll  wrote:
>
> Well, if anyone is still following this, I've gotten the following code
> working which in theory should allow me to parse whole XML files: (the
> problem was that I can't return the tree iterator directly.  I have to call
> iter().  Why?)
>
> import xml.etree.ElementTree as ET
>
> # two source files, format   name="...">..
> mydata=sc.textFile("file:/home/training/countries*.xml")
>
> def parsefile(iterator):
> s = ''
> for i in iterator: s = s + str(i)
> tree = ET.fromstring(s)
> treeiterator = tree.getiterator("country")
> # why to I have to convert an iterator to an iterator?  not sure but
> required
> return iter(treeiterator)
>
> mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element:
> element.attrib).collect()
>
> The output is what I expect:
> [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]
>
> BUT I'm a bit concerned about the construction of the string "s".  How big
> can my file be before converting it to a string becomes problematic?
>
>
>
> On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll wrote:
>
>> Thanks, Matei.
>>
>> In the context of this discussion, it would seem mapParitions is
>> essential, because it's the only way I'm going to be able to process each
>> file as a whole, in our example of a large number of small XML files which
>> need to be parsed as a whole file because records are not required to be on
>> a single line.
>>
>> The theory makes sense but I'm still utterly lost as to how to implement
>> it.  Unfortunately there's only a single example of the use of
>> mapPartitions in any of the Python example programs, which is the log
>> regression example, which I can't run because it requires Python 2.7 and
>> I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6
>> is unsupported...is it?)
>>
>> I'd really really love to see a real life example of a Python use of
>> mapPartitions.  I do appreciate the very sim

Re: Joining two HDFS files in in Spark

2014-03-19 Thread Yana Kadiyska
Not sure what you mean by "not getting information how to join". If
you mean that you can't see the result I believe you need to collect
the result of the join on the driver, as in

val joinedRdd = enKeyValuePair1.join(enKeyValuePair)
joinedRdd.collect().foreach(println)



On Wed, Mar 19, 2014 at 4:57 AM, Chhaya Vishwakarma
 wrote:
> Hi
>
>
>
> I want to join two files from HDFS using spark shell.
>
> Both the files are tab separated and I want to join on second column
>
>
>
> Tried code
>
> But not giving any output
>
>
>
> val ny_daily=
> sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))
>
>
>
> val ny_daily_split = ny_daily.map(line =>line.split('\t'))
>
>
>
> val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5),
> line(3).toInt))
>
>
>
>
>
> val ny_dividend=
> sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends"))
>
>
>
> val ny_dividend_split = ny_dividend.map(line =>line.split('\t'))
>
>
>
> val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0,
> 4), line(3).toInt))
>
>
>
> enKeyValuePair1.join(enKeyValuePair)
>
>
>
>
>
> But I am not getting any information for how to join files on particular
> column
>
> Please suggest
>
>
>
>
>
>
>
> Regards,
>
> Chhaya Vishwakarma
>
>
>
>
> 
> The contents of this e-mail and any attachment(s) may contain confidential
> or privileged information for the intended recipient(s). Unintended
> recipients are prohibited from taking action on the basis of information in
> this e-mail and using or disseminating the information, and must notify the
> sender and delete it from their system. L&T Infotech will not accept
> responsibility or liability for the accuracy or completeness of, or the
> presence of any virus or disabling code in this e-mail"


How to distribute external executable (script) with Spark ?

2014-03-19 Thread Jaonary Rabarisoa
Hi all,

I'm trying to build an evaluation platform based on Spark. The idea is to
run a blackbox executable (built with C/C++ or some scripting language).
This blackbox takes a set of data as input and outputs some metrics. Since
I have a huge amount of data, I need to distribute the computation and use
tools like MapReduce.

The question is, how do I send these blackbox executables to each node
automatically so they can be called? I need something similar to addJar but
for any kind of file.


Cheers,


Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Seems like an import issue; I ran it with hadoopFile and it worked. I am not
finding the import statement for the TextInputFormat class location in the new API.

Can anybody help?

Thanks
Pariksheet


On 19 March 2014 16:05, Bertrand Dechoux  wrote:

> I don't know the Spark issue but the Hadoop context is clear.
>
> old api -> org.apache.hadoop.mapred
> new api -> org.apache.hadoop.mapreduce
>
> You might only need to change your import.
>
> Regards
>
> Bertrand
>
>
> On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre <
> pbarapa...@gmail.com> wrote:
>
>> Hi,
>>
>> Trying to read HDFS file with TextInputFormat.
>>
>> scala> import org.apache.hadoop.mapred.TextInputFormat
>> scala> import org.apache.hadoop.io.{LongWritable, Text}
>> scala> val file2 =
>> sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
>> 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")
>>
>>
>> This is giving me the error.
>>
>> :14: error: type arguments
>> [org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat]
>> conform to the bounds of none of the overloaded alternatives of
>>  value newAPIHadoopFile: [K, V, F <:
>> org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String, fClass:
>> Class[F], kClass: Class[K], vClass: Class[V], conf:
>> org.apache.hadoop.conf.Configuration)org.apache.spark.rdd.RDD[(K, V)] 
>> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path:
>> String)(implicit km: scala.reflect.ClassTag[K], implicit vm:
>> scala.reflect.ClassTag[V], implicit fm:
>> scala.reflect.ClassTag[F])org.apache.spark.rdd.RDD[(K, V)]
>>val file2 =
>> sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
>> 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")
>>
>>
>> What is correct syntax if I want to use TextInputFormat.
>>
>> Also, how to use customInputFormat. Very silly question but I am not sure
>> how and where to keep jar file containing customInputFormat class.
>>
>> Thanks
>> Pariksheet
>>
>>
>>
>> --
>> Cheers,
>> Pari
>>
>
>


-- 
Cheers,
Pari


Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Bertrand Dechoux
I don't know the Spark issue but the Hadoop context is clear.

old api -> org.apache.hadoop.mapred
new api -> org.apache.hadoop.mapreduce

You might only need to change your import.

Regards

Bertrand


On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre  wrote:

> Hi,
>
> Trying to read HDFS file with TextInputFormat.
>
> scala> import org.apache.hadoop.mapred.TextInputFormat
> scala> import org.apache.hadoop.io.{LongWritable, Text}
> scala> val file2 =
> sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
> 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")
>
>
> This is giving me the error.
>
> :14: error: type arguments
> [org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat]
> conform to the bounds of none of the overloaded alternatives of
>  value newAPIHadoopFile: [K, V, F <:
> org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String, fClass:
> Class[F], kClass: Class[K], vClass: Class[V], conf:
> org.apache.hadoop.conf.Configuration)org.apache.spark.rdd.RDD[(K, V)] 
> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path:
> String)(implicit km: scala.reflect.ClassTag[K], implicit vm:
> scala.reflect.ClassTag[V], implicit fm:
> scala.reflect.ClassTag[F])org.apache.spark.rdd.RDD[(K, V)]
>val file2 =
> sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
> 192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")
>
>
> What is correct syntax if I want to use TextInputFormat.
>
> Also, how to use customInputFormat. Very silly question but I am not sure
> how and where to keep jar file containing customInputFormat class.
>
> Thanks
> Pariksheet
>
>
>
> --
> Cheers,
> Pari
>


Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Pariksheet Barapatre
Hi,

Trying to read HDFS file with TextInputFormat.

scala> import org.apache.hadoop.mapred.TextInputFormat
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val file2 =
sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")


This is giving me the error.

<console>:14: error: type arguments
[org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat]
conform to the bounds of none of the overloaded alternatives of
 value newAPIHadoopFile: [K, V, F <:
org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String, fClass:
Class[F], kClass: Class[K], vClass: Class[V], conf:
org.apache.hadoop.conf.Configuration)org.apache.spark.rdd.RDD[(K, V)] 
[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path:
String)(implicit km: scala.reflect.ClassTag[K], implicit vm:
scala.reflect.ClassTag[V], implicit fm:
scala.reflect.ClassTag[F])org.apache.spark.rdd.RDD[(K, V)]
   val file2 =
sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://
192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")


What is correct syntax if I want to use TextInputFormat.

Also, how to use customInputFormat. Very silly question but I am not sure
how and where to keep jar file containing customInputFormat class.

Thanks
Pariksheet



-- 
Cheers,
Pari


Joining two HDFS files in in Spark

2014-03-19 Thread Chhaya Vishwakarma
Hi,

I want to join two files from HDFS using the Spark shell.
Both files are tab separated and I want to join on the second column.

I tried the code below, but it is not giving any output.

val ny_daily= 
sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_daily"))

val ny_daily_split = ny_daily.map(line =>line.split('\t'))

val enKeyValuePair = ny_daily_split.map(line => (line(0).substring(0, 5), 
line(3).toInt))


val ny_dividend= 
sc.parallelize(List("hdfs://localhost:8020/user/user/NYstock/NYSE_dividends"))

val ny_dividend_split = ny_dividend.map(line =>line.split('\t'))

val enKeyValuePair1 = ny_dividend_split.map(line => (line(0).substring(0, 4), 
line(3).toInt))

enKeyValuePair1.join(enKeyValuePair)


But I am not finding any information on how to join files on a particular column.
Please suggest.



Regards,
Chhaya Vishwakarma



The contents of this e-mail and any attachment(s) may contain confidential or 
privileged information for the intended recipient(s). Unintended recipients are 
prohibited from taking action on the basis of information in this e-mail and 
using or disseminating the information, and must notify the sender and delete 
it from their system. L&T Infotech will not accept responsibility or liability 
for the accuracy or completeness of, or the presence of any virus or disabling 
code in this e-mail"


What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread 林武康
Hi, can anyone tell me about the lifecycle of an RDD? I searched through the 
official website and still can't figure it out. Can I use an RDD in some stages 
and then destroy it in order to release memory, because no later stages will use 
that RDD any more? Is it possible?

Thanks!

Sincerely 
Lin wukang

Re: Connect Exception Error in spark interactive shell...

2014-03-19 Thread Sai Prasanna
Mayur,

While reading a local file (one that is not in HDFS) through the Spark shell, does
HDFS need to be up and running?


On Tue, Mar 18, 2014 at 9:46 PM, Mayur Rustagi wrote:

> Your hdfs is down. Probably forgot to format namenode.
> check if namenode is running
>ps -aef|grep Namenode
> if not & data in hdfs is not critical
> hadoop namenode -format
> & restart hdfs
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Tue, Mar 18, 2014 at 5:59 AM, Sai Prasanna wrote:
>
>> Hi ALL !!
>>
>> In the interactive spark shell i get the following error.
>> I just followed the steps of the video "First steps with spark - spark
>> screen cast #1" by andy konwinski...
>>
>> Any thoughts ???
>>
>> scala> val textfile = sc.textFile("README.md")
>> textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
>> :12
>>
>> scala> textfile.count
>> java.lang.RuntimeException: java.net.ConnectException: Call to master/
>> 192.168.1.11:9000 failed on connection exception:
>> java.net.ConnectException: Connection refused
>>  at
>> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:546)
>> at
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
>>  at
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
>> at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
>>  at
>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:439)
>> at
>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
>>  at
>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:112)
>> at scala.Option.map(Option.scala:133)
>>  at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:112)
>> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:134)
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
>>  at scala.Option.getOrElse(Option.scala:108)
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
>>  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
>> at scala.Option.getOrElse(Option.scala:108)
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:886)
>>  at org.apache.spark.rdd.RDD.count(RDD.scala:698)
>> at <init>(<console>:15)
>> at <init>(<console>:20)
>> at <init>(<console>:22)
>> at <init>(<console>:24)
>> at <init>(<console>:26)
>> at .<init>(<console>:30)
>> at .<clinit>(<console>)
>> at .<init>(<console>:11)
>> at .<clinit>(<console>)
>> at $export(<console>)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
>>  at
>> org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:897)
>> at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
>>  at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
>> at java.lang.Thread.run(Thread.java:744)
>> Caused by: java.net.ConnectException: Call to master/192.168.1.11:9000 failed
>> on connection exception: java.net.ConnectException: Connection refused
>>  at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1075)
>>  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>> at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
>>  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>>  at
>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
>> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>>  at
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
>> at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:542)
>>  ... 39 more
>> Caused by: java.net.ConnectException: Connection refused
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>  at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>  at org.apache.hadoop.net.NetUtils.connect(N

Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
To document this, it would be nice to clarify what environment
variables should be used to set which Java system properties, and what
type of process they affect.  I'd be happy to start a page if you can
point me to the right place:

SPARK_JAVA_OPTS:
  -Dspark.executor.memory can be set on the machine running the driver
(typically the master host) and will affect the memory available to
the Executor running on a slave node
  -D

SPARK_DAEMON_OPTS:
  

On Wed, Mar 19, 2014 at 12:48 AM, Jim Blomo  wrote:
> Thanks for the suggestion, Matei.  I've tracked this down to a setting
> I had to make on the Driver.  It looks like spark-env.sh has no impact
> on the Executor, which confused me for a long while with settings like
> SPARK_EXECUTOR_MEMORY.  The only setting that mattered was setting the
> system property in the *driver* (in this case pyspark/shell.py) or
> using -Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*.  I'm
> not sure how this varies from 0.9.0 release, but it seems to work on
> SNAPSHOT.
>
> On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia  
> wrote:
>> Try checking spark-env.sh on the workers as well. Maybe code there is
>> somehow overriding the spark.executor.memory setting.
>>
>> Matei
>>
>> On Mar 18, 2014, at 6:17 PM, Jim Blomo  wrote:
>>
>> Hello, I'm using the Github snapshot of PySpark and having trouble setting
>> the worker memory correctly. I've set spark.executor.memory to 5g, but
>> somewhere along the way Xmx is getting capped to 512M. This was not
>> occurring with the same setup and 0.9.0. How many places do I need to
>> configure the memory? Thank you!
>>
>>


Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
Thanks for the suggestion, Matei.  I've tracked this down to a setting
I had to make on the Driver.  It looks like spark-env.sh has no impact
on the Executor, which confused me for a long while with settings like
SPARK_EXECUTOR_MEMORY.  The only setting that mattered was setting the
system property in the *driver* (in this case pyspark/shell.py) or
using -Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*.  I'm
not sure how this varies from 0.9.0 release, but it seems to work on
SNAPSHOT.

On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia  wrote:
> Try checking spark-env.sh on the workers as well. Maybe code there is
> somehow overriding the spark.executor.memory setting.
>
> Matei
>
> On Mar 18, 2014, at 6:17 PM, Jim Blomo  wrote:
>
> Hello, I'm using the Github snapshot of PySpark and having trouble setting
> the worker memory correctly. I've set spark.executor.memory to 5g, but
> somewhere along the way Xmx is getting capped to 512M. This was not
> occurring with the same setup and 0.9.0. How many places do I need to
> configure the memory? Thank you!
>
>


Re: Unable to read HDFS file -- SimpleApp.java

2014-03-19 Thread Prasad
Check this thread out,
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p2807.html
 
-- you have to remove the conflicting akka and protobuf versions.

Thanks
Prasad.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-read-HDFS-file-SimpleApp-java-tp1813p2853.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.