Re: how to save RDD partitions in different folders?

2014-04-07 Thread dmpour23
Can you provide an example?
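For reference, one commonly used approach (a sketch, not from this thread; paths, class and app names are placeholders) is a custom MultipleTextOutputFormat that routes each key's records into its own sub-directory:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._

// Writes records for key "a" under out/a/, key "b" under out/b/, and so on.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name          // e.g. a/part-00000
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()                 // keep only the value in the output files
}

object SaveByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "save-by-key")
    val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2"), ("a", "3")))
    pairs
      .partitionBy(new HashPartitioner(2))   // group each key's records together
      .saveAsHadoopFile("out", classOf[String], classOf[String], classOf[KeyBasedOutput])
    sc.stop()
  }
}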



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754p3823.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


hang on sorting operation

2014-04-07 Thread Stuart Zakon
I am seeing a small standalone cluster (master, slave) hang when I reach a 
certain memory threshold, but I cannot detect how to configure memory to avoid 
this.
I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this 
allocated, but it does not help.
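For reference, SPARK_DAEMON_MEMORY only sizes the standalone master/worker daemons themselves; the executor JVMs that actually run tasks are sized by spark.executor.memory, which can also be set from PySpark's SparkConf. A minimal Scala sketch (not from the thread; the master URL and sizes are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory controls the heap of each executor that runs tasks;
// SPARK_DAEMON_MEMORY only affects the master and worker daemon processes.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("phrase-counts")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)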

The reduce is by key to get the counts by key:
        rdd = sc.parallelize(self.phrases)

        # do a distributed count using reduceByKey
        counts = rdd.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

        # reverse the (key, count) pairs into (count, key) and then sort in 
descending order
        sorted = counts.map(lambda (key, count): (count, key)).sortByKey(False)

Below is the log to the point of hanging:

14/04/06 19:39:15 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 
(PairwiseRDD[2] at reduceByKey)
14/04/06 19:39:15 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/04/06 19:39:15 INFO SparkDeploySchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkExecutor@localhost:64370/user/Executor#-2031138316] with 
ID 0
14/04/06 19:39:15 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 
0: localhost (PROCESS_LOCAL)
14/04/06 19:39:15 INFO TaskSetManager: Serialized task 1.0:0 as 10417848 bytes 
in 18 ms
14/04/06 19:39:15 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor 
0: localhost (PROCESS_LOCAL)
14/04/06 19:39:15 INFO TaskSetManager: Serialized task 1.0:1 as 10571697 bytes 
in 13 ms
14/04/06 19:39:15 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
block manager localhost:64375 with 294.9 MB RAM
14/04/06 19:39:16 INFO TaskSetManager: Finished TID 0 in 1397 ms on localhost 
(progress: 0/2)
14/04/06 19:39:16 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)


When I interrupt the running program, here is the stack trace, which appears 
stuck after the reduce in the sorting by count in descending order:

    sorted = counts.map(lambda (key, count): (count, key)).sortByKey(False)
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 361, in sortByKey
    rddSize = self.count()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 542, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 533, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 499, in reduce
    vals = self.mapPartitions(func).collect()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 463, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 535, in __call__
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 363, in send_command
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 472, in send_command
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
KeyboardInterrupt

Is there a reason why the sorting gets stuck?  I can easily remove the problem 
by reducing the size of the RDD to below the threshold of about 800,000 items 
before the reduce is run.
It would help to see where resources like memory are depleted, but this does 
not show up in the console.

Many thanks,
Stuart Zakon

Recommended way to develop spark application with both java and python

2014-04-07 Thread Wush Wu
Dear all,

We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are
familiar with Python, but some features are developed in Java. I am
looking for a way to integrate Java and Python on Spark.

I notice that the initialization of PySpark does not include a field to
distribute jar files to the slaves. After exploring the source code and doing
some hacking, I could control the Java SparkContext object through Py4J, but
the jar files are not delivered to the slaves. Moreover, it seems that Spark
launches the executors through the Spark home when using PySpark, but through
spark.executor.uri when using Scala.

Is there a recommended way to develop a Spark application with both
Java/Scala and Python? Should I suggest that my team unify on one language?

Thanks!


Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi,

Any thoughts on this? Thanks.

-Suren



On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 Hi,

 I know if we call persist with the right options, we can have Spark
 persist an RDD's data on disk.

 I am wondering what happens in intermediate operations that could
 conceivably create large collections/Sequences, like GroupBy and shuffling.

 Basically, one part of the question is when is disk used internally?

 And is calling persist() on the RDD returned by such transformations what
 lets it know to use disk in those situations? Trying to understand if
 persist() is applied during the transformation or after it.

 Thank you.


 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@velos.io
 W: www.velos.io




-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark,

Should I assume that Shark users should not use the Shark APIs since there
is no documentation for them? If there is documentation, can you point it
out?

Best Regards,

Jerry


On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote:

 Hello everyone,

 I have successfully installed Shark 0.9 and Spark 0.9 in standalone mode
 in a cluster of 6 nodes for testing purposes.

 I would like to use Shark API in Spark programs. So far I could only find
 the following:

 $ ./bin/shark-shell
 scala> val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")
 scala> println(youngUsers.count)
 ...
 scala> val featureMatrix = youngUsers.map(extractFeatures(_))
 scala> kmeans(featureMatrix)

 Is there a more complete sample code to start a program using Shark API in
 Spark?

 Thanks!

 Jerry



Require some clarity on partitioning

2014-04-07 Thread Sanjay Awatramani
Hi,

I was going through Matei's Advanced Spark presentation at 
https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions.
The slides for the video are at 
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf

The PageRank example introduces partitioning in the below way:
val ranks = // RDD of (url, rank) pairs
val links = sc.textFile(...).map(...).partitionBy(new HashPartitioner(8))

However later on, it is said that
1) Any shuffle operation on two RDDs will take on the Partitioner of one of 
them, if one is set
Question1: Could we have applied partitionBy on the ranks RDD and have the same 
result/performance ?
2) Otherwise, by default use HashPartitioner
Question2: If partitionBy applies a HashPartitioner in this example, could we 
simply not set any partitioner and rely on the default HashPartitioner to 
achieve the same result/performance?
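For illustration, a Scala sketch of the quoted rule (not from the presentation; the file name and partition count are placeholders, and each input line is assumed to be a "url dest" pair): once one side of a join carries a partitioner, operations such as mapValues preserve it, and joining two co-partitioned RDDs does not re-shuffle the already partitioned side.

import org.apache.spark.HashPartitioner

val links = sc.textFile("links.txt")
  .map { line => val parts = line.split("\\s+", 2); (parts(0), parts(1)) }
  .partitionBy(new HashPartitioner(8))   // hash-partition by URL
  .cache()

val ranks = links.mapValues(_ => 1.0)    // inherits links' HashPartitioner(8)
val contribs = links.join(ranks)         // co-partitioned: links is not re-shuffled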

I had another question unrelated to this presentation.
Question3: If my processing is something like this:
rdd3 = rdd1.join(rdd2)
rdd4 = rdd3.map((k, (v1, v2)) => (v1, k))
rdd6 = rdd4.join(rdd5)
rdd6.saveAsTextFile("out.txt")

Would I benefit from partitioning? Unlike the PageRank example, I do not have to 
join/shuffle the same RDD or key more than once.

Regards,
Sanjay

Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Hi All,

I wanted to get Spark on YARN up and running.

I did *SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly*

Then I ran
*SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.1-incubating.jar
MASTER=yarn-client ./spark-shell*

I have SPARK_HOME and YARN_CONF_DIR/HADOOP_CONF_DIR set. Still I get the
following error.

Any clues ??

*ERROR:*
...
Using Scala version 2.9.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Initializing interpreter...
Creating SparkContext...
java.lang.NullPointerException
    at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:115)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
    at org.apache.spark.deploy.yarn.Client$.populateHadoopClasspath(Client.scala:489)
    at org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:510)
    at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:327)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:90)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:71)
    at org.apache.spark.scheduler.cluster.ClusterScheduler.start(ClusterScheduler.scala:119)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:273)
    at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:862)
    at <init>(<console>:10)
    at <init>(<console>:22)
    at <init>(<console>:24)
    at .<init>(<console>:28)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $export(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
    at org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:897)
    at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
    at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
    at java.lang.Thread.run(Thread.java:744)
...


Re: How to create a RPM package

2014-04-07 Thread Will Benton
 For issue #2 I was concerned that the build and packaging had to be
 internal. So I am using the already packaged make-distribution.sh
 (modified to use a Maven build) to create a tarball, which I then package
 using an RPM spec file.

Hi Rahul, so the issue for downstream operating system distributions is that 
they basically need to be able to build everything from source.  So while 
you're building Spark itself locally, a package that appears in Fedora or 
Debian will go a bit further and also build all of the library dependencies, 
compilers, etc., from sources (possibly with some exceptions for bootstrapping 
compilers that can't be built without themselves) rather than pulling down 
binaries from a public Maven or Ivy repository.

 Although on a side note, it would be interesting to learn how the list of
 files does not need to be maintained in the spec file (the spec file that
 Christophe attached was using an explicit list).

If you take a look at the spec that is in Fedora 
(http://pkgs.fedoraproject.org/cgit/spark.git/tree/spark.spec, starting at line 
257 in the current revision), you'll see the following %files section:

%files -f .mfiles
%dir %{_javadir}/%{name}

%doc LICENSE README.md

%files javadoc
%{_javadocdir}/%{name}
%doc LICENSE

As you can see, this list is pretty generic (and obviously doesn't include all 
the files in the package).  The list of JAR and POM files is provided in a file 
called .mfiles (which is automatically generated by macros) and we just have to 
specify the directories that the package owns (which aren't picked up by the 
macros at this time), the license, and the README.



best,
wb


PySpark SocketConnect Issue in Cluster

2014-04-07 Thread Surendranauth Hiraman
Hi,

We have a situation where a PySpark script works fine as a local process
(local URL) on the master and the worker nodes, which would indicate that
all Python dependencies are set up properly on each machine.

But when we try to run the script at the cluster level (using the master's
URL), it fails partway through the flow on a GroupBy with a SocketConnect
error and Python crashes.

This is on ec2 using the AMI. This doesn't seem to be an issue of the
master not seeing the workers, since they show up in the web ui.

Also, we can see the job running on the cluster until it reaches the
GroupBy transform step, which is when we get the SocketConnect error.

Any ideas?

-Suren


SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-)

1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're trying to get a sense of how
the processing is handled behind the scenes with respect to disk.

2. When else is disk used internally?

Any pointers are appreciated.
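For reference, a minimal Scala sketch (not from the thread; names are placeholders) illustrating that persist() is lazy: it only records the storage level, and partitions are stored as they are computed by the first action.

import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i))
// persist() only marks the RDD; nothing is written until an action runs.
val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)
grouped.count()   // first action: partitions are computed, cached, and spilled to disk if needed
grouped.take(5)   // later actions reuse the persisted partitions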

-Suren




On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 Hi,

 Any thoughts on this? Thanks.

 -Suren



 On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 Hi,

 I know if we call persist with the right options, we can have Spark
 persist an RDD's data on disk.

 I am wondering what happens in intermediate operations that could
 conceivably create large collections/Sequences, like GroupBy and shuffling.

 Basically, one part of the question is when is disk used internally?

  And is calling persist() on the RDD returned by such transformations what
  lets it know to use disk in those situations? Trying to understand if
  persist() is applied during the transformation or after it.

 Thank you.


 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
  E: suren.hiraman@velos.io
 W: www.velos.io




 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
  E: suren.hiraman@velos.io
 W: www.velos.io




-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: reduceByKeyAndWindow Java

2014-04-07 Thread Eduardo Costa Alfaia

Hi TD
Could you explain this part of the code to me?

.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
  );
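For context, a rough Scala equivalent (not from the thread; host, port and paths are placeholders): the first function adds counts for batches entering the 5-minute window, the second subtracts counts for batches leaving it, so the window is updated incrementally every second instead of being recomputed from scratch. The inverse form needs a checkpoint directory.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext("local[2]", "windowed-counts", Seconds(1))
ssc.checkpoint("checkpoints")            // required when an inverse function is used
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val windowedCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,             // counts entering the window
  (a: Int, b: Int) => a - b,             // counts leaving the window
  Seconds(60 * 5),                       // window length: 5 minutes
  Seconds(1))                            // slide interval: 1 second
windowedCounts.print()
ssc.start()
ssc.awaitTermination()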

Thanks

On 4/4/14, 22:56, Tathagata Das wrote:
I haven't really compiled the code, but it looks good to me. Why? Is 
there any problem you are facing?


TD


On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia 
e.costaalf...@unibs.it wrote:



Hi guys,

I would like to know if this part of the code is right for using a window.

JavaPairDStream<String, Integer> wordCounts = words.map(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    }).reduceByKeyAndWindow(
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 + i2; }
      },
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 - i2; }
      },
      new Duration(60 * 5 * 1000),
      new Duration(1 * 1000)
    );



Thanks

-- 
Privacy notice: http://www.unibs.it/node/8155






--
Privacy notice: http://www.unibs.it/node/8155


AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi all,
On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
Also, it is the default user for the Spark-EC2 script.

Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
instead of 'root'.

I can see that the Spark-EC2 script allows you to specify which user to log
in with, but even when I change this, the script fails for various reasons.
And the output SEEMS to show that the script is still based on the specified
user's home directory being '/root'.

Am I using this script wrong?
Has anyone had success with this 'ec2-user' user?
Any ideas?

Please and thank you,
Marco.


SparkContext.addFile() and FileNotFoundException

2014-04-07 Thread Thierry Herrmann
Hi,
I'm trying to use SparkContext.addFile() to propagate a file to worker
nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the
master). I don't have HDFS or any distributed file system. Just playing with
basic stuff.
Here's the code in my driver (actually spark-shell running on the master
node). In the current directory I have file spam.data
The following commands are taken from the book
http://www.packtpub.com/fast-data-processing-with-spark/book , page 44


*scala> sc.addFile("spam.data")*

14/04/07 14:03:48 INFO Utils: Copying
/home/thierry/dev/spark-samples/packt-book/LoadSaveExample/spam.data to
/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data
14/04/07 14:03:49 INFO SparkContext: Added file spam.data at
http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972

*scala> import org.apache.spark.SparkFiles*
import org.apache.spark.SparkFiles

*scala> val inFile = sc.textFile(SparkFiles.get("spam.data"))*

14/04/07 14:05:00 INFO MemoryStore: ensureFreeSpace(138763) called with
curMem=0, maxMem=311387750
14/04/07 14:05:00 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 135.5 KB, free 296.8 MB)
inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
console:13


Now trigger some action to make the worker work.

*scala> inFile.count()*


In the stderr.log of the app on the worker :

14/04/07 14:05:33 INFO Executor: Fetching
http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972
14/04/07 14:05:33 INFO Utils: Fetching
http://192.168.1.51:59008/files/spam.data to
/tmp/fetchFileTemp435286457200696761.tmp

So apparently the file was successfully downloaded from the driver to the
worker. The jar of the application is also successfully downloaded.
But a bit later, in the same stderr.log:

14/04/07 14:05:34 INFO HttpBroadcast: Reading broadcast variable 0 took
0.352334273 s
14/04/07 14:05:34 INFO HadoopRDD: Input split:
file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data:0+349170
14/04/07 14:05:34 ERROR Executor: Exception in task ID 0
java.io.FileNotFoundException: File
file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data does not
exist
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:137)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at
org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:106)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

It looks like the file is looked for in:

/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data

which is the temp location on the master node where the driver is running,
while on the worker node it was downloaded to
/tmp/fetchFileTemp435286457200696761.tmp

I see hadoop related classes in the stack trace. Does it mean HDFS is used ?
If that's the case, is it because I'm using the precompiled
spark-0.9.0-incubating-bin-hadoop2 ?

I couldn't find an answer, either in the Spark user list archives, by
googling, or in the Spark guides (sorry for what is probably a very basic
question)
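For reference, a minimal sketch (not from the thread) of the pattern that works without a shared filesystem: call SparkFiles.get() inside the tasks, where it resolves to each executor's own download directory, instead of on the driver.

import scala.io.Source
import org.apache.spark.SparkFiles

sc.addFile("spam.data")
// SparkFiles.get() is evaluated on the executors here, so each task opens the
// copy of spam.data that was fetched to its own node.
val counts = sc.parallelize(1 to 4, 4).map { _ =>
  Source.fromFile(SparkFiles.get("spam.data")).getLines().size
}.collect()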




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-addFile-and-FileNotFoundException-tp3844.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
That work is under submission at an academic conference and will be made
available if/when the paper is published.

In terms of algorithms for hyperparameter tuning, we consider Grid Search,
Random Search, a couple of older derivative-free optimization methods, and
a few newer methods - TPE (aka HyperOpt from James Bergstra), SMAC (from
Frank Hutter's group), and Spearmint (Jasper Snoek's method) - the short
answer is that in our hands Random Search works surprisingly well for the
low-dimensional problems we looked at, but TPE and SMAC perform slightly
better. I've got a private branch with TPE (as well as random and grid
search) integrated with MLI, but the code is research quality right now and
not extremely general.

We're actively working on bringing these things up to snuff for a proper
open source release.




On Fri, Apr 4, 2014 at 11:28 AM, Yi Zou yi.zou.li...@gmail.com wrote:

 Hi, Evan,

 Just noticed this thread, do you mind sharing more details regarding
 algorithms targetted at hyperparameter tuning/model selection? or a link
 to dev git repo for that work.

 thanks,
 yi


 On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.com wrote:

 Targeting 0.9.0 should work out of the box (just a change to the
 build.sbt) - I'll push some changes I've been sitting on to the public repo
 in the next couple of days.


 On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote:

 Thanks for the update Evan! In terms of using MLI, I see that the Github
 code is linked to Spark 0.8; will it not work with 0.9 (which is what I
 have set up) or higher versions?


  On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] wrote:

 Hi there,

 MLlib is the first component of MLbase - MLI and the higher levels of
 the stack are still being developed. Look for updates in terms of our
 progress on the hyperparameter tuning/model selection problem in the next
 month or so!

 - Evan


  On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote:

 Hi Nan,

 I was actually referring to MLI/MLBase (http://www.mlbase.org); is
 this being actively developed?

 I'm familiar with mllib and have been looking at its documentation.

 Thanks!


   On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] wrote:

  mllib has been part of Spark distribution (under mllib directory),
 also check http://spark.apache.org/docs/latest/mllib-guide.html

 and for JIRA, because of the recent migration to apache JIRA, I think
 all mllib-related issues should be under the Spark umbrella,
 https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

 --
 Nan Zhu

 On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the
 github repo is lying dormant (https://github.com/amplab/MLI) and
 JIRA has had no activity in the last 30 days (
 https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 Is the plan to add a lot of this into mllib itself without needing a
 separate API?

 Thanks!

  --
  View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.






  --
  View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Yana Kadiyska
I might be wrong here but I don't believe it's discouraged. Maybe part
of the reason there aren't a lot of examples is that sql2rdd returns an
RDD (a TableRDD, that is:
https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala).
I haven't done anything too complicated yet but my impression is that
almost any Spark example of manipulating RDDs should apply from
that line onwards.

Are you asking for samples what to do with the RDD once you get it or
how to get a SharkContext from a standalone program?

Also, my reading of a recent email on this list is that the Shark API will
be largely superseded by a more general Spark SQL API in 1.0
(http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html).
So if you're just starting out and you don't have short-term needs,
that might be a better place to start...

On Mon, Apr 7, 2014 at 9:14 AM, Jerry Lam chiling...@gmail.com wrote:
 Hi Shark,

 Should I assume that Shark users should not use the shark APIs since there
 are no documentations for it? If there are documentations, can you point it
 out?

 Best Regards,

 Jerry


 On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote:

 Hello everyone,

 I have successfully installed Shark 0.9 and Spark 0.9 in standalone mode
 in a cluster of 6 nodes for testing purposes.

 I would like to use Shark API in Spark programs. So far I could only find
 the following:

  $ ./bin/shark-shell
  scala> val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")
  scala> println(youngUsers.count)
  ...
  scala> val featureMatrix = youngUsers.map(extractFeatures(_))
  scala> kmeans(featureMatrix)

 Is there a more complete sample code to start a program using Shark API in
 Spark?

 Thanks!

 Jerry




Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi Shivaram,

OK so let's assume the script CANNOT take a different user and that it must
be 'root'. The typical workaround is as you said, allow the ssh with the
root user. Now, don't laugh, but, this worked last Friday, but today
(Monday) it no longer works. :D Why? ...

...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
'authorized_keys' file is always overwritten. This means the workaround
doesn't work anymore! I would LOVE for someone to verify this.

Just to point out, I am trying to make this work with a paravirtual
instance and not an HVM instance.

Please and thanks,
Marco.


On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access, and a lot
 of internal scripts assume the user's home directory is hard coded as
 /root.   However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS that the script is still based on the
 specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.





Driver Out of Memory

2014-04-07 Thread Eduardo Costa Alfaia

Hi Guys,

I would like to understand why the driver's RAM goes down. Does the 
processing occur only in the workers?

Thanks
# Start Tests
computer1(Worker/Source Stream)
 23:57:18 up 12:03,  1 user,  load average: 0.03, 0.31, 0.44
 total       used       free     shared    buffers     cached
Mem:  3945   1084   2860  0 44827
-/+ buffers/cache:212   3732
Swap:0  0  0
computer8 (Driver/Master)
 23:57:18 up 11:53,  5 users,  load average: 0.43, 1.19, 1.31
 total       used       free     shared    buffers     cached
Mem:  5897   4430   1466  0 384   2662
-/+ buffers/cache:   1382   4514
Swap:0  0  0
computer10(Worker/Source Stream)
 23:57:18 up 12:02,  1 user,  load average: 0.55, 1.34, 0.98
 total       used       free     shared    buffers     cached
Mem:  5897564   5332  0 18358
-/+ buffers/cache:187   5709
Swap:0  0  0
computer11(Worker/Source Stream)
 23:57:18 up 12:02,  1 user,  load average: 0.07, 0.19, 0.29
 total       used       free     shared    buffers     cached
Mem:  3945603   3342  0 54355
-/+ buffers/cache:193   3751
Swap:0  0  0

 After 2 Minutes

computer1
 00:06:41 up 12:12,  1 user,  load average: 3.11, 1.32, 0.73
 total       used       free     shared    buffers     cached
Mem:  3945   2950994  0 46 1095
-/+ buffers/cache:   1808   2136
Swap:0  0  0
computer8(Driver/Master)
 00:06:41 up 12:02,  5 users,  load average: 1.16, 0.71, 0.96
 total       used       free     shared    buffers     cached
Mem:  5897   5191705  0 385   2792
-/+ buffers/cache:   2014   3882
Swap:0  0  0
computer10
 00:06:41 up 12:11,  1 user,  load average: 2.02, 1.07, 0.89
 total       used       free     shared    buffers     cached
Mem:  5897   2567   3329  0 21647
-/+ buffers/cache:   1898   3998
Swap:0  0  0
computer11
 00:06:42 up 12:12,  1 user,  load average: 3.96, 1.83, 0.88
 total       used       free     shared    buffers     cached
Mem:  3945   3542402  0 57 1099
-/+ buffers/cache:   2385   1559
Swap:0  0  0


--
Privacy notice: http://www.unibs.it/node/8155


Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Hmm -- That is strange. Can you paste the command you are using to launch
the instances ? The typical workflow is to use the spark-ec2 wrapper script
using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html

Shivaram


On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
 'authorized_keys' file is always overwritten. This means the workaround
 doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume have the user's home directory hard coded as
 /root.   However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS that the script is still based on the
 specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.






CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
Hello, Spark community!  My name is Paul. I am a Spark newbie, evaluating 
version 0.9.0 without any Hadoop at all, and need some help. I run into the 
following error with the StatefulNetworkWordCount example (and similarly in my 
prototype app, when I use the updateStateByKey operation).  I get this when 
running against my small cluster, but not (so far) against local[2].

61904 [spark-akka.actor.default-dispatcher-2] ERROR 
org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming 
job 1396905956000 ms.0
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[310] at take at 
DStream.scala:586(0) has different number of partitions than original RDD 
MapPartitionsRDD[309] at mapPartitions at StateDStream.scala:66(2)
at 
org.apache.spark.rdd.RDDCheckpointData.doCheckpoint(RDDCheckpointData.scala:99)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:989)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:870)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:884)
at org.apache.spark.rdd.RDD.take(RDD.scala:844)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:586)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:585)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)


Please let me know what other information would be helpful; I didn't find any 
question submission guidelines.

Thanks,
Paul


Re: Creating a SparkR standalone job

2014-04-07 Thread pawan kumar
Thanks Shivaram! Will give it a try and let you know.

Regards,
Pawan Venugopal


On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 You can create standalone jobs in SparkR as just R files that are run
 using the sparkR script. These commands will be sent to a Spark cluster and
 the examples on the SparkR repository (
 https://github.com/amplab-extras/SparkR-pkg#examples-unit-tests) are in
 fact standalone jobs.

 However I don't think that will completely solve your use case of using
 Streaming + R. We don't yet have a way to call R functions from Spark's
 Java or Scala API. So right now one thing you can try is to save data from
 SparkStreaming to HDFS and then run a SparkR job which reads in the file.
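A minimal Scala sketch of that handoff (not from the thread; host, port and paths are placeholders): each streaming batch is written to HDFS as text, and a separate SparkR job can later read the same directory back with textFile().

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("spark://master-host:7077", "tweet-dump", Seconds(10))
val tweets = ssc.socketTextStream("stream-host", 9999)
// One output directory per batch, e.g. hdfs://namenode:8020/twitter/batch-<time>.txt
tweets.saveAsTextFiles("hdfs://namenode:8020/twitter/batch", "txt")
ssc.start()
ssc.awaitTermination()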

 Regarding the other idea of calling R from Scala -- it might be possible
 to do that in your code if the classpath etc. is setup correctly. I haven't
 tried it out though, but do let us know if you get it to work.

 Thanks
 Shivaram


 On Mon, Apr 7, 2014 at 2:21 PM, pawan kumar pkv...@gmail.com wrote:

 Hi,

  Is it possible to create a standalone job in Scala using SparkR? If
  possible, can you provide me with information on the setup process
  (like the dependencies in SBT and where to include the JAR files)?

 This is my use-case:

 1. I have a Spark Streaming standalone Job running in local machine which
 streams twitter data.
 2. I have an R script which performs Sentiment Analysis.

 I am looking for an optimal way where I could combine these two
 operations into a single job and run using SBT Run command.

 I came across this document which talks about embedding R into scala (
 http://dahl.byu.edu/software/jvmr/dahl-payne-uppalapati-2013.pdf) but
 was not sure if that would work well within the spark context.

 Thanks,
 Pawan Venugopal





job offering

2014-04-07 Thread Rault, Severan
Hi,

I am looking for users of Spark to join my teams here at Amazon. If you are 
reading this, you probably qualify.
I am looking for developers of ANY level, but with an interest in Spark. My 
teams are leveraging Spark to solve real business scenarios.
If you are interested, just shoot me a note and I will tell you more about the 
opportunities here in Seattle.

Best,

Severan Rault

Director Software development
Amazon


Re: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Tathagata Das
Few things that would be helpful.

1. Environment settings - you can find them on the environment tab in the
Spark application UI
2. Are you setting the HDFS configuration correctly in your Spark program?
For example, can you write an HDFS file from a Spark program (say
spark-shell) to your HDFS installation and read it back into Spark (i.e.,
create an RDD)? You can test this by writing an RDD as a text file from the
shell, and then trying to read it back from another shell.
3. If that works, then let's try explicitly checkpointing an RDD. To do this
you can take any RDD and do the following.

myRDD.checkpoint()
myRDD.count()

If there is some issue, then this should reproduce the above error.
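A minimal sketch of step 3 (not from the thread; the HDFS URL is a placeholder). Note that rdd.checkpoint() needs sc.setCheckpointDir() pointing at a directory that every node can reach; with a driver-local path the executors write their partitions to their own local filesystems, which typically reproduces the partition-count mismatch above.

sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")
val rdd = sc.parallelize(1 to 1000, 4)
rdd.checkpoint()
rdd.count()   // materializes the RDD and writes the checkpoint files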

TD


On Mon, Apr 7, 2014 at 3:48 PM, Paul Mogren pmog...@commercehub.com wrote:

 Hello, Spark community!  My name is Paul. I am a Spark newbie, evaluating
 version 0.9.0 without any Hadoop at all, and need some help. I run into the
 following error with the StatefulNetworkWordCount example (and similarly in
 my prototype app, when I use the updateStateByKey operation).  I get this
 when running against my small cluster, but not (so far) against local[2].

 61904 [spark-akka.actor.default-dispatcher-2] ERROR
 org.apache.spark.streaming.scheduler.JobScheduler - Error running job
 streaming job 1396905956000 ms.0
 org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[310] at take
 at DStream.scala:586(0) has different number of partitions than original
 RDD MapPartitionsRDD[309] at mapPartitions at StateDStream.scala:66(2)
 at
 org.apache.spark.rdd.RDDCheckpointData.doCheckpoint(RDDCheckpointData.scala:99)
 at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:989)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:855)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:870)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:884)
 at org.apache.spark.rdd.RDD.take(RDD.scala:844)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:586)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:585)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:744)


 Please let me know what other information would be helpful; I didn't find
 any question submission guidelines.

 Thanks,
 Paul



RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
any reason why RDDInfo suddenly became private in SPARK-1132?

we are using it to show users status of rdds


Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
ok yeah we are using StageInfo and TaskInfo too...


On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert,

 Other users have expressed interest for us to expose similar classes too
 (i.e. StageInfo, TaskInfo). In the newest release, they will be available
 as part of the developer API. The particular PR that will change this is:
 https://github.com/apache/spark/pull/274.

 Cheers,
 Andrew


 On Mon, Apr 7, 2014 at 5:05 PM, Koert Kuipers ko...@tresata.com wrote:

 any reason why RDDInfo suddenly became private in SPARK-1132?

 we are using it to show users status of rdds





RE: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
1.:  I will paste the full content of the environment page of the example 
application running against the cluster at the end of this message.
2. and 3.:  Following #2 I was able to see that the count was incorrectly 0 
when running against the cluster, and following #3 I was able to get the 
message:
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[4] at count at 
<console>:15(0) has different number of partitions than original RDD 
MappedRDD[3] at textFile at <console>:12(2)

I think I understand - state checkpoints and other file-exchange operations in 
a Spark cluster require a distributed/shared filesystem, even with just a 
single-host cluster and the driver/shell on a second host. Is that correct?

Thank you,
Paul



Stages
Storage
Environment
Executors
NetworkWordCumulativeCountUpdateStateByKey application UI
Environment
Runtime Information

Name        Value
Java Home   /usr/lib/jvm/jdk1.8.0/jre
Java Version1.8.0 (Oracle Corporation)
Scala Home  
Scala Version   version 2.10.3
Spark Properties

Name        Value
spark.app.name  NetworkWordCumulativeCountUpdateStateByKey
spark.cleaner.ttl   3600
spark.deploy.recoveryMode   ZOOKEEPER
spark.deploy.zookeeper.url  pubsub01:2181
spark.driver.host   10.10.41.67
spark.driver.port   37360
spark.fileserver.urihttp://10.10.41.67:40368
spark.home  /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
spark.httpBroadcast.uri http://10.10.41.67:45440
spark.jars  
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar
spark.masterspark://10.10.41.19:7077
System Properties

Name        Value
awt.toolkit sun.awt.X11.XToolkit
file.encoding   ANSI_X3.4-1968
file.encoding.pkg   sun.io
file.separator  /
java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
java.awt.printerjob sun.print.PSPrinterJob
java.class.version  52.0
java.endorsed.dirs  /usr/lib/jvm/jdk1.8.0/jre/lib/endorsed
java.ext.dirs   /usr/lib/jvm/jdk1.8.0/jre/lib/ext:/usr/java/packages/lib/ext
java.home   /usr/lib/jvm/jdk1.8.0/jre
java.io.tmpdir  /tmp
java.library.path   
java.net.preferIPv4Stacktrue
java.runtime.name   Java(TM) SE Runtime Environment
java.runtime.version1.8.0-b132
java.specification.name Java Platform API Specification
java.specification.vendor   Oracle Corporation
java.specification.version  1.8
java.vendor Oracle Corporation
java.vendor.url http://java.oracle.com/
java.vendor.url.bug http://bugreport.sun.com/bugreport/
java.version1.8.0
java.vm.infomixed mode
java.vm.nameJava HotSpot(TM) 64-Bit Server VM
java.vm.specification.name  Java Virtual Machine Specification
java.vm.specification.vendorOracle Corporation
java.vm.specification.version   1.8
java.vm.vendor  Oracle Corporation
java.vm.version 25.0-b70
line.separator  
log4j.configuration conf/log4j.properties
os.arch amd64
os.name Linux
os.version  3.5.0-23-generic
path.separator  :
sun.arch.data.model 64
sun.boot.class.path 
/usr/lib/jvm/jdk1.8.0/jre/lib/resources.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/rt.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/sunrsasign.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jsse.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jce.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/charsets.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jfr.jar:/usr/lib/jvm/jdk1.8.0/jre/classes
sun.boot.library.path   /usr/lib/jvm/jdk1.8.0/jre/lib/amd64
sun.cpu.endian  little
sun.cpu.isalist 
sun.io.serialization.extendedDebugInfo  true
sun.io.unicode.encoding UnicodeLittle
sun.java.command
org.apache.spark.streaming.examples.StatefulNetworkWordCount 
spark://10.10.41.19:7077 localhost 
sun.java.launcher   SUN_STANDARD
sun.jnu.encodingANSI_X3.4-1968
sun.management.compiler HotSpot 64-Bit Tiered Compilers
sun.nio.ch.bugLevel 
sun.os.patch.level  unknown
user.countryUS
user.dir/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
user.home   /home/pmogren
user.language   en
user.name   pmogren
user.timezone   America/New_York
Classpath Entries

Resource        Source
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
 System Classpath
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/conf        System Classpath
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar
System Classpath
http://10.10.41.67:40368/jars/spark-examples_2.10-assembly-0.9.0-incubating.jar 
Added By User











From: Tathagata Das [mailto:tathagata.das1...@gmail.com] 
Sent: Monday, April 07, 2014 7:54 PM
To: user@spark.apache.org
Subject: Re: CheckpointRDD has different number of partitions than original RDD

Few things that would be helpful. 

1. Environment settings - you can find them on the environment tab in the Spark 
application UI
2. 

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-07 Thread Francis . Hu
Great!!!

When I built it on another disk formatted as ext4, it now works.

hadoop@ubuntu-1:~$ df -Th
FilesystemType  Size  Used Avail Use% Mounted on
/dev/sdb6 ext4  135G  8.6G  119G   7% /
udev  devtmpfs  7.7G  4.0K  7.7G   1% /dev
tmpfs tmpfs 3.1G  316K  3.1G   1% /run
none  tmpfs 5.0M 0  5.0M   0% /run/lock
none  tmpfs 7.8G  4.0K  7.8G   1% /run/shm
/dev/sda1 ext4  112G  3.7G  103G   4% /faststore
/home/hadoop/.Private ecryptfs  135G  8.6G  119G   7% /home/hadoop

Thanks again, Marcelo Vanzin.


Francis.Hu

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com] 
Sent: Saturday, April 05, 2014 1:13
To: user@spark.apache.org
Subject: Re: java.lang.NoClassDefFoundError: 
scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

Hi Francis,

This might be a long shot, but do you happen to have built spark on an
encrypted home dir?

(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't fail, but
you do get warnings about the long file names.)


On Wed, Apr 2, 2014 at 3:26 AM, Francis.Hu francis...@reachjunction.com wrote:
 I am stuck on a NoClassDefFoundError.  Any help would be appreciated.

 I downloaded the Spark 0.9.0 source, and then ran this command to build it:
 SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly


 java.lang.NoClassDefFoundError:
 scala/tools/nsc/transform/UnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply$5$$anonfun$scala$tools$nsc$transform$UnCurry$UnCurryTransformer$$anonfun$$anonfun$$transformInConstructor$1$1

-- 
Marcelo



Re: trouble with join on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:

 I am running the latest version of PySpark branch-0.9 and having some
 trouble with join.

 One RDD is about 100G (25GB compressed and serialized in memory) with
 130K records, the other RDD is about 10G (2.5G compressed and
 serialized in memory) with 330K records.  I load each RDD from HDFS,
 invoke keyBy to key each record, and then attempt to join the RDDs.

 The join consistently crashes at the beginning of the reduce phase.
 Note that when joining the 10G RDD to itself there is no problem.

 Prior to the crash, several suspicious things happen:

 -All map output files from the map phase of the join are written to
 spark.local.dir, even though there should be plenty of memory left to
 contain the map output.  I am reasonably sure *all* map outputs are
 written to disk because the size of the map output spill directory
 matches the size of the shuffle write (as displayed in the user
 interface) for each machine.


The shuffle data is written through the buffer cache of the operating
system, so you would expect the files to show up there immediately and
probably to show up as being their full size when you do ls. In reality
though these are likely residing in the OS cache and not on disk.


 -In the beginning of the reduce phase of the join, memory consumption
 on each worker spikes and each machine runs out of memory (as evidenced
 by a "Cannot allocate memory" exception in Java).  This is
 particularly surprising since each machine has 30G of RAM and each
 Spark worker has only been allowed 10G.


Could you paste the error here?


 -In the web UI both the Shuffle Spill (Memory) and Shuffle Spill
 (Disk) fields for each machine remain at 0.0 despite shuffle files
 being written into spark.local.dir.


Shuffle spill is different than the shuffle files written to
spark.local.dir. Shuffle spilling is for aggregations that occur on the
reduce side of the shuffle. A join like this might not see any spilling.



From the logs it looks like your executor has died. Would you be able to
paste the log from the executor with the exact failure? It would show up in
the /work directory inside of spark's directory on the cluster.


[BLOG] For Beginners

2014-04-07 Thread prabeesh k
Hi all,

Here I am sharing blog posts for beginners, about creating a standalone Spark
Streaming application and bundling the app as a single runnable jar. Take a
look and drop your comments on the blog pages.

http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html

http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html

prabeesh


Mongo-Hadoop Connector with Spark

2014-04-07 Thread Pavan Kumar
Hi Everyone,

I saved a 2GB PDF file into MongoDB using GridFS. Now I want to process that
GridFS collection data using a Java Spark MapReduce job. Previously I have
successfully processed normal MongoDB collections (not GridFS) with Apache
Spark using the Mongo-Hadoop connector. Now I'm unable to handle GridFS
collections as input.

My questions here are:

1) How do I pass MongoDB GridFS data as input to our Spark application?
2) Do we have a separate RDD to handle GridFS binary data?

I'm trying the following snippet but I'm unable to get the actual data.

 MongoConfigUtil.setInputURI(config,
     "mongodb://localhost:27017/pdfbooks.fs.chunks");
 MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
 JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
     com.mongodb.hadoop.MongoInputFormat.class, Object.class,
     BSONObject.class);
 JavaRDD<String> words = mongoRDD.flatMap(
     new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
       @Override
       public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
         System.out.println(arg._2.toString());
         ...

Please suggest available ways to do this. Thank you in advance!!!
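For reference, a hedged Scala sketch (untested; it assumes the standard GridFS chunk schema with files_id, n and data fields, and mirrors the configuration from the snippet above) of reassembling each file's bytes from fs.chunks before handing them to a PDF parser:

import com.mongodb.hadoop.MongoInputFormat
import com.mongodb.hadoop.util.MongoConfigUtil
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.bson.BSONObject

object GridFsReassemble {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "gridfs-reassemble")
    val config = new Configuration()
    MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/pdfbooks.fs.chunks")

    val chunks = sc.newAPIHadoopRDD(config, classOf[MongoInputFormat],
      classOf[Object], classOf[BSONObject])

    // Each record is one GridFS chunk; group by files_id and order by n to
    // rebuild the original binary. Depending on the BSON decoder, the data
    // field may arrive as Array[Byte] or as org.bson.types.Binary.
    val files = chunks
      .map { case (_, doc) =>
        val bytes = doc.get("data") match {
          case b: Array[Byte]           => b
          case b: org.bson.types.Binary => b.getData
        }
        (doc.get("files_id").toString, (doc.get("n").asInstanceOf[Int], bytes))
      }
      .groupByKey()
      .mapValues(parts => parts.toSeq.sortBy(_._1).map(_._2).reduce(_ ++ _))

    files.mapValues(_.length).collect().foreach(println)
    sc.stop()
  }
}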