Re: Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
Hi,

This is on a 4-node cluster, each node with 32 cores and 256GB RAM.

Spark 0.9.0 is deployed in standalone mode.

Each worker is configured with 192GB, and the Spark executor memory is also 192GB.

This is on the first iteration, with K=50. Here's the code I use:
http://pastebin.com/2yXL3y8i, which is a copy-and-paste of the example.
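
In case the link goes stale, the gist of it is roughly this (a minimal sketch, not the exact paste; it assumes the 0.9 MLlib API where points are plain Array[Double], and the path and iteration count below are only illustrative):

  import org.apache.spark.mllib.clustering.KMeans

  // Parse whitespace-separated doubles into points and cache them.
  val points = sc.textFile("/data/kmeans_points")          // placeholder path
    .map(line => line.split(' ').map(_.toDouble))
    .cache()

  // Train with k = 50; 10 iterations is just an illustrative value.
  val model = KMeans.train(points, 50, 10)
  println("Cost: " + model.computeCost(points))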

Thanks!



On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng  wrote:

> Hi Tsai,
> 
> Could you share more information about the machine you used and the
> training parameters (runs, k, and iterations)? That will help in
> diagnosing the issue. Thanks!
> 
> Best,
> Xiangrui
> 
> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming  wrote:
>> Hi,
>> 
>> At the reduceByKey stage, it takes a few minutes before the tasks start 
>> working.
>> 
>> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>> 
>> CPU/Network/IO is idling across all nodes when this is happening.
>> 
>> And there is nothing particular on the master log file. From the spark-shell:
>> 
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on 
>> executor 2: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 
>> bytes in 193 ms
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on 
>> executor 1: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 
>> bytes in 96 ms
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on 
>> executor 0: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 
>> bytes in 100 ms
>> 
>> But it stops there for some significant time before any movement.
>> 
>> In the stage detail of the UI, I can see that there are 127 tasks running 
>> but the duration each is at least a few minutes.
>> 
>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB 
>> (50M rows).
>> 
>> Is this a normal behaviour?
>> 
>> Thanks!



Re: Kmeans example reduceByKey slow

2014-03-23 Thread Xiangrui Meng
Hi Tsai,

Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? That will help in
diagnosing the issue. Thanks!

Best,
Xiangrui

On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming  wrote:
> Hi,
>
> At the reduceByKey stage, it takes a few minutes before the tasks start 
> working.
>
> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>
> CPU/Network/IO is idling across all nodes when this is happening.
>
> And there is nothing particular on the master log file. From the spark-shell:
>
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on 
> executor 2: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 
> bytes in 193 ms
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on 
> executor 1: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 
> bytes in 96 ms
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on 
> executor 0: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 
> bytes in 100 ms
>
> But it stops there for some significant time before any movement.
>
> In the stage detail of the UI, I can see that there are 127 tasks running but 
> the duration each is at least a few minutes.
>
> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB 
> (50M rows).
>
> Is this a normal behaviour?
>
> Thanks!


Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
 wrote:
> I would love to work on this (and other) stuff if I can bother someone with
> questions offline or on a dev mailing list.
> Ognen
>
>
> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>
> Thanks for bringing this up; 100% inode utilization is an issue I haven't
> seen raised before, and it raises another issue which is not on our current
> roadmap for state cleanup (cleaning up data that was not fully cleaned up
> from a crashed process).
>
>
> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
>  wrote:
>>
>> Bleh, strike that, one of my slaves was at 100% inode utilization on the
>> file system. It was /tmp/spark* leftovers that apparently did not get
>> cleaned up properly after failed or interrupted jobs.
>> Mental note - run a cron job on all slaves and master to clean up
>> /tmp/spark* regularly.
>>
>> Thanks (and sorry for the noise)!
>> Ognen
>>
>>
>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>
>> Aaron, thanks for replying. I am very much puzzled as to what is going on.
>> A job that used to run on the same cluster is failing with this mysterious
>> message about not having enough disk space when in fact I can see through
>> "watch df -h" that the free space is always hovering around 3+GB on the disk
>> and the free inodes are at 50% (this is on master). I went through each
>> slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
>> and no mention of too many open files failures on any of the slaves nor on
>> the master :(
>>
>> Thanks
>> Ognen
>>
>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>
>> By default, with P partitions (for both the pre-shuffle stage and
>> post-shuffle), there are P^2 files created. With
>> spark.shuffle.consolidateFiles turned on, we would instead create only P
>> files. Disk space consumption is largely unaffected, however, by the number 
>> of partitions unless each partition is particularly small.
>>
>> You might look at the actual executors' logs, as it's possible that this
>> error was caused by an earlier exception, such as "too many open files".
>>
>>
>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>  wrote:
>>>
>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>
>>> You can set spark.local.dir to put this data somewhere other than /tmp if
>>> /tmp is full. Actually it's recommended to have multiple local disks and set
>>> it to a comma-separated list of directories, one per disk.
>>>
>>> Matei, does the number of tasks/partitions in a transformation influence
>>> something in terms of disk space consumption? Or inode consumption?
>>>
>>> Thanks,
>>> Ognen
>>
>>
>>
>> --
>> "A distributed system is one in which the failure of a computer you didn't
>> even know existed can render your own computer unusable"
>> -- Leslie Lamport
>
>
>
> --
> "No matter what they ever do to us, we must always act for the love of our
> people and the earth. We must not react out of hatred against those who have
> no sense."
> -- John Trudell


Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
As Mark said, you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().

This can happen for example if you do a highly selective filter on an
RDD. For instance, you filter out one day of data from a dataset of a
year.
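
For instance (a rough sketch in the shell; the path and the target of 24
partitions are just illustrative):

scala> val logs = sc.textFile("hdfs:///logs/2014/*")          // a year of data, many partitions
scala> val oneDay = logs.filter(_.startsWith("2014-03-23"))   // highly selective filter
scala> oneDay.partitions.size                                 // still as many (mostly empty) partitions as the input
scala> val compacted = oneDay.coalesce(24)                    // shrink to a handful of partitions before further work
scala> compacted.partitions.size                              // => 24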

- Patrick

On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra  wrote:
> It's much simpler: rdd.partitions.size
>
>
> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
>  wrote:
>>
>> Hey there fellow Dukes of Data,
>>
>> How can I tell how many partitions my RDD is split into?
>>
>> I'm interested in knowing because, from what I gather, having a good
>> number of partitions is good for performance. If I'm looking to understand
>> how my pipeline is performing, say for a parallelized write out to HDFS,
>> knowing how many partitions an RDD has would be a good thing to check.
>>
>> Is that correct?
>>
>> I could not find an obvious method or property to see how my RDD is
>> partitioned. Instead, I devised the following thingy:
>>
>> def f(idx, itr): yield idx
>>
>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>> rdd.mapPartitionsWithIndex(f).count()
>>
>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
>> answer I'm looking for. Derp. :)
>>
>> So in summary, should I care about how finely my RDDs are partitioned? And
>> how would I check on that?
>>
>> Nick
>>
>>
>> 
>> View this message in context: How many partitions is my RDD split into?
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: How many partitions is my RDD split into?

2014-03-23 Thread Mark Hamstra
It's much simpler: rdd.partitions.size
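
For example, in the shell:

scala> sc.parallelize(1 to 100, 4).partitions.size
res0: Int = 4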


On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Hey there fellow Dukes of Data,
>
> How can I tell how many partitions my RDD is split into?
>
> I'm interested in knowing because, from what I gather, having a good
> number of partitions is good for performance. If I'm looking to understand
> how my pipeline is performing, say for a parallelized write out to HDFS,
> knowing how many partitions an RDD has would be a good thing to check.
>
> Is that correct?
>
> I could not find an obvious method or property to see how my RDD is
> partitioned. Instead, I devised the following thingy:
>
> def f(idx, itr): yield idx
>
> rdd = sc.parallelize([1, 2, 3, 4], 4)
> rdd.mapPartitionsWithIndex(f).count()
>
> Frankly, I'm not sure what I'm doing here, but this seems to give me the
> answer I'm looking for. Derp. :)
>
> So in summary, should I care about how finely my RDDs are partitioned? And
> how would I check on that?
>
> Nick
>
>
> --
> View this message in context: How many partitions is my RDD split into?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


How many partitions is my RDD split into?

2014-03-23 Thread Nicholas Chammas
Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?

I'm interested in knowing because, from what I gather, having a good number
of partitions is good for performance. If I'm looking to understand how my
pipeline is performing, say for a parallelized write out to HDFS, knowing
how many partitions an RDD has would be a good thing to check.

Is that correct?

I could not find an obvious method or property to see how my RDD is
partitioned. Instead, I devised the following thingy:

def f(idx, itr): yield idx

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the
answer I'm looking for. Derp. :)

So in summary, should I care about how finely my RDDs are partitioned? And
how would I check on that?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
I would love to work on this (and other) stuff if I can bother someone 
with questions offline or on a dev mailing list.

Ognen

On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up; 100% inode utilization is an issue I 
haven't seen raised before, and it raises another issue which is not 
on our current roadmap for state cleanup (cleaning up data that was 
not fully cleaned up from a crashed process).



On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski 
<og...@plainvanillagames.com> wrote:


Bleh, strike that, one of my slaves was at 100% inode utilization
on the file system. It was /tmp/spark* leftovers that apparently
did not get cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen


On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

Aaron, thanks for replying. I am very much puzzled as to what is
going on. A job that used to run on the same cluster is failing
with this mysterious message about not having enough disk space
when in fact I can see through "watch df -h" that the free space
is always hovering around 3+GB on the disk and the free inodes
are at 50% (this is on master). I went through each slave and the
spark/work/app*/stderr and stdout and spark/logs/*out files and
no mention of too many open files failures on any of the slaves
nor on the master :(

Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:

By default, with P partitions (for both the pre-shuffle stage
and post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead
create only P files. Disk space consumption is largely
unaffected, however, by the number of partitions unless each
partition is particularly small.

You might look at the actual executors' logs, as it's possible
that this error was caused by an earlier exception, such as "too
many open files".


On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:

On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere
other than /tmp if /tmp is full. Actually it's recommended
to have multiple local disks and set it to a
comma-separated list of directories, one per disk.

Matei, does the number of tasks/partitions in a
transformation influence something in terms of disk space
consumption? Or inode consumption?

Thanks,
Ognen





-- 
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"

-- Leslie Lamport



--
"No matter what they ever do to us, we must always act for the love of our people 
and the earth. We must not react out of hatred against those who have no sense."
-- John Trudell


Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
Thanks for bringing this up; 100% inode utilization is an issue I haven't
seen raised before, and it raises another issue which is not on our
current roadmap for state cleanup (cleaning up data that was not fully
cleaned up from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski <
og...@plainvanillagames.com> wrote:

>  Bleh, strike that, one of my slaves was at 100% inode utilization on the
> file system. It was /tmp/spark* leftovers that apparently did not get
> cleaned up properly after failed or interrupted jobs.
> Mental note - run a cron job on all slaves and master to clean up
> /tmp/spark* regularly.
>
> Thanks (and sorry for the noise)!
> Ognen
>
>
> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>
> Aaron, thanks for replying. I am very much puzzled as to what is going on.
> A job that used to run on the same cluster is failing with this mysterious
> message about not having enough disk space when in fact I can see through
> "watch df -h" that the free space is always hovering around 3+GB on the
> disk and the free inodes are at 50% (this is on master). I went through
> each slave and the spark/work/app*/stderr and stdout and spark/logs/*out
> files and no mention of too many open files failures on any of the slaves
> nor on the master :(
>
> Thanks
> Ognen
>
> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>
> By default, with P partitions (for both the pre-shuffle stage and
> post-shuffle), there are P^2 files created.
> With spark.shuffle.consolidateFiles turned on, we would instead create only
> P files. Disk space consumption is largely unaffected, however, by the
> number of partitions unless each partition is particularly small.
>
>  You might look at the actual executors' logs, as it's possible that this
> error was caused by an earlier exception, such as "too many open files".
>
>
> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski <
> og...@plainvanillagames.com> wrote:
>
>>  On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>
>> You can set spark.local.dir to put this data somewhere other than /tmp if
>> /tmp is full. Actually it's recommended to have multiple local disks and
>> set it to a comma-separated list of directories, one per disk.
>>
>>  Matei, does the number of tasks/partitions in a transformation influence
>> something in terms of disk space consumption? Or inode consumption?
>>
>> Thanks,
>> Ognen
>>
>
>
> --
> "A distributed system is one in which the failure of a computer you didn't 
> even know existed can render your own computer unusable"
> -- Leslie Lamport
>
>


Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Ah, interesting. count() without distinct is streaming, for instance, and does not
require that a single partition fit in memory. That said, the
behavior may change if you increase the number of partitions in your input
RDD by using RDD.repartition().
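
Something along these lines (a sketch; 512 is just an illustrative partition count):

scala> val data = sc.textFile("path/to/big/file")        // whatever your input is
scala> data.repartition(512).distinct().count()          // more, smaller partitions for the distinct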


On Sun, Mar 23, 2014 at 11:47 AM, Kane  wrote:

> Yes, there was an error in the data; after fixing it, count fails with an Out of
> Memory error.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Bleh, strike that, one of my slaves was at 100% inode utilization on the 
file system. It was /tmp/spark* leftovers that apparently did not get 
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up 
/tmp/spark* regularly.


Thanks (and sorry for the noise)!
Ognen

On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
Aaron, thanks for replying. I am very much puzzled as to what is going 
on. A job that used to run on the same cluster is failing with this 
mysterious message about not having enough disk space when in fact I 
can see through "watch df -h" that the free space is always hovering 
around 3+GB on the disk and the free inodes are at 50% (this is on 
master). I went through each slave and the spark/work/app*/stderr and 
stdout and spark/logs/*out files and no mention of too many open files 
failures on any of the slaves nor on the master :(


Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle stage and 
post-shuffle), there are P^2 files created. 
With spark.shuffle.consolidateFiles turned on, we would instead 
create only P files. Disk space consumption is largely unaffected, 
however, by the number of partitions unless each partition is 
particularly small.


You might look at the actual executors' logs, as it's possible that 
this error was caused by an earlier exception, such as "too many open 
files".



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
<og...@plainvanillagames.com> wrote:


On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other
than /tmp if /tmp is full. Actually it's recommended to have
multiple local disks and set it to a comma-separated list of
directories, one per disk.

Matei, does the number of tasks/partitions in a transformation
influence something in terms of disk space consumption? Or inode
consumption?

Thanks,
Ognen





--
"A distributed system is one in which the failure of a computer you didn't even know 
existed can render your own computer unusable"
-- Leslie Lamport



Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Aaron, thanks for replying. I am very much puzzled as to what is going 
on. A job that used to run on the same cluster is failing with this 
mysterious message about not having enough disk space when in fact I can 
see through "watch df -h" that the free space is always hovering around 
3+GB on the disk and the free inodes are at 50% (this is on master). I 
went through each slave and the spark/work/app*/stderr and stdout and 
spark/logs/*out files and no mention of too many open files failures on 
any of the slaves nor on the master :(


Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle stage and 
post-shuffle), there are P^2 files created. 
With spark.shuffle.consolidateFiles turned on, we would instead create 
only P files. Disk space consumption is largely unaffected, however, 
by the number of partitions unless each partition is particularly small.


You might look at the actual executors' logs, as it's possible that 
this error was caused by an earlier exception, such as "too many open 
files".



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
<og...@plainvanillagames.com> wrote:


On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other than
/tmp if /tmp is full. Actually it's recommended to have multiple
local disks and set it to a comma-separated list of directories,
one per disk.

Matei, does the number of tasks/partitions in a transformation
influence something in terms of disk space consumption? Or inode
consumption?

Thanks,
Ognen





Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I don’t see the errors anymore. Thanks Aaron.

On 24-Mar-2014, at 12:52 am, Aaron Davidson  wrote:

> These errors should be fixed on master with Sean's PR: 
> https://github.com/apache/spark/pull/209
> 
> The orbit errors are quite possibly due to using https instead of http, 
> whether or not the SSL cert was bad. Let us know if they go away with 
> reverting to http.
> 
> 
> On Sun, Mar 23, 2014 at 11:48 AM, Debasish Das  
> wrote:
> I am getting these weird errors which I have not seen before:
> 
> [error] Server access Error: handshake alert:  unrecognized_name 
> url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
> [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
> [error] Server access Error: handshake alert:  unrecognized_name 
> url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.transaction/1.1.1.v201105210645/javax.transaction-1.1.1.v201105210645.orbit
> [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
> [error] Server access Error: handshake alert:  unrecognized_name 
> url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.mail.glassfish/1.4.1.v201005082020/javax.mail.glassfish-1.4.1.v201005082020.orbit
> [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
> [error] Server access Error: handshake alert:  unrecognized_name 
> url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.activation/1.1.0.v201105071233/javax.activation-1.1.0.v201105071233.orbit
> 
> Is it also due to the SSL error ?
> 
> Thanks.
> Deb
> 
> 
> 
> On Sun, Mar 23, 2014 at 6:08 AM, Sean Owen  wrote:
> I'm also seeing this. It also was working for me previously AFAIK.
> 
> The proximate cause is my well-intentioned change that uses HTTPS to access 
> all artifact repos. The default for Maven Central before would have been 
> HTTP. While it's a good idea to use HTTPS, it may run into complications.
> 
> I see:
> 
> https://issues.sonatype.org/browse/MVNCENTRAL-377 
> https://issues.apache.org/jira/browse/INFRA-7363
> 
> ... which suggest that actually this isn't supported, and should return 404. 
> 
> Then I see:
> http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
> ... that suggests you have to pay for HTTPS access to Maven Central?
> 
> But, we both seem to have found it working (even after the change above) and 
> it does not return 404 now.
> 
> The Maven build works, still, but if you look carefully, it's actually only 
> because it eventually falls back to HTTP for Maven Central artifacts.
> 
> So I think the thing to do is simply back off to HTTP for Maven Central only. 
> That's unfortunate because there's a small but non-trivial security issue 
> here in downloading artifacts without security.
> 
> Any brighter ideas? I'll open a supplementary PR if so to adjust this.
> 
> 
> 



Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your 
SparkContext? It tries to serialize that many objects together at a time, which 
might be too much. By default the batchSize is 1024.

Matei

On Mar 23, 2014, at 10:11 AM, Jeremy Freeman  wrote:

> Hi all,
> 
> Hitting a mysterious error loading large text files, specific to PySpark
> 0.9.0.
> 
> In PySpark 0.8.1, this works:
> 
> data = sc.textFile("path/to/myfile")
> data.count()
> 
> But in 0.9.0, it stalls. There are indications of completion up to:
> 
> 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
> (progress: 15/537)
> 14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)
> 
> And then this repeats indefinitely
> 
> 14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
> runningTasks: 144
> 14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
> runningTasks: 144
> 
> Always stalls at the same place. There's nothing in stderr on the workers,
> but in stdout there are several of these messages:
> 
> INFO PythonRDD: stdin writer to Python finished early
> 
> So perhaps the real error is being suppressed as in
> https://spark-project.atlassian.net/browse/SPARK-1025
> 
> Data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k
> characters per row. Running on a private cluster with 10 nodes, 100GB / 16
> cores each, Python v 2.7.6.
> 
> I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
> and in PySpark in 0.8.1. Happy to post the file, but it should repro for
> anything with these dimensions. It *might* be specific to long strings: I
> don't see it with fewer characters (10k) per row, but I also don't see it
> with many fewer rows but the same number of characters per row.
> 
> Happy to try and provide more info / help debug!
> 
> -- Jeremy
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.



is it possible to access the inputsplit in Spark directly?

2014-03-23 Thread hwpstorage
Hello,
In Spark we can use newAPIHadoopRDD to access different distributed
systems like HDFS, HBase, and MongoDB via different InputFormats.
Is it possible to access the InputSplit in Spark directly? Spark can
cache data in local memory, and performing local computation/aggregation
on the local InputSplit could speed up the overall performance.
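
For instance, the kind of per-split local aggregation I have in mind is roughly this (a sketch; mapPartitions stands in for direct InputSplit access, since each partition of a Hadoop-backed RDD corresponds to one split):

  val data = sc.textFile("hdfs:///some/path")            // placeholder input
  val perSplitCounts = data.mapPartitions { iter =>
    // runs once per partition, i.e. once per input split for Hadoop-backed RDDs
    Iterator(iter.size)
  }
  perSplitCounts.collect()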

Thanks a lot


Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead create only
P files. Disk space consumption is largely unaffected, however, by the
number of partitions unless each partition is particularly small.

You might look at the actual executors' logs, as it's possible that this
error was caused by an earlier exception, such as "too many open files".
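
If you want to try the consolidation setting, something like this when creating the context should do it (a sketch; the master URL and app name are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://master:7077")
    .setAppName("my-app")
    .set("spark.shuffle.consolidateFiles", "true")   // write P shuffle files instead of P^2
  val sc = new SparkContext(conf)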


On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski <
og...@plainvanillagames.com> wrote:

>  On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>
> You can set spark.local.dir to put this data somewhere other than /tmp if
> /tmp is full. Actually it's recommended to have multiple local disks and
> set it to a comma-separated list of directories, one per disk.
>
> Matei, does the number of tasks/partitions in a transformation influence
> something in terms of disk space consumption? Or inode consumption?
>
> Thanks,
> Ognen
>


Re: Problem with SparkR

2014-03-23 Thread Shivaram Venkataraman
Hi

Thanks for reporting this. It'll be great if you can check a couple of
things:

1. Are you trying to use this with Hadoop2 by any chance ? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just want to check if the same error is cropping up again.

2. Is there a stack trace that follows the IncompatibleClassChangeError ?
If so could you attach that ? The error message indicates there is some
incompatibility between class versions and having a more detailed stack
trace might help us track this down.

Thanks
Shivaram





On Sun, Mar 23, 2014 at 4:48 PM, Jacques Basaldúa  wrote:

>  I am really interested in using Spark from R and have tried to use
> SparkR, but always get the same error.
>
>
>
> This is how I installed:
>
>
>
>  - I successfully installed Spark version  0.9.0 with Scala  2.10.3
> (OpenJDK 64-Bit Server VM, Java 1.7.0_45)
>
>I can run examples from spark-shell and Python
>
>
>
>  - I installed the R package devtools and installed SparkR using:
>
>
>
>  - library(devtools)
>
>  - install_github("amplab-extras/SparkR-pkg", subdir="pkg")
>
>
>
>   This compiled the package successfully.
>
>
>
> When I try to run the package
>
>
>
> E.g.,
>
>   library(SparkR)
>
>   sc <- sparkR.init(master="local")   //- so far the program runs
> fine
>
>
>
>   rdd <- parallelize(sc, 1:10)  // This returns the following error
>
>
>
>   Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") :
>
>   java.lang.IncompatibleClassChangeError:
> org/apache/spark/util/InnerClosureFinder
>
>
>
> No matter how I try to use the sc (I have tried all the examples) I always
> get an error.
>
>
>
> Any ideas?
>
>
>
> Jacques.
>


Problem with SparkR

2014-03-23 Thread Jacques Basaldúa
I am really interested in using Spark from R and have tried to use SparkR,
but always get the same error.

 

This is how I installed:

 

 - I successfully installed Spark version  0.9.0 with Scala  2.10.3 (OpenJDK
64-Bit Server VM, Java 1.7.0_45)

   I can run examples from spark-shell and Python

 

 - I installed the R package devtools and installed SparkR using:

 

 - library(devtools)

 - install_github("amplab-extras/SparkR-pkg", subdir="pkg")

 

  This compiled the package successfully.

  

When I try to run the package

 

E.g., 

  library(SparkR)

  sc <- sparkR.init(master="local")   //- so far the program runs
fine

 

  rdd <- parallelize(sc, 1:10)  // This returns the following error



  Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") : 

  java.lang.IncompatibleClassChangeError:
org/apache/spark/util/InnerClosureFinder

 

No matter how I try to use the sc (I have tried all the examples) I always
get an error.

 

Any ideas?

 

Jacques.



Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski

On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp 
if /tmp is full. Actually it’s recommended to have multiple local 
disks and set it to a comma-separated list of directories, one per disk.
Matei, does the number of tasks/partitions in a transformation influence 
something in terms of disk space consumption? Or inode consumption?


Thanks,
Ognen


Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
Hey All,

I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J

The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since you don't need to create a
new utility each time you want to do this.

In Spark 1.0 and later you will be able to do this more elegantly with
the schema support:
myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
Average('duration) as 'duration)

and it will use a single pass automatically... but that's not quite
released yet :)

- Patrick




On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers  wrote:
> i currently typically do something like this:
>
> scala> val rdd = sc.parallelize(1 to 10)
> scala> import com.twitter.algebird.Operators._
> scala> import com.twitter.algebird.{Max, Min}
> scala> rdd.map{ x => (
>  |   1L,
>  |   Min(x),
>  |   Max(x),
>  |   x
>  | )}.reduce(_ + _)
> res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int],
> Int) = (10,Min(1),Max(10),55)
>
> however for this you need twitter algebird dependency. without that you have
> to code the reduce function on the tuples yourself...
>
> another example with 2 columns, where i do conditional count for first
> column, and simple sum for second:
> scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
>  |   if (x > 5) 1 else 0,
>  |   y
>  | )}.reduce(_ + _)
> res3: (Int, Int) = (5,155)
>
>
>
> On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling 
> wrote:
>>
>> Hi Koert, Patrick,
>>
>> do you already have an elegant solution to combine multiple operations on
>> a single RDD?
>> Say for example that I want to do a sum over one column, a count and an
>> average over another column,
>>
>> thanks in advance,
>> Richard
>>
>>
>> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling 
>> wrote:
>>>
>>> Patrick, Koert,
>>>
>>> I'm also very interested in these examples, could you please post them if
>>> you find them?
>>> thanks in advance,
>>> Richard
>>>
>>>
>>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers  wrote:

 not that long ago there was a nice example on here about how to combine
 multiple operations on a single RDD. so basically if you want to do a
 count() and something else, how to roll them into a single job. i think
 patrick wendell gave the examples.

 i cant find them anymore patrick can you please repost? thanks!
>>>
>>>
>>
>


Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski


On 3/23/14, 5:35 PM, Aaron Davidson wrote:
On some systems, /tmp/ is an in-memory tmpfs file system, with its own 
size limit. It's possible that this limit has been exceeded. You might 
try running the "df" command to check to free space of "/tmp" or root 
if tmp isn't listed.


3 GB also seems pretty low for the remaining free space of a disk. If 
your disk size is in the TB range, it's possible that the last couple 
GB have issues when being allocated due to fragmentation or 
reclamation policies.




Aaron, thanks for the reply. These are Amazon Ubuntu instances with an 
8GB root filesystem (everything, OS + Spark, takes up about 4.5GB). 
/tmp appears to be just a regular directory in /, so it shares the same 
~3.5GB of free space. I was "watch df"ing the space while my job was 
running.

Ognen


Re: No space left on device exception

2014-03-23 Thread Matei Zaharia
You can set spark.local.dir to put this data somewhere other than /tmp if /tmp 
is full. Actually it's recommended to have multiple local disks and set it to a 
comma-separated list of directories, one per disk.
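
For example, one way to set it (a sketch; the mount points, master URL, and app name are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://master:7077")
    .setAppName("my-app")
    .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")   // one directory per disk
  val sc = new SparkContext(conf)

It can also be passed to the JVM as -Dspark.local.dir=... if you are not using SparkConf.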

Matei

On Mar 23, 2014, at 3:35 PM, Aaron Davidson  wrote:

> On some systems, /tmp/ is an in-memory tmpfs file system, with its own size 
> limit. It's possible that this limit has been exceeded. You might try running 
> the "df" command to check to free space of "/tmp" or root if tmp isn't listed.
> 
> 3 GB also seems pretty low for the remaining free space of a disk. If your 
> disk size is in the TB range, it's possible that the last couple GB have 
> issues when being allocated due to fragmentation or reclamation policies.
> 
> 
> On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski  
> wrote:
> Hello,
> 
> I have a weird error showing up when I run a job on my Spark cluster. The 
> version of spark is 0.9 and I have 3+ GB free on the disk when this error 
> shows up. Any ideas what I should be looking for?
> 
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 
> 167.0:3 failed 4 times (most recent failure: Exception failure: 
> java.io.FileNotFoundException: 
> /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on 
> device))
> org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times 
> (most recent failure: Exception failure: java.io.FileNotFoundException: 
> /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on 
> device))
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> 
> Thanks!
> Ognen
> 



Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size
limit. It's possible that this limit has been exceeded. You might try
running the "df" command to check to free space of "/tmp" or root if tmp
isn't listed.

3 GB also seems pretty low for the remaining free space of a disk. If your
disk size is in the TB range, it's possible that the last couple GB have
issues when being allocated due to fragmentation or reclamation policies.


On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski
wrote:

> Hello,
>
> I have a weird error showing up when I run a job on my Spark cluster. The
> version of spark is 0.9 and I have 3+ GB free on the disk when this error
> shows up. Any ideas what I should be looking for?
>
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
> 167.0:3 failed 4 times (most recent failure: Exception failure:
> java.io.FileNotFoundException: 
> /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127
> (No space left on device))
> org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times
> (most recent failure: Exception failure: java.io.FileNotFoundException:
> /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left
> on device))
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
> apache$spark$scheduler$DAGScheduler$$abortStage$1.
> apply(DAGScheduler.scala:1028)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
> apache$spark$scheduler$DAGScheduler$$abortStage$1.
> apply(DAGScheduler.scala:1026)
> at scala.collection.mutable.ResizableArray$class.foreach(
> ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
> scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> processEvent$10.apply(DAGScheduler.scala:619)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.processEvent(
> DAGScheduler.scala:619)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$
> $anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>
> Thanks!
> Ognen
>


No space left on device exception

2014-03-23 Thread Ognen Duzlevski

Hello,

I have a weird error showing up when I run a job on my Spark cluster. 
The version of spark is 0.9 and I have 3+ GB free on the disk when this 
error shows up. Any ideas what I should be looking for?


[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 
167.0:3 failed 4 times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 
times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)

at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)


Thanks!
Ognen


Re: combining operations elegantly

2014-03-23 Thread Koert Kuipers
i currently typically do something like this:

scala> val rdd = sc.parallelize(1 to 10)
scala> import com.twitter.algebird.Operators._
scala> import com.twitter.algebird.{Max, Min}
scala> rdd.map{ x => (
 |   1L,
 |   Min(x),
 |   Max(x),
 |   x
 | )}.reduce(_ + _)
res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int],
Int) = (10,Min(1),Max(10),55)

however for this you need twitter algebird dependency. without that you
have to code the reduce function on the tuples yourself...
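
for example, a hand-rolled version of the first example looks something like this (just a sketch):

scala> val rdd = sc.parallelize(1 to 10)
scala> rdd.map(x => (1L, x, x, x)).reduce { case ((c1, mn1, mx1, s1), (c2, mn2, mx2, s2)) =>
     |   (c1 + c2, math.min(mn1, mn2), math.max(mx1, mx2), s1 + s2)
     | }
res1: (Long, Int, Int, Int) = (10,1,10,55)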

another example with 2 columns, where i do conditional count for first
column, and simple sum for second:
scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
 |   if (x > 5) 1 else 0,
 |   y
 | )}.reduce(_ + _)
res3: (Int, Int) = (5,155)



On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling wrote:

> Hi Koert, Patrick,
>
> do you already have an elegant solution to combine multiple operations on
> a single RDD?
> Say for example that I want to do a sum over one column, a count and an
> average over another column,
>
> thanks in advance,
> Richard
>
>
> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling 
> wrote:
>
>> Patrick, Koert,
>>
>> I'm also very interested in these examples, could you please post them if
>> you find them?
>> thanks in advance,
>> Richard
>>
>>
>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers  wrote:
>>
>>> not that long ago there was a nice example on here about how to combine
>>> multiple operations on a single RDD. so basically if you want to do a
>>> count() and something else, how to roll them into a single job. i think
>>> patrick wendell gave the examples.
>>>
>>> i cant find them anymore patrick can you please repost? thanks!
>>>
>>
>>
>


Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Aaron Davidson
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209

The orbit errors are quite possibly due to using https instead of http,
whether or not the SSL cert was bad. Let us know if they go away with
reverting to http.


On Sun, Mar 23, 2014 at 11:48 AM, Debasish Das wrote:

> I am getting these weird errors which I have not seen before:
>
> [error] Server access Error: handshake alert:  unrecognized_name url=
> https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
> [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
> [error] Server access Error: handshake alert:  unrecognized_name url=
> https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.transaction/1.1.1.v201105210645/javax.transaction-1.1.1.v201105210645.orbit
> [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
> [error] Server access Error: handshake alert:  unrecognized_name url=
> https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.mail.glassfish/1.4.1.v201005082020/javax.mail.glassfish-1.4.1.v201005082020.orbit
> [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
> [error] Server access Error: handshake alert:  unrecognized_name url=
> https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.activation/1.1.0.v201105071233/javax.activation-1.1.0.v201105071233.orbit
>
> Is it also due to the SSL error ?
>
> Thanks.
> Deb
>
>
>
> On Sun, Mar 23, 2014 at 6:08 AM, Sean Owen  wrote:
>
>> I'm also seeing this. It also was working for me previously AFAIK.
>>
>> The proximate cause is my well-intentioned change that uses HTTPS to
>> access all artifact repos. The default for Maven Central before would have
>> been HTTP. While it's a good idea to use HTTPS, it may run into
>> complications.
>>
>> I see:
>>
>> https://issues.sonatype.org/browse/MVNCENTRAL-377
>> https://issues.apache.org/jira/browse/INFRA-7363
>>
>> ... which suggest that actually this isn't supported, and should return
>> 404.
>>
>> Then I see:
>>
>> http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
>> ... that suggests you have to pay for HTTPS access to Maven Central?
>>
>> But, we both seem to have found it working (even after the change above)
>> and it does not return 404 now.
>>
>> The Maven build works, still, but if you look carefully, it's actually
>> only because it eventually falls back to HTTP for Maven Central artifacts.
>>
>> So I think the thing to do is simply back off to HTTP for Maven Central
>> only. That's unfortunate because there's a small but non-trivial security
>> issue here in downloading artifacts without security.
>>
>> Any brighter ideas? I'll open a supplementary PR if so to adjust this.
>>
>>
>


Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Debasish Das
I am getting these weird errors which I have not seen before:

[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
[info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.transaction/1.1.1.v201105210645/javax.transaction-1.1.1.v201105210645.orbit
[info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.mail.glassfish/1.4.1.v201005082020/javax.mail.glassfish-1.4.1.v201005082020.orbit
[info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.activation/1.1.0.v201105071233/javax.activation-1.1.0.v201105071233.orbit

Is it also due to the SSL error ?

Thanks.
Deb



On Sun, Mar 23, 2014 at 6:08 AM, Sean Owen  wrote:

> I'm also seeing this. It also was working for me previously AFAIK.
>
> Tthe proximate cause is my well-intentioned change that uses HTTPS to
> access all artifact repos. The default for Maven Central before would have
> been HTTP. While it's a good idea to use HTTPS, it may run into
> complications.
>
> I see:
>
> https://issues.sonatype.org/browse/MVNCENTRAL-377
> https://issues.apache.org/jira/browse/INFRA-7363
>
> ... which suggest that actually this isn't supported, and should return
> 404.
>
> Then I see:
>
> http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
> ... that suggests you have to pay for HTTPS access to Maven Central?
>
> But, we both seem to have found it working (even after the change above)
> and it does not return 404 now.
>
> The Maven build works, still, but if you look carefully, it's actually
> only because it eventually falls back to HTTP for Maven Central artifacts.
>
> So I think the thing to do is simply back off to HTTP for Maven Central
> only. That's unfortunate because there's a small but non-trivial security
> issue here in downloading artifacts without security.
>
> Any brighter ideas? I'll open a supplementary PR if so to adjust this.
>
>


Re: distinct on huge dataset

2014-03-23 Thread Kane
Yes, there was an error in the data; after fixing it, count fails with an Out of
Memory error.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: combining operations elegantly

2014-03-23 Thread Richard Siebeling
Hi Koert, Patrick,

do you already have an elegant solution to combine multiple operations on a
single RDD?
Say for example that I want to do a sum over one column, a count and an
average over another column,

thanks in advance,
Richard


On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling wrote:

> Patrick, Koert,
>
> I'm also very interested in these examples, could you please post them if
> you find them?
> thanks in advance,
> Richard
>
>
> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers  wrote:
>
>> not that long ago there was a nice example on here about how to combine
>> multiple operations on a single RDD. so basically if you want to do a
>> count() and something else, how to roll them into a single job. i think
>> patrick wendell gave the examples.
>>
>> i cant find them anymore patrick can you please repost? thanks!
>>
>
>


error loading large files in PySpark 0.9.0

2014-03-23 Thread Jeremy Freeman
Hi all,

Hitting a mysterious error loading large text files, specific to PySpark
0.9.0.

In PySpark 0.8.1, this works:

data = sc.textFile("path/to/myfile")
data.count()

But in 0.9.0, it stalls. There are indications of completion up to:

14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
(progress: 15/537)
14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)

And then this repeats indefinitely

14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
runningTasks: 144
14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
runningTasks: 144

Always stalls at the same place. There's nothing in stderr on the workers,
but in stdout there are several of these messages:

INFO PythonRDD: stdin writer to Python finished early

So perhaps the real error is being suppressed as in
https://spark-project.atlassian.net/browse/SPARK-1025

Data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k
characters per row. Running on a private cluster with 10 nodes, 100GB / 16
cores each, Python v 2.7.6.

I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
and in PySpark in 0.8.1. Happy to post the file, but it should repro for
anything with these dimensions. It *might* be specific to long strings: I
don't see it with fewer characters (10k) per row, but I also don't see it
with many fewer rows but the same number of characters per row.

Happy to try and provide more info / help debug!

-- Jeremy



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Andrew, this should be fixed in 0.9.1, assuming it is the same hash
collision error we found there.

Kane, is it possible your bigger data is corrupt, such that any
operations on it fail?


On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash  wrote:

> FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and
> have it disabled now. The specific error behavior was that a join would
> consistently return one count of rows with spill enabled and another count
> with it disabled.
>
> Sent from my mobile phone
> On Mar 22, 2014 1:52 PM, "Kane"  wrote:
>
>> But I was wrong - map also fails on the big file, and setting
>> spark.shuffle.spill
>> doesn't help. Map fails with the same error.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>


Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Sean Owen
I'm also seeing this. It also was working for me previously AFAIK.

The proximate cause is my well-intentioned change that uses HTTPS to
access all artifact repos. The default for Maven Central before would have
been HTTP. While it's a good idea to use HTTPS, it may run into
complications.

I see:

https://issues.sonatype.org/browse/MVNCENTRAL-377
https://issues.apache.org/jira/browse/INFRA-7363

... which suggest that actually this isn't supported, and should return
404.

Then I see:
http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
... that suggests you have to pay for HTTPS access to Maven Central?

But, we both seem to have found it working (even after the change above)
and it does not return 404 now.

The Maven build works, still, but if you look carefully, it's actually only
because it eventually falls back to HTTP for Maven Central artifacts.

So I think the thing to do is simply back off to HTTP for Maven Central
only. That's unfortunate because there's a small but non-trivial security
issue here in downloading artifacts without security.
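
(For anyone who just needs a local workaround in the meantime, the kind of change I mean is roughly an sbt resolver pointed at plain HTTP, e.g.
  resolvers += "Maven Central (HTTP)" at "http://repo.maven.apache.org/maven2"
-- a sketch only, not the actual patch.)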

Any brighter ideas? I'll open a supplementary PR if so to adjust this.


sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I am facing a weird failure where "sbt/sbt assembly" shows a lot of SSL 
certificate errors for repo.maven.apache.org. Is anyone else facing the same 
problems? Any idea why this is happening? Yesterday I was able to successfully 
run it.

Loading https://repo.maven.apache.org shows an invalid cert chain.

—
Thanks

Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
Hi,

At the reduceByKey stage, it takes a few minutes before the tasks start 
working.

I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).

CPU/Network/IO is idling across all nodes when this is happening. 

And there is nothing particular on the master log file. From the spark-shell:

14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on 
executor 2: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 
bytes in 193 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on 
executor 1: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 
bytes in 96 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on 
executor 0: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 
bytes in 100 ms

But it stops there for some significant time before any movement. 

In the stage detail of the UI, I can see that there are 127 tasks running but 
the duration each is at least a few minutes.

I’m working off local storage (not hdfs) and the kmeans data is about 6.5GB 
(50M rows).

Is this a normal behaviour?

Thanks!