Qingyang,
Aha. Got it.
800MB of data is pretty small. Loading from Tachyon does add a bit of extra
overhead, but the benefit grows as the data size gets larger. Also, if you
store the table in Tachyon, different Shark servers can query the data at the
same time. For more
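Roughly, the Spark 1.x way to do this for a plain RDD looks like the sketch
below (the thread above is about Shark tables, so treat this only as an
illustration; the Tachyon URL and path are placeholders, not from this thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Sketch: persist an RDD off-heap; in Spark 1.x, OFF_HEAP data is stored
    // in Tachyon, so it can be shared beyond a single JVM.
    val conf = new SparkConf()
      .setAppName("tachyon-cache-sketch")
      .set("spark.tachyonStore.url", "tachyon://localhost:19998") // placeholder master
    val sc = new SparkContext(conf)

    val table = sc.textFile("hdfs:///data/table")    // placeholder path
    table.persist(StorageLevel.OFF_HEAP)             // Tachyon-backed in Spark 1.x
    table.count()                                    // materializes the cached copy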
I found the Spark JIRA ticket,
https://issues.apache.org/jira/browse/SPARK-911, and noticed that no one had
done it. It seemed like a useful and interesting feature, so I opened a
pull request here (https://github.com/apache/spark/pull/1381) and was
looking for some input on it (go easy on me
A few weeks ago I submitted a PR for supporting rdd.drop(n), under SPARK-2315:
https://issues.apache.org/jira/browse/SPARK-2315
Supporting the drop method would make some operations convenient; however, it
forces computation of >= 1 partition of the parent RDD, and so it would behave
like a
Personally I'd find the method useful -- I've often had a .csv file with a
header row that I want to drop, so I filter it out, which touches all
partitions anyway. I don't have any comments on the implementation quite
yet, though.
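For reference, the filter-out-the-header workaround mentioned above looks
roughly like this (the path is a placeholder, and it assumes no data row is
identical to the header line):

    val lines = sc.textFile("hdfs:///data/people.csv")  // placeholder path
    val header = lines.first()                          // small job on the first partition
    val rows = lines.filter(_ != header)                // touches all partitions, as noted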
On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson e...@redhat.com
Sure, drop() would be useful, but breaking the "transformations are lazy;
only actions launch jobs" model is abhorrent -- which is not to say that we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but rather that each such
exception to
- Original Message -
I too would like this feature. Erik's post makes sense. However, shouldn't
the RDD also repartition itself after drop to effectively make use of
cluster resources?
My thinking is that in most use cases (*), one is dropping a small number of
rows, and they are in
- Original Message -
Sure, drop() would be useful, but breaking the "transformations are lazy;
only actions launch jobs" model is abhorrent -- which is not to say that we
haven't already broken that model for useful operations (cf.
RangePartitioner, which is used for sorted RDDs), but
Thanks for the clarification, Sandy.
On Mon, Jul 21, 2014 at 12:52 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi RJ,
Spark Shell instantiates a SparkContext for you named sc. In other apps,
the user instantiates it themselves and can give the variable whatever name
they want, e.g. spark.
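For example, in a standalone app it would look something like this (the app
name and master below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // The variable name is up to you -- "spark" here, vs. the "sc" the shell provides.
    val conf = new SparkConf().setAppName("my-app").setMaster("local[2]")
    val spark = new SparkContext(conf)
    // ... spark.textFile(...), spark.parallelize(...), etc.
    spark.stop()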
Rather than embrace non-lazy transformations and add more of them, I'd
rather we 1) try to fully characterize the needs that are driving their
creation/usage; and 2) design and implement new Spark abstractions that
will allow us to meet those needs and eliminate existing non-lazy
transformations.
Hi, all
I’m running some unit tests for my Spark applications in Jenkins.
It seems that even if I set spark.executor.memory to 5g, the value I get with
Runtime.getRuntime.maxMemory is still around 1G.
Is it saying that Jenkins limits the process to use no more than 1G (by
default)? How do I change
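A quick sketch of what is being measured here: Runtime.getRuntime.maxMemory
reports the max heap of the JVM the test itself runs in, which comes from that
JVM's -Xmx, not from spark.executor.memory (that setting only sizes executor
JVMs). The sbt setting in the comment is an assumption about the build, so
adjust it for your setup:

    // Reports the test/driver JVM's own heap limit (set via -Xmx), not executor memory.
    val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
    println(s"Test JVM max heap: ${maxHeapMb} MB")

    // In an sbt build with forked tests, the heap is typically raised with, e.g.:
    //   javaOptions in Test += "-Xmx5g"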
Just a note - there was a fix in branch-1.0 (and Spark 1.0.1) that
introduced a new bug worse than the original one.
https://issues.apache.org/jira/browse/SPARK-1199
The original bug was an issue with case classes defined in the repl.
The fix caused a separate bug which broke most compound
- Original Message -
Rather than embrace non-lazy transformations and add more of them, I'd
rather we 1) try to fully characterize the needs that are driving their
creation/usage; and 2) design and implement new Spark abstractions that
will allow us to meet those needs and eliminate
One way to do this would be to have a GitHub hook that parses -1s or +1s
and posts a commit status [1] (like, say, Travis [2]) right next to the PR.
Does anybody know of an existing tool that does this?
Shivaram
[1] https://github.com/blog/1227-commit-status-api
[2]
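As a rough sketch of what such a hook would do with the Commit Status API [1]
(the repo, SHA, and token below are placeholders, not an existing tool):

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    // Post a commit status next to a PR via the GitHub Commit Status API.
    val repo  = "apache/spark"                 // placeholder repo
    val sha   = "0123456789abcdef"             // placeholder commit SHA
    val token = sys.env("GITHUB_TOKEN")        // assumed to be set in the environment

    val url  = new URL(s"https://api.github.com/repos/$repo/statuses/$sha")
    val body = """{"state":"success","context":"review","description":"+1 from a committer"}"""

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", s"token $token")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    println(s"GitHub responded with HTTP ${conn.getResponseCode}")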
I've always operated under the assumption that if a committer makes a
comment on a PR, and that's not addressed, that should block the PR
from being merged (even without a specific -1). I don't know of any
cases where this has intentionally been violated, but I do think this
happens accidentally
There are ASF guidelines about voting, including code review for
patches: http://www.apache.org/foundation/voting.html
Some ASF projects require three +1 votes (on the issue, i.e. a JIRA ticket or
GitHub PR in this case) for a patch, unless it is tagged with
lazy consensus [1] of, say, 48 hours.
For
Contributing to Spark
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
needs a line or two about building and testing PySpark. A call-out to
run-tests, for example, would be helpful for new contributors to PySpark.
Nick
For the record, the triggering discussion is here
https://github.com/apache/spark/pull/1505#issuecomment-49671550. I
assumed that sbt/sbt test covers all the tests required before submitting a
patch, and it appears that it doesn’t.
On Mon, Jul 21, 2014 at 6:42 PM, Nicholas Chammas
Hi Neil, first off, I'm generally a sympathetic advocate for making changes
to Spark internals to make it easier/better/faster/more awesome.
In this case, I'm (a) not clear about what you're trying to accomplish, and
(b) a bit worried about the proposed solution.
On (a): it is stated that you
Hi,
A couple of months ago, I made a pull request to fix
https://issues.apache.org/jira/browse/SPARK-1825.
My pull request is here: https://github.com/apache/spark/pull/899
But that pull request has problems:
- It is Hadoop 2.4.0+ only. It won't compile on the versions below it.
- The
I added an automated testing section:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting
Can you take a look to see if it is what you had in mind?
On Mon, Jul 21, 2014 at 3:54 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Looks good. Does sbt/sbt test cover the same tests as /dev/run-tests?
I’m looking at step 5 under “Contributing Code”. Someone contributing to
PySpark will want to be directed to run something in addition to (or
instead of) sbt/sbt test, I believe.
Nick
On Mon, Jul 21, 2014 at 11:43 PM,
I missed that bullet point. I removed it and just pointed it toward the
instructions.
On Mon, Jul 21, 2014 at 9:20 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Looks good. Does sbt/sbt test cover the same tests as /dev/run-tests?
I’m looking at step 5 under “Contributing Code”.
Thanks for the thoughtful email, Neil and Christopher.
If I understand this correctly, it seems like the dynamic variable is just
a variant of the accumulator (a static one since it is a global object).
Accumulators are already implemented using thread-local variables under the
hood. Am I
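For reference, the existing accumulator usage being compared against looks
roughly like this (Spark 1.x API; the path is a placeholder):

    // Accumulators: updated on the executors, readable only on the driver.
    val emptyLines = sc.accumulator(0)

    val lengths = sc.textFile("hdfs:///data/input.txt")   // placeholder path
      .flatMap { line =>
        if (line.isEmpty) { emptyLines += 1; None }
        else Some(line.length)
      }

    lengths.count()                                        // run a job so updates flow back
    println(s"Empty lines seen: ${emptyLines.value}")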
If the purpose is dropping CSV headers, perhaps we don't really need a
general drop, only one that drops the first line of a file? I'd really
try hard to avoid a general drop/dropWhile because they can be expensive to
do.
Note that I think we will be adding this functionality (ignoring
It could make sense to add a skipHeader argument to SparkContext.textFile?
On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin r...@databricks.com wrote:
If the purpose is dropping CSV headers, perhaps we don't really need a
general drop, only one that drops the first line of a file? I'd
Yes, that could work, but it is not as simple as just a binary flag.
We might want to skip the first row of every file, or the header only in the
first file. The former is not really supported out of the box by the
input format, I think?
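To make the two cases concrete, here is roughly what each looks like today
(paths are placeholders; the first variant assumes partition 0 starts at the
beginning of the first file, e.g. an uncompressed text read):

    // Header only in the first file: drop the first line of partition 0.
    val noFirstHeader = sc.textFile("hdfs:///data/part-*.csv")
      .mapPartitionsWithIndex { (idx, iter) =>
        if (idx == 0) iter.drop(1) else iter
      }

    // Header in every file: wholeTextFiles gives per-file records, at the cost
    // of reading each file as a single string.
    val noHeaders = sc.wholeTextFiles("hdfs:///data/*.csv")
      .flatMap { case (_, content) => content.split("\n").drop(1) }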
On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza