Hey Nicholas,
Thanks for bringing this up. There are a few dimensions to this... one is
that it's actually procedurally difficult for us to close pull requests.
I've proposed several different solutions to ASF infra to streamline the
process, but thus far they haven't been open to any of my
What would be the timeline for the parquet caching work?
The reason I'm asking about the columnar compressed format is that
there are some problems for which Parquet is not practical.
On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
mich...@databricks.com wrote:
What is the plan for getting
Is there any initial proposal or design for the Tachyon caching that you
can share so far?
Caching parquet files in tachyon with saveAsParquetFile and then reading
them with parquetFile should already work. You can use SQL on these tables
by using registerTempTable.
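A minimal sketch of that flow, assuming a PySpark shell where `sqlContext` is already defined, `events` is an existing SchemaRDD, and the `tachyon://master:19998` URI and table names are placeholders for your deployment:

```python
# Sketch only: assumes a running Spark cluster with Tachyon configured.
# `sqlContext`, `events`, and the tachyon:// URI below are placeholders.
path = "tachyon://master:19998/warehouse/events.parquet"

# Persist the table as Parquet on Tachyon...
events.saveAsParquetFile(path)

# ...then read it back and register it so SQL queries hit the cached copy.
cached = sqlContext.parquetFile(path)
cached.registerTempTable("events_cached")
counts = sqlContext.sql("SELECT COUNT(*) FROM events_cached")
```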
Some of the general parquet
One of my colleagues has been asking me why Spark/HDFS makes no
attempt to co-locate related data blocks. He pointed to this
paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
CoHadoop research and the performance improvements it yielded for
Map/Reduce
On 08/26/2014 04:57 AM, Sean Owen wrote:
On Tue, Aug 26, 2014 at 7:02 AM, Patrick Wendell pwend...@gmail.com wrote:
Most other ASF projects I know just ignore these patches. I'd prefer if we
Agree, this drives me crazy. It kills part of JIRA's usefulness. Spark
is blessed/cursed with
It appears support for this type of control over block placement is going
out in the next version of HDFS:
https://issues.apache.org/jira/browse/HDFS-2576
On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote:
One of my colleagues has been questioning me as to why Spark/HDFS
Hi Shengzhe,
I faced the same situation.
I think Connection and ConnectionManager have some race condition issues,
and the error you mentioned may be caused by them.
Now I'm trying to resolve the issue in
https://github.com/apache/spark/pull/2019.
Please check it out.
- Kousuke
Sean Owen wrote
Stale JIRAs are a symptom, not a problem per se. I also want to see
the backlog cleared, but automatically closing doesn't help, if the
problem is too many JIRAs and not enough committer-hours to look at
them. Some noise gets closed, but some easy or important fixes may
- Original Message -
Another thing is that we should, IMO, err on the side of explicitly saying
no or not yet to patches, rather than letting them linger without
attention. We do get patches where the user is well intentioned, but it is
Completely agree. The solution is partly
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
If the former, Spark does support copartitioning.
If the latter, it's an HDFS concern outside of Spark's scope. On that note,
Hadoop does also make attempts to co-locate data, e.g., rack awareness. I'm
sure the paper makes useful
Is there any dataType auto-convert or detection feature in HiveContext?
All columns of a table are defined as string in the Hive metastore.
One column is total_price, with values like 123.45; this column is then
recognized as dataType Float in HiveContext...
Is this a feature or a bug? It
Hi Alexander,
Can you post a link to the code?
RJ
On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
Hi,
I've implemented the back-propagation algorithm using the Gradient class and a
simple update using the Updater class. Then I ran the algorithm with mllib's
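For reference, the Gradient/Updater division of labor can be sketched in plain Python. This is not Alexander's code, just a hypothetical single-sigmoid-unit stand-in: `gradient` returns (loss, gradient) for one labeled example, and `simple_update` applies a plain gradient step, mirroring the two MLlib interfaces mentioned above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(weights, x, label):
    """Gradient-style contract: given current weights and one labeled
    example, return (loss, gradient). Here: logistic loss for a single
    sigmoid unit, a minimal stand-in for the back-propagation case."""
    z = sum(w * xi for w, xi in zip(weights, x))
    p = sigmoid(z)
    eps = 1e-12
    loss = -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))
    grad = [(p - label) * xi for xi in x]
    return loss, grad

def simple_update(weights, grad, step):
    """Updater-style contract: apply one plain gradient-descent step."""
    return [w - step * g for w, g in zip(weights, grad)]
```

One step on a positive example should reduce the loss, which is the basic sanity check for the pairing.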
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote:
I'd prefer if we took the approach of politely explaining why in the
current form the patch isn't acceptable and closing it (potentially w/ tips
on how to improve it or narrow the scope).
Amen to this. Aiming for such
Oops, I tried on a managed table; the column types will not be changed,
so it is mostly due to the serde lib CSVSerDe
(https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123)
or maybe CSVReader from opencsv?...
but if the columns are defined as
Christopher, can you expand on the co-partitioning support?
We have a number of Spark SQL tables (saved in Parquet format) that all
could be considered to have a common hash key. Our analytics team wants to
do frequent joins across these different data-sets based on this key. It
makes sense
Last weekend, I started hacking on a Google App Engine app for helping with
pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png). Some of my
basic goals (not all implemented yet):
- Users sign in using GitHub and can browse a list of pull requests, including
links to associated
OK, that sounds pretty cool.
Josh,
Do you see this app as encompassing or supplanting the functionality I
described as well?
Nick
On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:
Last weekend, I started hacking on a Google App Engine app for helping
with pull request
Sure; App Engine supports cron and sending emails. We can configure the app
with Spark QA’s credentials in order to allow it to post comments on issues,
etc.
- Josh
On August 26, 2014 at 11:38:08 AM, Nicholas Chammas
(nicholas.cham...@gmail.com) wrote:
OK, that sounds pretty cool.
Josh,
I have both SPARK-2878 and SPARK-2893.
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-2878-Kryo-serialisation-with-custom-Kryo-registrator-failing-tp7719p8046.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
I downloaded the source code release for 1.0.2 from here
http://spark.apache.org/downloads.html and launched an EC2 cluster using
spark-ec2.
After the cluster finishes launching, I fire up the shell and check the
version:
scala> sc.version
res1: String = 1.0.1
The startup banner also shows the
This is a chicken-and-egg problem in some sense. We can't change the ec2
script till we have made the release and uploaded the binaries, but once
that is done, we can't update the script.
I think the model we support so far is that you can launch the latest
spark version from the master branch
Xiangrui,
I posted a note on my JIRA for MiniBatch KMeans about the same problem --
sampling running in O(n).
Can you elaborate on ways to get more efficient sampling? I think this
will be important for a variety of stochastic algorithms.
RJ
On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng
Also, another idea: many algorithms that use sampling tend to do so multiple
times. It may be beneficial to allow a transformation to a representation
that is more efficient for multiple rounds of sampling.
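To make the O(n) point concrete, here is a pure-Python sketch (hypothetical helpers, not MLlib code): Bernoulli-style sampling flips a coin for every element, so its cost does not shrink with the fraction, while an indexed representation lets you draw k elements in O(k).

```python
import random

def bernoulli_sample(data, fraction, rng):
    """Mini-batch style sampling: one coin flip per element, so the cost
    is O(n) even when the fraction (and the resulting sample) is tiny."""
    return [x for x in data if rng.random() < fraction]

def indexed_sample(data, k, rng):
    """With an indexable representation, drawing k elements (with
    replacement) costs O(k), independent of the dataset size."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(k)]
```

The trade-off is that building and keeping the indexed representation has its own cost, which is why it pays off mainly over multiple rounds of sampling.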
On Tue, Aug 26, 2014 at 4:36 PM, RJ Nowling rnowl...@gmail.com wrote:
Xiangrui,
I
Hi Xiangrui,
Thanks for the explanation, but I'm still missing something. In my experiments, if
miniBatchFraction == 1.0, no matter how the data is partitioned (2, 4, 8, 16
partitions), the algorithm executes more or less in the same time. (I have 16
Workers). Reduce from runMiniBatchSGD takes
By the way, as a reference point, I just stumbled across the Discourse
GitHub project and their list of pull requests
https://github.com/discourse/discourse/pulls looks pretty neat.
~2,200 closed PRs, 6 open. Least recently updated PR dates to 8 days ago.
Project started ~1.5 years ago.
Dunno
It seems like there are two things here:
- Co-locating blocks with the same keys to avoid network transfer.
- Leveraging partitioning information to avoid a shuffle when data is
already partitioned correctly (even if those partitions aren't yet on the
same machine).
The former seems more
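A toy pure-Python illustration of the second point (hypothetical helper functions, not Spark code): if both datasets were partitioned with the same hash partitioner, matching keys land in partitions with the same index, so a join can proceed partition-by-partition with no shuffle.

```python
def partition(records, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def partitionwise_join(parts_a, parts_b):
    """Join two datasets partition-by-partition. Because both sides used
    the same hash partitioner, matching keys are guaranteed to be in the
    same partition index, so no cross-partition transfer is needed."""
    out = []
    for pa, pb in zip(parts_a, parts_b):
        lookup = {}
        for k, v in pa:
            lookup.setdefault(k, []).append(v)
        for k, v in pb:
            for va in lookup.get(k, []):
                out.append((k, (va, v)))
    return out
```

The co-location question is then whether partition i of both datasets also sits on the same machine; same-partitioner alone only avoids the shuffle, not the remote read.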
Hi Michael,
I think once that work is in HDFS, it will be great to expose this
functionality via Spark. This is worth pursuing because it could yield
order-of-magnitude performance improvements in cases where people need to
join data.
The second item would be very interesting, could yield
Hi spark.
I've been trying to build spark, but I've been getting lots of oome
exceptions.
https://gist.github.com/jayunit100/d424b6b825ce8517d68c
For the most part, they are of the form:
java.lang.OutOfMemoryError: unable to create new native thread
I've attempted to hard code the
What is your ulimit value?
On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com
wrote:
Hi spark.
I've been trying to build spark, but I've been getting lots of oome
exceptions.
https://gist.github.com/jayunit100/d424b6b825ce8517d68c
For the most part, they are of the
This shouldn't be a chicken-and-egg problem, since the script fetches the AMI
from a known URL. Seems like an issue in publishing this release.
On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman
(shiva...@eecs.berkeley.edu) wrote:
This is a chicken and egg problem in some sense. We can't
Thanks...! Some questions below.
1) You are suggesting that maybe this OOME is a symptom/red herring, and the
true cause is that a thread can't spawn because of ulimit... If so,
possibly this could be flagged early on in the build. And -- where are so many
threads coming from that I need
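One way to inspect the limit in question programmatically, as a sketch using Python's standard library (`RLIMIT_NPROC` is the per-user process/thread limit on Linux, where native threads count against it; it is not available on every platform):

```python
import resource

def max_user_processes():
    """Return the (soft, hard) RLIMIT_NPROC values. On Linux, native
    threads count against this limit, so a low soft limit can surface as
    'java.lang.OutOfMemoryError: unable to create new native thread'."""
    return resource.getrlimit(resource.RLIMIT_NPROC)
```

From a shell, `ulimit -u` reports the same soft limit.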
Hi Jay,
The recommended way to build Spark from source is with Maven.
You should follow the steps in
https://spark.apache.org/docs/latest/building-with-maven.html and set
MAVEN_OPTS to prevent OOM build errors.
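For reference, the setting from that page looked like the line below at the time; treat the exact sizes as a starting point to tune, not a fixed prescription:

```shell
# Give Maven enough heap and permgen for the Spark build (adjust as needed).
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
```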
Thanks
On Tue, Aug 26, 2014 at 5:49 PM, jay vyas
Nicholas Chammas wrote
Dunno how many committers Discourse has, but it looks like they've managed
their PRs well. I hope we can do as well in this regard as they have.
Discourse developers appear to eat their own dog food
https://meta.discourse.org .
Improved collaboration and a shared vision
Yes, this was an oversight on my part. I have opened a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-3242
For the time being the workaround should be providing the version 1.0.2
explicitly as part of the script.
TD
On Tue, Aug 26, 2014 at 6:39 PM, Matei Zaharia