That works when it is launched from the same process - which is
unfortunately not our case :-)
- Mridul
On Sun, May 10, 2015 at 9:05 PM, Manku Timma manku.tim...@gmail.com wrote:
sc.applicationId gives the YARN app ID.
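(For the in-process case, a one-line sketch; only the driver that owns the
SparkContext can call this:)

val appId: String = sc.applicationId  // on YARN this is the YARN application ID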
On 11 May 2015 at 08:13, Mridul Muralidharan mri...@gmail.com wrote:
We had a
In 1.4, you can use the struct function to create a struct, e.g. you can
explicitly select out the version column, and then create a new struct
named settings.
The current semantics of select closely follow relational databases' SQL,
which is well understood and defined. I wouldn't add
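(A minimal sketch of what that could look like, assuming df is the original
DataFrame; struct and col are from org.apache.spark.sql.functions in 1.4, and
the column names version, settings.a and settings.b are hypothetical:)

import org.apache.spark.sql.functions.{col, struct}

// Hypothetical columns: select version out of the old struct, then
// rebuild a settings struct from the remaining fields.
val df2 = df.select(
  col("settings.version").as("version"),
  struct(col("settings.a"), col("settings.b")).as("settings"))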
This is the stack trace of the worker thread:
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
Isn't this issue something that should be improved? Based on the following
discussion, there are two places where YARN's heartbeat interval is
respected on job start-up, but do we really need to respect it on start-up?
On Fri, May 8, 2015 at 12:14 PM Taeyun Kim taeyun@innowireless.com
wrote:
Hi,
Thanks Iulian. Yeah, I was kind of anticipating I could just ignore old-deps
ultimately. However, even after doing a clean and build-all, I still get the
following:
Description  Location  Resource  Path  Type
not found: type EventBatch line 72
Oh, I see. So then try to run one build on the command line first (or try sbt
avro:generate, though I’m not sure it’s enough). I just noticed that I have
an additional source folder target/scala-2.10/src_managed/main/compiled_avro
for spark-streaming-flume-sink. I guess I built the project once and
The following worked for me as a workaround for distinct:
val pf = sqlContext.parquetFile("hdfs://file")
val distinctValuesOfColumn4 =
  pf.rdd.aggregate[scala.collection.mutable.HashSet[String]](
    new scala.collection.mutable.HashSet[String]())(
    (s, v) => s += v.getString(4),
    (s1, s2) => s1 ++= s2)
Hello,
I'm working on SPARK-7400 for DataFrame support for PortableDataStream, i.e.
the data type associated with the RDD from sc.binaryFiles(...).
Assuming a patch is available soon, what is the likelihood of inclusion in
Spark 1.4?
Thanks
Hi,
Could you suggest an alternative way of implementing distinct, e.g. via fold
or aggregate? Both SQL distinct and RDD distinct fail on my dataset due to
overflow of Spark shuffle disk. I have 7 nodes, each with 300GB dedicated to
Spark shuffle. My dataset is 2B rows, the field which I'm
Thank you for suggestions!
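(In the same vein, a sketch that builds on the aggregate workaround above:
de-duplicate within each partition before shuffling, so less data hits the
shuffle disk; pf and column index 4 are carried over from that example:)

// Local de-dup per partition, then a global distinct over the
// already-shrunken per-partition sets.
val distinctColumn4 = pf.rdd
  .mapPartitions(iter => iter.map(_.getString(4)).toSet.iterator)
  .distinct()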
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, May 08, 2015 11:10 AM
To: Will Benton
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Easy way to convert Row back to case class
In 1.4, you can do
row.getInt(colName)
In 1.5, some variant of this
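(In the meantime, a minimal sketch of doing this by ordinal with the typed
getters Row already has; the Person case class and column order are
hypothetical:)

import org.apache.spark.sql.Row

case class Person(name: String, age: Int)

// Rebuild the case class from a Row by ordinal position.
def rowToPerson(row: Row): Person =
  Person(row.getString(0), row.getInt(1))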
Hi Ted,
Yes, those two options can be useful, but in general I think the standard
to set is that tests should never fail. It's actually the worst if tests
fail sometimes but not others, because we can't reproduce them
deterministically. Using -M and -A actually tolerates flaky tests to a
certain
This is really strange.
# Spark 1.3.1
print type(results)
<class 'pyspark.sql.dataframe.DataFrame'>
a = results.take(1)[0]
print type(a)
<class 'pyspark.sql.types.Row'>
print pyspark.sql.types.Row
<class 'pyspark.sql.types.Row'>
print type(a) == pyspark.sql.types.Row
False
print
Looks like it is spending a lot of time doing hash probing. It could be any
number of the following:
1. hash probing itself is inherently expensive compared with the rest of your
workload
2. murmur3 doesn't work well with this key distribution
3. quadratic probing (triangular sequence) with a
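(On point 3, a minimal sketch of triangular-sequence probing, assuming a
power-of-two capacity so the bit mask works as a modulus:)

// Probe positions pos, pos+1, pos+1+2, pos+1+2+3, ... modulo capacity.
def probe(hash: Int, capacity: Int, isOccupied: Int => Boolean): Int = {
  val mask = capacity - 1
  var pos = hash & mask
  var delta = 1
  while (isOccupied(pos)) {
    pos = (pos + delta) & mask  // step grows by 1 each round: 1, 2, 3, ...
    delta += 1
  }
  pos
}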
Makes sense.
Having high determinism in these tests would make the Jenkins builds stable.
On Mon, May 11, 2015 at 1:08 PM, Andrew Or and...@databricks.com wrote:
Hi Ted,
Yes, those two options can be useful, but in general I think the standard
to set is that tests should never fail. It's
On 7 May 2015, at 01:41, Andrew Or and...@databricks.com wrote:
Dear all,
I'm sure you have all noticed that the Spark tests have been fairly
unstable recently. I wanted to share a tool that I use to track which tests
have been failing most often in order to prioritize fixing these flaky
In Row#equals():
while (i < len) {
  if (apply(i) != that.apply(i)) {
'!=' should be !apply(i).equals(that.apply(i)) ?
Cheers
On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
This is really strange.
# Spark 1.3.1
print type(results)
<class
Sorry, it's hard to give a definitive answer due to the lack of details (I'm
not sure what exactly is entailed in having this PortableDataStream), but the
answer is probably no if we need to change some existing code and expose a
whole new data type to users.
On Mon, May 11, 2015 at 9:02 AM, Eron
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that
it's configurable, but I think we need to provide a decent out-of-the-box
experience. For comparison, the MapReduce equivalent is 1 second.
I filed https://issues.apache.org/jira/browse/SPARK-7533 for this.
-Sandy
On
For tiny/small clusters (particularly single-tenant), you can set it to a
lower value.
But for anything reasonably large or multi-tenant, the request storm
can be bad if a large enough number of applications start aggressively
polling the RM.
That is why the interval is configurable.
- Mridul
On
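(For reference, a sketch of lowering it on a small single-tenant cluster; the
property name spark.yarn.scheduler.heartbeat.interval-ms follows the Spark 1.x
YARN docs and is worth verifying against your version:)

import org.apache.spark.SparkConf

// A more aggressive allocation heartbeat for a small, single-tenant cluster.
val conf = new SparkConf()
  .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")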
Hi Iulian,
I was able to successfully compile in Eclipse after, on the command line,
running sbt avro:generate followed by sbt clean compile (and then a full
clean compile in Eclipse). Thanks for your help!