Re: Change for submitting to yarn in 1.3.1

2015-05-11 Thread Mridul Muralidharan
That works when it is launched from the same process - which is unfortunately not our case :-) - Mridul On Sun, May 10, 2015 at 9:05 PM, Manku Timma manku.tim...@gmail.com wrote: sc.applicationId gives the YARN app id. On 11 May 2015 at 08:13, Mridul Muralidharan mri...@gmail.com wrote: We had a

Re: PySpark DataFrame: Preserving nesting when selecting a nested field

2015-05-11 Thread Reynold Xin
In 1.4, you can use the struct function to create a struct, e.g. you can explicitly select out the version column, and then create a new struct named settings. The current semantics of select closely follow a relational database's SQL, which is well understood and defined. I wouldn't add
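A minimal sketch of that approach, assuming a hypothetical DataFrame df whose settings struct has version and port fields (Scala; struct arrived in org.apache.spark.sql.functions in 1.4):

  import org.apache.spark.sql.functions.struct

  // Select the nested fields explicitly, then re-nest them under a new
  // struct named "settings" so the output keeps the original shape.
  val renested = df.select(
    struct(df("settings.version"), df("settings.port")).as("settings"))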

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Michal Haris
This is the stack trace of the worker thread:

  org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
  org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Zoltán Zvara
Isn't this issue something that should be improved? Based on the following discussion, there are two places where YARN's heartbeat interval is respected on job start-up, but do we really need to respect it on start-up? On Fri, May 8, 2015 at 12:14 PM Taeyun Kim taeyun@innowireless.com wrote:

Re: Intellij Spark Source Compilation

2015-05-11 Thread rtimp
Hi, Thanks Iulian. Yeah, I was kind of anticipating I could just ignore old-deps ultimately. However, even after doing a clean and build all, I still get the following (Eclipse problems view: Description / Location / Resource / Path / Type): not found: type EventBatch, line 72

Re: Intellij Spark Source Compilation

2015-05-11 Thread Iulian Dragoș
Oh, I see. So then try to run one build on the command line first (or try sbt avro:generate, though I’m not sure it’s enough). I just noticed that I have an additional source folder target/scala-2.10/src_managed/main/compiled_avro for spark-streaming-flume-sink. I guess I built the project once and

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
The following worked for me as a workaround for distinct:

  val pf = sqlContext.parquetFile("hdfs://file")
  val distinctValuesOfColumn4 = pf.rdd.aggregate[scala.collection.mutable.HashSet[String]](
    new scala.collection.mutable.HashSet[String]())(
    (s, v) => s += v.getString(4),
    (s1, s2) => s1 ++= s2)
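If merging every per-partition set on the driver becomes the bottleneck, a hedged variant of the same idea is treeAggregate (on RDD since Spark 1.3), which combines the sets in stages; column index 4 is assumed as above:

  // Sketch: same seqOp/combOp as the aggregate above, merged tree-style.
  val distinctValues = pf.rdd.treeAggregate(new scala.collection.mutable.HashSet[String]())(
    (s, v) => s += v.getString(4),
    (s1, s2) => s1 ++= s2,
    depth = 2)  // two rounds of partial merges before the final one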

[SPARK-7400] PortableDataStream UDT

2015-05-11 Thread Eron Wright
Hello, I'm working on SPARK-7400 for DataFrame support for PortableDataStream, i.e. the data type associated with the RDD from sc.binaryFiles(...). Assuming a patch is available soon, what is the likelihood of inclusion in Spark 1.4? Thanks

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
Hi, Could you suggest an alternative way of implementing distinct, e.g. via fold or aggregate? Both SQL distinct and RDD distinct fail on my dataset due to overflow of the Spark shuffle disk. I have 7 nodes with 300GB dedicated to Spark shuffle each. My dataset is 2B rows, the field which I'm
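One way to shrink the shuffle that distinct needs is to de-duplicate inside each partition first; a sketch, again assuming the value of interest is string column 4:

  // Local de-dup per partition, then a much smaller shuffled distinct.
  val distinctColumn4 = pf.rdd
    .mapPartitions(rows => rows.map(_.getString(4)).toSet.iterator)
    .distinct()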

RE: Easy way to convert Row back to case class

2015-05-11 Thread Ulanov, Alexander
Thank you for the suggestions! From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, May 08, 2015 11:10 AM To: Will Benton Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Easy way to convert Row back to case class In 1.4, you can do row.getInt(colName) In 1.5, some variant of this
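Until then, one pattern that already works is matching on Row directly; a sketch, with the Person case class and column layout made up for illustration:

  import org.apache.spark.sql.Row

  case class Person(name: String, age: Int)

  // Row has an extractor, so a typed pattern match rebuilds the case class.
  val people = df.rdd.map {
    case Row(name: String, age: Int) => Person(name, age)
  }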

Re: Recent Spark test failures

2015-05-11 Thread Andrew Or
Hi Ted, Yes, those two options can be useful, but in general I think the standard to set is that tests should never fail. It's actually the worst if tests fail sometimes but not others, because we can't reproduce them deterministically. Using -M and -A actually tolerates flaky tests to a certain

[PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Nicholas Chammas
This is really strange.

  # Spark 1.3.1
  print type(results)
  <class 'pyspark.sql.dataframe.DataFrame'>

  a = results.take(1)[0]
  print type(a)
  <class 'pyspark.sql.types.Row'>

  print pyspark.sql.types.Row
  <class 'pyspark.sql.types.Row'>

  print type(a) == pyspark.sql.types.Row
  False

  print

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Reynold Xin
Looks like it is spending a lot of time doing hash probing. It could be one or more of the following:
1. hash probing itself is inherently expensive compared with the rest of your workload
2. murmur3 doesn't work well with this key distribution
3. quadratic probing (triangular sequence) with a
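For reference on point 3: AppendOnlyMap advances by an increasing step on each collision, so probes land at offsets 1, 3, 6, 10, ... from the initial slot. A tiny self-contained illustration of the sequence (not Spark's actual code):

  // Triangular probing over a power-of-two table of size 16.
  val capacity = 16
  val mask = capacity - 1
  val start = 5  // stand-in for the key's hash
  val probes = Iterator.iterate((start & mask, 1)) { case (pos, step) =>
    ((pos + step) & mask, step + 1)
  }.map(_._1).take(6).toList
  println(probes)  // List(5, 6, 8, 11, 15, 4)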

Re: Recent Spark test failures

2015-05-11 Thread Ted Yu
Makes sense. Having high determinism in these tests would make the Jenkins build stable. On Mon, May 11, 2015 at 1:08 PM, Andrew Or and...@databricks.com wrote: Hi Ted, Yes, those two options can be useful, but in general I think the standard to set is that tests should never fail. It's

Re: Recent Spark test failures

2015-05-11 Thread Steve Loughran
On 7 May 2015, at 01:41, Andrew Or and...@databricks.com wrote: Dear all, I'm sure you have all noticed that the Spark tests have been fairly unstable recently. I wanted to share a tool that I use to track which tests have been failing most often in order to prioritize fixing these flaky

Re: [PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Ted Yu
In Row#equals(): while (i < len) { if (apply(i) != that.apply(i)) { '!=' should be !apply(i).equals(that.apply(i)) ? Cheers On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: This is really strange. # Spark 1.3.1 print type(results) class
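Worth noting when reading that snippet: in Scala, unlike Java, != on references delegates to equals, so the two spellings agree for non-null values. A quick check:

  case class Cell(v: Int)
  val a = Cell(1); val b = Cell(1)
  println(a != b)        // false: != goes through equals in Scala
  println(!a.equals(b))  // false as well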

Re: [SPARK-7400] PortableDataStream UDT

2015-05-11 Thread Reynold Xin
Sorry, it's hard to give a definitive answer due to the lack of details (I'm not sure what exactly is entailed in supporting this PortableDataStream), but the answer is probably no if we need to change some existing code and expose a whole new data type to users. On Mon, May 11, 2015 at 9:02 AM, Eron

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Sandy Ryza
Wow, I hadn't noticed this, but 5 seconds is really long. It's true that it's configurable, but I think we need to provide a decent out-of-the-box experience. For comparison, the MapReduce equivalent is 1 second. I filed https://issues.apache.org/jira/browse/SPARK-7533 for this. -Sandy On

Re: YARN mode startup takes too long (10+ secs)

2015-05-11 Thread Mridul Muralidharan
For tiny/small clusters (particularly single-tenant), you can set it to a lower value. But for anything reasonably large or multi-tenant, the request storm can be bad if a large enough number of applications start aggressively polling the RM. That is why the interval is configurable. - Mridul On
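For the small-cluster case, the knob under discussion can be turned down per application; a sketch, assuming the Spark 1.x property name spark.yarn.scheduler.heartbeat.interval-ms (which defaulted to 5000 at the time):

  import org.apache.spark.SparkConf

  // Poll the RM more aggressively; only sensible on small, single-tenant clusters.
  val conf = new SparkConf()
    .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")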

Re: Intellij Spark Source Compilation

2015-05-11 Thread rtimp
Hi Iulian, I was able to successfully compile in Eclipse after, on the command line, using sbt avro:generate followed by a sbt clean compile (and then a full clean compile in Eclipse). Thanks for your help!