Re: [VOTE] Release Apache Spark 1.4.1

2015-06-24 Thread Sean Owen
There are 44 issues still targeted for 1.4.1. None are Blockers; 12 are Critical. ~80% were opened and/or set by committers. Compare that with the 90 issues resolved for 1.4.1. I'm concerned that committers are targeting far more for a release, even in the short term, than can realistically go in. ...

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
Fair points; I also like simpler solutions. The overhead of a Python task could be a few milliseconds, which means we should also evaluate them in batches (one Python task per batch). Decreasing the batch size for UDFs sounds reasonable to me, together with other tricks to reduce the data in ...
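A minimal sketch of the batching idea under discussion, with hypothetical names (the real PySpark worker protocol is more involved): ship rows to the Python worker in batches and evaluate the UDF once per batch, so the per-task overhead of a few milliseconds is amortized across the whole batch.

```python
def eval_udf_in_batches(rows, udf, batch_size=100):
    # Amortize the per-task overhead across `batch_size` rows
    # instead of paying it once per row.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield [udf(r) for r in batch]
            batch = []
    if batch:  # flush the final partial batch
        yield [udf(r) for r in batch]

# Usage: each yielded list corresponds to "one Python task per batch".
for result_batch in eval_udf_in_batches(range(10), lambda x: x * x, batch_size=4):
    print(result_batch)
```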

Re: [SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-24 Thread StanZhai
Hi Michael Armbrust, I have filed an issue on JIRA for this: https://issues.apache.org/jira/browse/SPARK-8588

Re: Python UDF performance at large scale

2015-06-24 Thread Punyashloka Biswal
Hi Davies, In general, do we expect people to use CPython only for heavyweight UDFs that invoke an external library? Are there any examples of using Jython, especially performance comparisons to Java/Scala and CPython? When using Jython, do you expect the driver to send code to the executor as a ...

Loss of data due to congestion

2015-06-24 Thread anshu shukla
How does Spark guarantee that no RDD will fail or be lost during its life cycle? Is there something like ack in Storm, or does it do this by default? -- Thanks & Regards, Anshu Shukla
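For context (not from the thread): Spark's guarantee for RDDs comes from lineage rather than per-record acks. Each RDD records the transformations that produced it, so lost partitions are recomputed from their parents. A minimal PySpark illustration:

```python
from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")

# Each transformation extends the lineage graph; no per-record ack is needed.
rdd = sc.parallelize(range(1000)).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# toDebugString() shows the lineage Spark would replay if a partition were lost.
print(rdd.toDebugString())
```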

Re: Python UDF performance at large scale

2015-06-24 Thread Justin Uang
Correct, I was running with a batch size of about 100 when I did the tests, because I was worried about deadlocks. Do you have any concerns regarding the batched synchronous version of communication between the Java and Python processes, and if not, should I file a ticket and start writing it?

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
From your comment, the 2x improvement only happens when the batch size is 1, right? On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang justin.u...@gmail.com wrote: FYI, just submitted a PR to Pyrolite to remove their StopException. https://github.com/irmen/Pyrolite/pull/30 With my ...

parallelize method v.s. textFile method

2015-06-24 Thread xing
We have a large file and we used to read chunks and then use the parallelize method (distData = sc.parallelize(chunk)) and then do the map/reduce chunk by chunk. Recently we read the whole file using the textFile method and found the map/reduce job is much faster. Can anybody help us understand why? We ...

Re: Problem with version compatibility

2015-06-24 Thread jimfcarroll
Hi Sean, I'm running a Mesos cluster. My driver app is built with Maven against the 1.4.0 dependency. The Mesos slave machines have the Spark distribution installed from the distribution link. I have a hard time understanding how this isn't a standard app deployment, but maybe I'm missing ...

Re: parallelize method v.s. textFile method

2015-06-24 Thread Reynold Xin
If you read the file chunk by chunk and then call parallelize, it is read by a single thread on a single machine. On Wednesday, June 24, 2015, xing ehomec...@gmail.com wrote: We have a large file and we used to read chunks and then use parallelize method (distData = sc.parallelize(chunk)) and then ...
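For reference, a sketch of the two approaches being compared (hypothetical file paths and placeholder functions; assumes an existing SparkContext `sc`):

```python
def parse(line):      # placeholder per-line work
    return len(line)

def combine(a, b):    # placeholder reduce step
    return a + b

# parallelize pattern: the driver reads every chunk itself, single-threaded,
# then ships each chunk out to the cluster.
totals = []
with open("/data/large_file.txt") as f:        # hypothetical local path
    while True:
        chunk = f.readlines(64 * 1024 * 1024)  # ~64 MB of lines per chunk
        if not chunk:
            break
        distData = sc.parallelize(chunk)
        totals.append(distData.map(parse).reduce(combine))
total_parallelize = sum(totals)

# textFile pattern: each executor reads its own split of the file in parallel.
total_textfile = sc.textFile("hdfs:///data/large_file.txt").map(parse).reduce(combine)
```

This is Reynold's point: with parallelize, all I/O funnels through one thread on the driver before any cluster work starts.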

Re: parallelize method v.s. textFile method

2015-06-24 Thread xing
When we compared the performance, we already excluded this part of the time difference.

Re: Problem with version compatibility

2015-06-24 Thread Sean Owen
They are even different classes. Your problem isn't class-not-found, though. You're also really comparing different builds. You should not be including Spark code in your app. On Wed, Jun 24, 2015, 9:48 PM jimfcarroll jimfcarr...@gmail.com wrote: These jars are simply incompatible. You can see ...
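Not stated in the thread, but the usual Maven expression of that advice is to mark Spark as a provided dependency, so the app compiles against 1.4.0 without bundling Spark classes into its own jar (a sketch, not the poster's actual build file):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.4.0</version>
  <!-- provided: compile against Spark, but rely on the cluster's assembly at runtime -->
  <scope>provided</scope>
</dependency>
```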

Re: [GraphX] Graph 500 graph generator

2015-06-24 Thread Burak Yavuz
Hi Ryan, If you can get past the paperwork, I'm sure this can make a great Spark Package (http://spark-packages.org). People can then use it for benchmarking purposes, and I'm sure people will be looking for graph generators! Best, Burak On Wed, Jun 24, 2015 at 7:55 AM, Carr, J. Ryan ...

Re: OK to add committers active on JIRA to JIRA admin role?

2015-06-24 Thread Imran Rashid
+1 (partially b/c I would like JIRA admin myself) On Tue, Jun 23, 2015 at 3:47 AM, Sean Owen so...@cloudera.com wrote: There are some committers who are active on JIRA and sometimes need to do things that require JIRA admin access -- in particular, thinking of adding a new person as ...

Re: how can I write a language wrapper?

2015-06-24 Thread Shivaram Venkataraman
The SparkR code is in the `R` directory, i.e. https://github.com/apache/spark/tree/master/R Shivaram On Wed, Jun 24, 2015 at 8:45 AM, Vasili I. Galchin vigalc...@gmail.com wrote: Matei, Last night I downloaded the Spark bundle. In order to save me time, can you give me the name of the ...

[GraphX] Graph 500 graph generator

2015-06-24 Thread Carr, J. Ryan
Hi Spark Devs, As part of a project at work, I have written a graph generator for RMAT graphs consistent with the specifications in the Graph 500 benchmark (http://www.graph500.org/specifications). We had originally planned to use the rmatGenerator function in GraphGenerators, but found that ...

Re: how can I write a language wrapper?

2015-06-24 Thread Vasili I. Galchin
Matei, Last night I downloaded the Spark bundle. In order to save me time, can you give me the name of the SparkR example and where it is in the Spark tree? Thanks, Bill On Tuesday, June 23, 2015, Matei Zaharia matei.zaha...@gmail.com wrote: Just FYI, it would be easiest to follow ...

Re: Loss of data due to congestion

2015-06-24 Thread anshu shukla
Thanks, I am talking about streaming. On 25 Jun 2015 5:37 am, ayan guha guha.a...@gmail.com wrote: Can you elaborate a little more? Are you talking about receiver or streaming? On 24 Jun 2015 23:18, anshu shukla anshushuk...@gmail.com wrote: How does Spark guarantee that no RDD will fail or be lost ...

Error in invoking a custom StandaloneRecoveryModeFactory in java env (Spark v1.3.0)

2015-06-24 Thread Niranda Perera
Hi all, I'm trying to implement a custom StandaloneRecoveryModeFactory in the Java environment. Please find the implementation here [1]. I'm new to Scala, hence I'm trying to use the Java environment as much as possible. When I start a master with the spark.deploy.recoveryMode.factory property set to ...

Spark SQL 1.3 Exception

2015-06-24 Thread Debasish Das
Hi, I have an Impala-created table with the following IO format and serde: inputFormat:parquet.hive.DeprecatedParquetInputFormat, outputFormat:parquet.hive.DeprecatedParquetOutputFormat, serdeInfo:SerDeInfo(name:null, serializationLib:parquet.hive.serde.ParquetHiveSerDe, parameters:{}) I am trying ...
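For context (not from the thread): in Spark SQL 1.3 such a table is normally reached through a HiveContext, and a common way to sidestep serde problems with the DeprecatedParquet* classes is to read the Parquet files directly. A sketch with hypothetical table and path names:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="impala-parquet")
sqlContext = HiveContext(sc)

# Reading through the metastore exercises the serde that Impala recorded.
df = sqlContext.sql("SELECT * FROM impala_table")  # hypothetical table name

# Workaround sketch: read the Parquet files directly, bypassing the serde.
df2 = sqlContext.parquetFile("/user/hive/warehouse/impala_table")  # hypothetical path
df2.show()
```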

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-24 Thread Patrick Wendell
Hey Sean, This is being shipped now because there is a severe bug in 1.4.0 that can cause data corruption for Parquet users. There are no blockers targeted for 1.4.1 - so I don't see that JIRA is inconsistent with shipping a release now. The goal of having every single targeted JIRA cleared by ...

Problem with version compatibility

2015-06-24 Thread jimfcarroll
Hello all, I have a strange problem. I have a Mesos Spark cluster with Spark 1.4.0/Hadoop 2.4.0 installed and a client application that uses Maven to include the same versions. However, I'm getting a serialVersionUID problem on: ERROR Remoting - ...

Re: Force inner join to shuffle the smallest table

2015-06-24 Thread Stephen Carman
Have you tried shuffle compression? spark.shuffle.compress (true|false) Also, if your filesystem is capable, I've noticed file consolidation helps disk usage a bit: spark.shuffle.consolidateFiles (true|false) Steve On Jun 24, 2015, at 3:27 PM, Ulanov, Alexander ...
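Both are ordinary Spark 1.x configuration properties; a minimal sketch of turning them on (values illustrative):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-tuning")
        # Compress map outputs before they hit disk and the network.
        .set("spark.shuffle.compress", "true")
        # Merge intermediate shuffle files to cut file-handle pressure
        # (a Spark 1.x hash-shuffle option; removed in later releases).
        .set("spark.shuffle.consolidateFiles", "true"))

sc = SparkContext(conf=conf)
```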

Re: Problem with version compatibility

2015-06-24 Thread jimfcarroll
These jars are simply incompatible. You can see this by looking at that class in both the Maven repo for 1.4.0 here: http://central.maven.org/maven2/org/apache/spark/spark-core_2.10/1.4.0/spark-core_2.10-1.4.0.jar as well as the spark-assembly jar inside the .tgz file you can get from the ...