Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Emre Sevinc
You can check out the following library: https://github.com/alexholmes/json-mapreduce -- Emre Sevinç On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, Is there any way in Spark SQL to load multi-line JSON data efficiently, I think
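
A side note on what was possible at the time (a minimal sketch, not taken from the thread): SQLContext.jsonFile expected one JSON object per line, so a common stopgap for pretty-printed JSON was to read whole files and feed them to jsonRDD. This assumes each file is a single JSON document small enough to hold in memory; the path and app name below are made up.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("multiline-json"))
    val sqlContext = new SQLContext(sc)

    // wholeTextFiles yields (path, contents); keep only the file contents.
    val rawJson = sc.wholeTextFiles("hdfs:///data/multiline-json/*.json").map(_._2)
    // jsonRDD treats each element as one JSON document and infers the schema.
    val df = sqlContext.jsonRDD(rawJson)
    df.printSchema()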

spark log analyzer sample

2015-05-04 Thread anshu shukla
Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4. I am not using any Hadoop facility (not even HDFS), so why is it giving this error? -- Thanks Regards, Anshu Shukla
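
A hedged aside, not part of the original message: "Server IPC version 9 cannot communicate with client version 4" is the classic signature of a Hadoop 1.x client talking to a Hadoop 2.x server, so some part of the job is contacting a Hadoop service (for example via an hdfs:// input path) with client jars older than that service, usually the Hadoop client bundled with the Spark build. Matching the client to the cluster's Hadoop version normally clears it, for example by rebuilding Spark with something like the following (the profile and version number here are assumptions; substitute the target cluster's):

    build/mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package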

Re: spark log analyzer sample

2015-05-04 Thread Emrehan Tüzün
On Mon, May 4, 2015 at 9:50 AM, anshu shukla anshushuk...@gmail.com wrote: Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4 I am not using any hadoop facility (not even hdfs) then why it

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string, rather than a
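
To make the ambiguity concrete (an illustrative example, not from the thread), consider a document like:

    {"note": "a value containing { and } inside a string", "nested": {"ok": true}}

A reader that starts at an arbitrary byte offset and scans forward for the first { may land on the brace inside the string value and begin parsing garbage; deciding whether a given { is structural requires knowing the string/escape state, which a random split point does not provide.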

Re: Speeding up Spark build during development

2015-05-04 Thread Pramod Biligiri
Using the built-in Maven and Zinc it takes around 10 minutes for each build. Is that reasonable? My Maven opts look like this: $ echo $MAVEN_OPTS -Xmx12000m -XX:MaxPermSize=2048m I'm running it as build/mvn -DskipTests package Should I be tweaking my Zinc/Nailgun config? Pramod On Sun, May 3,

Re: Speeding up Spark build during development

2015-05-04 Thread Emre Sevinc
Hello Pramod, Do you need to build the whole project every time? Generally you don't, e.g., when I was changing some files that belong only to Spark Streaming, I was building only the streaming module (of course after having built and installed the whole project, but that was done only once), and then

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Joe Halliwell
I think Reynold’s argument shows the impossibility of the general case. But a “maximum object depth” hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I’d certainly be interested

Re: Speeding up Spark build during development

2015-05-04 Thread Pramod Biligiri
No, I just need to build one project at a time. Right now SparkSql. Pramod On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Pramod, Do you need to build the whole project every time? Generally you don't, e.g., when I was changing some files that belong only to

Re: Speeding up Spark build during development

2015-05-04 Thread Emre Sevinc
Just to give you an example: When I was trying to make a small change only to the Streaming component of Spark, first I built and installed the whole Spark project (this took about 15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files only in Streaming, I ran something like

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
I was wondering if it's possible to use existing Hive SerDes for this? On Mon, May 4, 2015 at 08:36, Joe Halliwell joe.halliw...@gmail.com wrote: I think Reynold’s argument shows the impossibility of the general case. But a “maximum object depth” hint could enable a new input format to do

Re: Speeding up Spark build during development

2015-05-04 Thread Meethu Mathew
Hi, Is it really necessary to run mvn --projects assembly/ -DskipTests install? Could you please explain why this is needed? I got the changes after running mvn --projects streaming/ -DskipTests package. Regards, Meethu On Monday 04 May 2015 02:20 PM,
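
A hedged recap of the workflow discussed in this thread (the exact commands are reconstructed from the messages above, not quoted verbatim): do one full install so every module's artifacts land in the local Maven repository, then rebuild only the module you are editing.

    build/mvn -DskipTests install                         # once, whole project
    build/mvn --projects streaming/ -DskipTests package   # after edits to streaming only

Rebuilding assembly/ is likely only needed when the changes have to be picked up by the bin/ scripts (spark-shell, spark-submit), which run against the assembly jar rather than the per-module jars.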

Update Wiki Developer instructions

2015-05-04 Thread Iulian Dragoș
I'd like to update the information about using Eclipse to develop on the Spark project found on this page: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=38572224 I don't see any way to edit this page (I created an account). Since it's a wiki, I assumed it's supposed to be

Re: Update Wiki Developer instructions

2015-05-04 Thread Sean Owen
I think it's only committers that can edit it. I suppose you can open a JIRA with a suggested text change if it is significant enough to need discussion. If it's trivial, just post it here and someone can take care of it. On Mon, May 4, 2015 at 2:32 PM, Iulian Dragoș iulian.dra...@typesafe.com

[ANNOUNCE] Spark branch-1.4

2015-05-04 Thread Patrick Wendell
Hi Devs, Just an announcement that I've cut Spark's branch 1.4 to form the basis of the 1.4 release. Other than a few stragglers, this represents the end of active feature development for Spark 1.4. Per usual, if committers are merging any features, please be in touch so I can help coordinate.

Re: LDA and PageRank Using GraphX

2015-05-04 Thread Robin East
There is an LDA example in the MLlib examples. You can run it like this: ./bin/run-example mllib.LDAExample --stopwordFile <stop words> <input documents>. <stop words> is a file of stop words, 1 on each line. <Input documents> are the text of each document, 1 document per line. To see all the options
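
For readers who prefer calling MLlib directly rather than the bundled example, here is a minimal sketch (the path, tokenization, and parameters are my own assumptions, not taken from the message; sc is an existing SparkContext):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    val docs = sc.textFile("docs.txt")  // assumed: one document per line
    // Naive vocabulary: every distinct token, no stop-word filtering.
    val vocab = docs.flatMap(_.split("\\s+")).distinct().collect().zipWithIndex.toMap
    val corpus = docs.map { line =>
      val counts = line.split("\\s+").groupBy(identity).map { case (w, ws) => (vocab(w), ws.length.toDouble) }
      Vectors.sparse(vocab.size, counts.toSeq)
    }.zipWithIndex().map { case (v, id) => (id, v) }.cache()

    val ldaModel = new LDA().setK(10).setMaxIterations(20).run(corpus)
    println(s"Learned ${ldaModel.k} topics over a ${ldaModel.vocabSize}-word vocabulary")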

Re: Update Wiki Developer instructions

2015-05-04 Thread Iulian Dragoș
Ok, here’s how it should be: - Eclipse Luna - Scala IDE 4.0 - Scala Test The easiest way is to download the Scala IDE bundle from the Scala IDE download page http://scala-ide.org/download/sdk.html. It comes pre-installed with ScalaTest. Alternatively, use the provided

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Paul Brown
It's not JSON, per se, but data formats like smile ( http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide support for markers that can't be confused with content and also provide reasonably similar ergonomics. — p...@mult.ifario.us | Multifarious, Inc. |

Re: [discuss] ending support for Java 6?

2015-05-04 Thread shane knapp
...and now the workers all have java6 installed. https://issues.apache.org/jira/browse/SPARK-1437 sadly, the built-in jenkins jdk management doesn't allow us to choose a JDK version within matrix projects... so we need to manage this stuff manually. On Sun, May 3, 2015 at 8:57 AM, shane knapp

Re: [discuss] ending support for Java 6?

2015-05-04 Thread Patrick Wendell
If we just set JAVA_HOME in dev/run-test-jenkins, I think it should work. On Mon, May 4, 2015 at 7:20 PM, shane knapp skn...@berkeley.edu wrote: ...and now the workers all have java6 installed. https://issues.apache.org/jira/browse/SPARK-1437 sadly, the built-in jenkins jdk management

Re: [discuss] ending support for Java 6?

2015-05-04 Thread shane knapp
sgtm On Mon, May 4, 2015 at 11:23 AM, Patrick Wendell pwend...@gmail.com wrote: If we just set JAVA_HOME in dev/run-test-jenkins, I think it should work. On Mon, May 4, 2015 at 7:20 PM, shane knapp skn...@berkeley.edu wrote: ...and now the workers all have java6 installed.

Task scheduling times

2015-05-04 Thread Akshat Aranya
Hi, I have been investigating scheduling delays in Spark and I found some unexplained anomalies. In my use case, I have two stages after collapsing the transformations: the first is a mapPartitions() and the second is a sortByKey(). I found that the task serialization for the first stage takes
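
For context, a minimal sketch of the job shape being described (the input and the key/value extraction are hypothetical): stage 1 is the mapPartitions, and sortByKey introduces the shuffle that forms stage 2.

    val records = sc.textFile("input.txt")  // assumed input
    val keyed = records.mapPartitions { iter =>
      iter.map { line =>
        val fields = line.split(",")
        (fields(0), fields(1))  // (key, value)
      }
    }
    val sorted = keyed.sortByKey()  // range-partitioning shuffle -> second stage
    sorted.count()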

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
Joe - I think that's a legit and useful thing to do. Do you want to give it a shot? On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell joe.halliw...@gmail.com wrote: I think Reynold’s argument shows the impossibility of the general case. But a “maximum object depth” hint could enable a new input

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Matei Zaharia
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines. Matei On May 4, 2015, at 2:28 PM, Reynold Xin r...@databricks.com wrote: Joe - I think that's a legit and useful thing to do. Do you want to give it a shot? On Mon,
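
One way the separator idea could be prototyped without a new input format (a sketch under my own assumptions, not something proposed verbatim in the thread): Hadoop's newer text input format honours the textinputformat.record.delimiter setting, so blank-line-delimited JSON documents can be split into whole records and handed to jsonRDD. Whether that setting is available depends on the Hadoop version Spark is built against; sc and sqlContext are assumed to be in scope, and the path and delimiter are illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", "\n\n")  // assumed separator: one blank line

    val docs = sc.newAPIHadoopFile(
        "hdfs:///data/blank-line-json",  // hypothetical path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString.trim)
      .filter(_.nonEmpty)

    val df = sqlContext.jsonRDD(docs)  // each record is one multi-line JSON document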

OOM error with GMMs on 4GB dataset

2015-05-04 Thread Vinay Muttineni
Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760). The spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also has 6GB. Spark Config Params: .set(spark.hadoop.validateOutputSpecs, false).set(spark.dynamicAllocation.enabled,
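
For reference, a minimal sketch of this kind of job (the path, parsing, and iteration count are assumptions on my part): with 760 features, each of the 10 covariance matrices is a 760x760 dense matrix, and those per-component matrices are aggregated back during EM, which is one place memory pressure can show up.

    import org.apache.spark.mllib.clustering.GaussianMixture
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("hdfs:///data/features.csv")  // hypothetical path
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
      .cache()

    val gmm = new GaussianMixture()
      .setK(10)
      .setMaxIterations(100)
      .run(data)

    gmm.gaussians.zipWithIndex.foreach { case (g, i) =>
      println(s"component $i: weight=${gmm.weights(i)}, first mean entries=${g.mu.toArray.take(3).mkString(",")}")
    }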

Re: Speeding up Spark build during development

2015-05-04 Thread Tathagata Das
In addition to Michael's suggestion, in my SBT workflow I also use ~ to automatically kick off build and unit test. For example, sbt/sbt ~streaming/test-only *BasicOperationsSuite* It will automatically detect any file changes in the project and kick off the compilation and testing. So my full

Thanking Test Partners

2015-05-04 Thread Patrick Wendell
Hey All, Community testing during the QA window is an important part of the release cycle in Spark. It helps us deliver higher quality releases by vetting out issues not covered by our unit tests. I was thinking that from now on, it would be nice to recognize the organizations that donate time

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
@joe, I'd be glad to help if you need. On Mon, May 4, 2015 at 20:06, Matei Zaharia matei.zaha...@gmail.com wrote: I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines. Matei On May 4, 2015, at 2:28 PM, Reynold Xin

Re: Speeding up Spark build during development

2015-05-04 Thread Michael Armbrust
FWIW... My Spark SQL development workflow is usually to run build/sbt sparkShell or build/sbt 'sql/test-only testSuiteName'. These commands start in as little as 30s on my laptop, automatically figure out which subprojects need to be rebuilt, and don't require the expensive assembly creation.

Re: [discuss] DataFrame function namespacing

2015-05-04 Thread Reynold Xin
After talking with people on this thread and offline, I've decided to go with option 1, i.e. putting everything in a single functions object. On Thu, Apr 30, 2015 at 10:04 AM, Ted Yu yuzhih...@gmail.com wrote: IMHO I would go with choice #1 Cheers On Wed, Apr 29, 2015 at 10:03 PM, Reynold
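
For readers following along: option 1 is the single org.apache.spark.sql.functions object. A minimal usage sketch (the DataFrame and column names are made up):

    import org.apache.spark.sql.functions._

    // Aggregate, string, and math helpers all come from the one functions object.
    val summary = df
      .groupBy(col("department"))
      .agg(avg(col("salary")), countDistinct(col("employeeId")), max(upper(col("name"))))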