Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException

2014-11-30 Thread Peter Vandenabeele
Hi, On Spark 1.1.0 in Standalone mode, I am following https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets to try to load a simple test JSON file (on my local filesystem, not in HDFS). The file is below and was validated with jsonlint.com: ➜ tmp cat test_4.json {foo:
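A common cause of this exception is the input format: Spark SQL's jsonFile expects one complete, self-contained JSON object per line, not a pretty-printed multi-line document. A minimal sketch against the 1.1.0 API, assuming the file from the prompt above lives at /tmp/test_4.json:

```scala
import org.apache.spark.sql.SQLContext

// sc is the SparkContext provided by spark-shell.
val sqlContext = new SQLContext(sc)

// jsonFile infers the schema; each line of the file must be a separate JSON object.
val records = sqlContext.jsonFile("file:///tmp/test_4.json")
records.printSchema()
records.registerTempTable("records")
```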

kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
Hi, In the spark docs http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node it mentions However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the
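A sketch of the output pattern the guide is describing, using the standalone Kafka producer API for illustration; the broker address, the topic name, and `transformedStream` (assumed here to be a DStream[String]) are placeholders. Because a failed batch may be re-run, the same record can be sent more than once, so the downstream consumer should tolerate duplicates:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

transformedStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One producer per partition rather than per record.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String]("topic2", r)))
    producer.close()
  }
}
```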

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly once semantics for the write to Kafka? On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote: I'd be interested in finding the answer too. Right now, I do: val kafkaOutMsgs = kafkInMessages.map(x=myFunc(x._2,someParam))

Re: Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException

2014-11-30 Thread Peter Vandenabeele
On Sun, Nov 30, 2014 at 1:10 PM, Peter Vandenabeele pe...@vandenabeele.com wrote: On Spark 1.1.0 in Standalone mode, I am following https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets to try to load a simple test JSON file (on my local filesystem, not in HDFS).

Setting network variables in spark-shell

2014-11-30 Thread Brian Dolan
Howdy Folks, What is the correct syntax in 1.0.0 to set networking variables in spark shell? Specifically, I'd like to set spark.akka.frameSize. I'm attempting this: spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g Only to get this within the session:

Re: Setting network variables in spark-shell

2014-11-30 Thread Ritesh Kumar Singh
Spark configuration settings can be found here http://spark.apache.org/docs/latest/configuration.html Hope it helps :) On Sun, Nov 30, 2014 at 9:55 PM, Brian Dolan buddha_...@yahoo.com.invalid wrote: Howdy Folks, What is the correct syntax in 1.0.0 to set networking variables in spark

Re: Setting network variables in spark-shell

2014-11-30 Thread Yanbo
Try to use spark-shell --conf spark.akka.frameSize=1 On December 1, 2014, at 12:25 AM, Brian Dolan buddha_...@yahoo.com.INVALID wrote: Howdy Folks, What is the correct syntax in 1.0.0 to set networking variables in spark shell? Specifically, I'd like to set the spark.akka.frameSize I'm
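For a standalone application (as opposed to spark-shell), the equivalent is to set the property on the SparkConf before the context is created; a minimal sketch with an illustrative value (the setting is in MB):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("frame-size-example")          // hypothetical app name
  .set("spark.akka.frameSize", "128")        // value in MB; 128 is only an example
val sc = new SparkContext(conf)
```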

Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread shahab
Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Jimmy
The latest version of MLlib has it built in no? J Sent from my iPhone On Nov 30, 2014, at 9:36 AM, shahab shahab.mok...@gmail.com wrote: Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread David Blewett
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1]. 1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400 On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote: Spark has a known problem where it will do a pass of metadata on a large number
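A hedged sketch of what reading through s3a looks like once the Hadoop 2.6.0 classes (hadoop-aws plus the AWS SDK) are on the classpath; the bucket, path, and credential values are placeholders:

```scala
// Configure s3a credentials on the Hadoop configuration used by the SparkContext.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// Then read with an s3a:// URL instead of s3n://.
val lines = sc.textFile("s3a://my-bucket/path/to/data/*.txt")
println(lines.count())
```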

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread Aaron Davidson
Note that it does not appear that s3a solves the original problems in this thread, which are on the Spark side or stem from the fact that metadata listing in S3 is slow simply because it goes over the network. On Sun, Nov 30, 2014 at 10:07 AM, David Blewett da...@dawninglight.net wrote: You might be

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Sean Owen
There is an implementation of all-pairs similarity. Have a look at the DIMSUM implementation in RowMatrix. It is an element of what you would need for such a recommender, but not the whole thing. You can also do the model-building part of an ALS-based recommender with ALS in MLlib. So, no not
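A minimal sketch of the model-building part mentioned above, using ALS from MLlib; the input file, its CSV layout, and the rank/iterations/lambda values are all illustrative assumptions:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input: lines of "userId,itemId,rating".
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(",")
  Rating(user.toInt, item.toInt, rating.toDouble)
}

val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda
val predicted = model.predict(42, 17)          // predicted rating of item 17 for user 42
```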

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Pat Ferrel
Actually the spark-itemsimilarity job and related code in the Spark module of Mahout creates all-pairs similarity too. It's designed to be used with a search engine, which provides the query part of the recommender. Integrate the two and you have a near realtime scalable item-based/cooccurrence

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread francois . garillot
How about writing to a buffer? Then you would flush the buffer to Kafka if and only if the output operation reports successful completion. In the event of a worker failure, that would not happen. — FG On Sun, Nov 30, 2014 at 2:28 PM, Josh J joshjd...@gmail.com wrote: Is there a way to do

Re: Multiple SparkContexts in same Driver JVM

2014-11-30 Thread Harihar Nahak
Try setting SparkConf.set("spark.driver.allowMultipleContexts", "true") On 30 November 2014 at 17:37, lokeshkumar [via Apache Spark User List] ml-node+s1001560n20037...@n3.nabble.com wrote: Hi Forum, Is it not possible to run multiple SparkContexts concurrently without stopping the other
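A minimal sketch of the workaround being suggested (note that the value must be passed as a string, and that the flag only relaxes the driver-side check; one SparkContext per JVM remains the recommended setup):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("multi-context-example")                   // hypothetical app name
  .set("spark.driver.allowMultipleContexts", "true")     // config values are strings
val sc = new SparkContext(conf)
```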

Re: RDDs join problem: incorrect result

2014-11-30 Thread Harihar Nahak
What do you mean by incorrect? Could you please share some examples from both input RDDs and the resultant RDD? If you get any exception, paste that too; it helps to debug where the issue is. On 27 November 2014 at 17:07, liuboya [via Apache Spark User List] ml-node+s1001560n19928...@n3.nabble.com

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2014-11-30 Thread Harihar Nahak
Hi, If you haven't figured it out so far, could you please share some details on how you are running GraphX? Also, before executing the above commands from the shell, import the required GraphX packages. On 27 November 2014 at 20:49, liuboya [via Apache Spark User List] ml-node+s1001560n19959...@n3.nabble.com wrote:

How can a function access Executor ID, Function ID and other parameters

2014-11-30 Thread Steve Lewis
I am running on a 15 node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by Mac Address but would prefer to use data known to the Spark environment - Executor ID, and Function ID show up in the Spark UI and Task ID and
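One hedged way to see which executor and partition handled each record, for diagnostic purposes; SparkEnv is a developer API, and `rdd` stands for whatever RDD is being processed:

```scala
import org.apache.spark.SparkEnv

val tagged = rdd.mapPartitionsWithIndex { (partitionId, iter) =>
  val executorId = SparkEnv.get.executorId   // ID of the executor running this partition
  iter.map(record => (executorId, partitionId, record))
}
```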

Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
How big is your input dataset? On Thursday, November 27, 2014, Praveen Sripati praveensrip...@gmail.com wrote: Hi, When I run the below program, I see two files in the HDFS because the number of partitions is 2. But, one of the files is empty. Why is it so? Is the work not distributed
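A hypothetical word count mirroring the question: saveAsTextFile writes one part-file per partition, so with two partitions a partition that receives no keys produces an empty file. Coalescing before saving avoids that:

```scala
val counts = sc.textFile("input.txt")        // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 2)                     // 2 partitions -> up to 2 part-files

counts.coalesce(1).saveAsTextFile("output")  // collapse to a single part-file
```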

Re: Edge List File in GraphX

2014-11-30 Thread Harihar Nahak
GraphLoader.edgeListFile(fileName), where each line of the file must be in the form 1\t2. Regarding the NaN result, there might be some issue with the data. I ran it for various combinations of data sets and it works perfectly fine. On 25 November 2014 at 19:23, pradhandeep [via Apache Spark User List]
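A minimal sketch of the loader being described; the path is a placeholder and each line of the file is a source and destination vertex ID separated by whitespace:

```scala
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "edges.txt")   // hypothetical path
println(graph.vertices.count())
println(graph.edges.count())
```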

Re: kafka pipeline exactly once semantics

2014-11-30 Thread Tobias Pfeiffer
Josh, On Sun, Nov 30, 2014 at 10:17 PM, Josh J joshjd...@gmail.com wrote: I would like to setup a Kafka pipeline whereby I write my data to a single topic 1, then I continue to process using spark streaming and write the transformed results to topic2, and finally I read the results from topic
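A sketch of the reading/transforming half of that pipeline against the Spark 1.1 receiver-based API (requires the spark-streaming-kafka artifact); the ZooKeeper quorum, group ID, topic name, batch interval, and the toy transformation are placeholders. Writing `transformed` back out to topic2 can follow the foreachRDD pattern sketched earlier in this digest, with the same at-least-once caveat:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

// Receiver-based stream from topic1; elements are (key, message) pairs.
val fromTopic1 = KafkaUtils.createStream(ssc, "localhost:2181", "my-group", Map("topic1" -> 1))
val transformed = fromTopic1.map { case (_, message) => message.toUpperCase }

// ... write `transformed` to topic2 inside foreachRDD ...

ssc.start()
ssc.awaitTermination()
```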

RE: Unable to compile spark 1.1.0 on windows 8.1

2014-11-30 Thread Judy Nash
I have found the following to work for me on Win 8.1: 1) Run sbt assembly. 2) Use Maven. You can find the Maven commands for your build at docs\building-spark.md -Original Message- From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in] Sent: Thursday, November 27, 2014 11:31

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Ke Wang
I meet the same problem, did you solve it ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
4096 MB (in bytes) is greater than Int.MaxValue and will overflow in Spark. Please set it to less than 4096. Best Regards, Shixiong Zhu 2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com: I meet the same problem, did you solve it ? -- View this message in context:

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Sorry, it should be not greater than 2048; 2047 is the greatest value. Best Regards, Shixiong Zhu 2014-12-01 13:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: 4096 MB (in bytes) is greater than Int.MaxValue and will overflow in Spark. Please set it to less than 4096. Best Regards, Shixiong Zhu
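The arithmetic behind the correction, assuming the configured value (in MB) is converted to a byte count held in a 32-bit Int, which is what the 2047 limit suggests:

```scala
val fits     = 2047L * 1024 * 1024   // 2146435072 -> below Int.MaxValue (2147483647)
val overflow = 2048L * 1024 * 1024   // 2147483648 -> exceeds Int.MaxValue
```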

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Created a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-4664 Best Regards, Shixiong Zhu 2014-12-01 13:22 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: Sorry, it should be not greater than 2048; 2047 is the greatest value. Best Regards, Shixiong Zhu 2014-12-01 13:20 GMT+08:00

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Judy Nash
Thanks Patrick and Cheng for the suggestions. The issue was that the Hadoop common jar had been added to the classpath. After I removed the Hadoop common jar from both master and slave, I was able to bypass the error. This was caused by a local change, so no impact on the 1.2 release. -Original Message-

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Patrick Wendell
Thanks Judy. While this is not directly caused by a Spark issue, it is likely other users will run into this. This is an unfortunate consequence of the way that we've shaded Guava in this release; we rely on bytecode shading of Hadoop itself as well. And if the user has their own Hadoop classes