Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException
Hi,

On Spark 1.1.0 in Standalone mode, I am following https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets to try to load a simple test JSON file (on my local filesystem, not in HDFS). The file is below and was validated with jsonlint.com:

➜ tmp cat test_4.json
{"foo": [{ "bar": { "id": 31, "name": "bar" } },{ "tux": { "id": 42, "name": "tux" } }] }
➜ tmp wc test_4.json
13 19 182 test_4.json

Reading the file as text works correctly (reporting a line count of 13). However, trying to read the file with:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@2c4eae94
scala> val test_as_json = sqlContext.jsonFile(test_path)
...

gets into this exception:

14/11/30 12:37:10 INFO rdd.HadoopRDD: Input split: file:/Users/peter_v/tmp/test_4.json:0+91
14/11/30 12:37:10 INFO rdd.HadoopRDD: Input split: file:/Users/peter_v/tmp/test_4.json:91+91
14/11/30 12:37:10 ERROR executor.Executor: Exception in task 1.0 in stage 1.0 (TID 3)
com.fasterxml.jackson.databind.JsonMappingException: Can not instantiate value of type [map type; class java.util.LinkedHashMap, [simple type, class java.lang.Object] -> [simple type, class java.lang.Object]] from String value; no single-String constructor/factory method
...

What looks strange to me is that the file of 182 characters seems to be split over 2 workers that take chars 0+91 and 91+91? (Is that interpretation correct?? That would yield 2 half JSON files that would each be incomplete??) I presume I am wrong here and something else is at play. Full log of the experiment below.

Also, I did see this thread (regarding blank lines that trigger a similar problem): http://find.searchhub.org/document/9aaf462d6bca027c#f294a1dd16169ba4 I validated that I have no blank lines in the input (line count = 13) and I also tried the filter function that is suggested there, but still get (presumably) the same error condition. I also did not find immediate hints that this was a resolved issue in Spark 1.1.1: http://spark.apache.org/releases/spark-release-1-1-1.html

Thanks for any hints on how to resolve this,

Peter

+++

$ bin/spark-shell
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
14/11/30 12:34:34 INFO spark.SecurityManager: Changing view acls to: peter_v,
14/11/30 12:34:34 INFO spark.SecurityManager: Changing modify acls to: peter_v,
14/11/30 12:34:34 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(peter_v, ); users with modify permissions: Set(peter_v, )
14/11/30 12:34:34 INFO spark.HttpServer: Starting HTTP Server
14/11/30 12:34:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/30 12:34:34 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:63500
14/11/30 12:34:34 INFO util.Utils: Successfully started service 'HTTP class server' on port 63500.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
14/11/30 12:34:37 INFO spark.SecurityManager: Changing view acls to: peter_v,
14/11/30 12:34:37 INFO spark.SecurityManager: Changing modify acls to: peter_v,
14/11/30 12:34:37 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(peter_v, ); users with modify permissions: Set(peter_v, )
14/11/30 12:34:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/11/30 12:34:37 INFO Remoting: Starting remoting
14/11/30 12:34:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.0.191:63503]
14/11/30 12:34:37 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.0.191:63503]
14/11/30 12:34:37 INFO util.Utils: Successfully started service 'sparkDriver' on port 63503.
14/11/30 12:34:37 INFO spark.SparkEnv: Registering MapOutputTracker
14/11/30 12:34:37 INFO spark.SparkEnv: Registering BlockManagerMaster
14/11/30 12:34:37 INFO storage.DiskBlockManager: Created local directory at /var/folders/1q/3_rsfwqd4b93sj7m6rnbzj8hgn/T/spark-local-20141130123437-43b2
14/11/30 12:34:37 INFO util.Utils: Successfully started service 'Connection manager for block manager' on port 63504.
14/11/30 12:34:37 INFO network.ConnectionManager: Bound socket to port 63504 with id = ConnectionManagerId(192.168.0.191,63504)
14/11/30 12:34:37 INFO storage.MemoryStore: MemoryStore started with capacity 265.1 MB
14/11/30 12:34:37 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/11/30 12:34:37 INFO storage.BlockManagerMasterActor: Registering block manager
kafka pipeline exactly once semantics
Hi,

In the Spark docs http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node it mentions:

However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure.

I would like to set up a Kafka pipeline whereby I write my data to a single topic (topic1), then continue to process it using Spark Streaming and write the transformed results to topic2, and finally read the results from topic2. How do I configure Spark Streaming so that I can maintain exactly-once semantics when writing to topic2?

Thanks,
Josh
Re: Publishing a transformed DStream to Kafka
Is there a way to do this that preserves exactly once semantics for the write to Kafka?

On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote:
I'd be interested in finding the answer too. Right now, I do:

val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
kafkaOutMsgs.foreachRDD((rdd, time) => {
  rdd.foreach(rec => {
    writer.output(rec)
  })
})
// where writer.output is a method that takes a string and writer is an instance of a producer class.

On Tue, Sep 2, 2014 at 10:12 AM, Massimiliano Tomassi max.toma...@gmail.com wrote:
Hello all, after having applied several transformations to a DStream I'd like to publish all the elements in all the resulting RDDs to Kafka. What would be the best way to do that? Just using DStream.foreach and then RDD.foreach? Is there any other built-in utility for this use case? Thanks a lot, Max
--
Massimiliano Tomassi
e-mail: max.toma...@gmail.com
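For reference, a minimal sketch of Tim's pattern with a per-partition producer (a sketch only, assuming the Kafka 0.8 Scala producer API; the broker list, topic name, and myFunc are placeholders):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

def publishToKafka(out: DStream[String], brokers: String, topic: String): Unit = {
  out.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One producer per partition, so the connection cost is not paid per record.
      val props = new Properties()
      props.put("metadata.broker.list", brokers)
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      val producer = new Producer[String, String](new ProducerConfig(props))
      records.foreach(rec => producer.send(new KeyedMessage[String, String](topic, rec)))
      producer.close()
    }
  }
}

Note this is still at-least-once: if a task is retried after a partial write, the same records can reach Kafka twice, which is exactly the limitation the streaming guide describes.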
Re: Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException
On Sun, Nov 30, 2014 at 1:10 PM, Peter Vandenabeele pe...@vandenabeele.com wrote:
On spark 1.1.0 in Standalone mode, I am following https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets to try to load a simple test JSON file (on my local filesystem, not in hdfs). The file is below and was validated with jsonlint.com:
➜ tmp cat test_4.json
{"foo": [{ "bar": { "id": 31, "name": "bar" } },{ "tux": { "id": 42, "name": "tux" } }] }
➜ tmp wc test_4.json
13 19 182 test_4.json

I should have read the manual better (#rtfm):

jsonFile - loads data from a directory of JSON files where _each line of the files is a JSON object_. (emphasis mine)

So, what works is:

$ cat test_6.json   # ==> Success (count() = 3)
{"foo":"bar"}
{"foo":"tux"}
{"foo":"ping"}

and what fails is:

$ cat test_7.json   # ==> Fail (JsonMappingException)
[
{"foo":"bar"},
{"foo":"tux"},
{"foo":"ping"}
]

I got confused by the fact that test_6.json is _not_ valid JSON (but works for this) and test_7.json is a _valid_ JSON array (and does not work for this). I will see if I can contribute a note to the documentation. In any case, it might be better to not _name_ the file in the tutorial examples/src/main/resources/people.json, because it is actually not a valid JSON file (but a file with a JSON object on each line). Maybe examples/src/main/resources/people.jsons is a better name (equivalent to the 'xs' convention used in Scala). Also, just showing an example of people.jsons could avoid future confusion.

Thanks,
Peter
--
Peter Vandenabeele
http://www.allthingsdata.io
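For readers hitting the same thing, a minimal sketch of the working, line-delimited case (the path and table name are only illustrative; the calls shown are the Spark 1.1 SQLContext API used above):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// test_6.json has one complete JSON object per line, so jsonFile can infer the schema.
val people = sqlContext.jsonFile("/Users/peter_v/tmp/test_6.json")
people.printSchema()   // root |-- foo: string (nullable = true)
people.count()         // 3

// The result is a SchemaRDD that can be registered and queried with SQL.
people.registerTempTable("people")
sqlContext.sql("SELECT foo FROM people").collect().foreach(println)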
Setting network variables in spark-shell
Howdy Folks,

What is the correct syntax in 1.0.0 to set networking variables in spark-shell? Specifically, I'd like to set spark.akka.frameSize. I'm attempting this:

spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g

only to get this within the session:

System.getProperty("spark.executor.memory")
res0: String = 4g
System.getProperty("spark.akka.frameSize")
res1: String = null

I don't believe I am violating protocol, but I have also posted this to SO: http://stackoverflow.com/questions/27215288/how-do-i-set-spark-akka-framesize-in-spark-shell

~~ May All Your Sequences Converge
Re: Setting network variables in spark-shell
Spark configuration settings can be found here http://spark.apache.org/docs/latest/configuration.html Hope it helps :) On Sun, Nov 30, 2014 at 9:55 PM, Brian Dolan buddha_...@yahoo.com.invalid wrote: Howdy Folks, What is the correct syntax in 1.0.0 to set networking variables in spark shell? Specifically, I'd like to set the spark.akka.frameSize I'm attempting this: spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g Only to get this within the session: System.getProperty(spark.executor.memory) res0: String = 4g System.getProperty(spark.akka.frameSize) res1: String = null I don't believe I am violating protocol, but I have also posted this to SO: http://stackoverflow.com/questions/27215288/how-do-i-set-spark-akka-framesize-in-spark-shell ~~ May All Your Sequences Converge
Re: Setting network variables in spark-shell
Try to use: spark-shell --conf spark.akka.frameSize=1

On Dec 1, 2014, at 12:25 AM, Brian Dolan buddha_...@yahoo.com.INVALID wrote:
Howdy Folks, What is the correct syntax in 1.0.0 to set networking variables in spark shell? Specifically, I'd like to set the spark.akka.frameSize I'm attempting this: spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g Only to get this within the session: System.getProperty("spark.executor.memory") res0: String = 4g System.getProperty("spark.akka.frameSize") res1: String = null I don't believe I am violating protocol, but I have also posted this to SO: http://stackoverflow.com/questions/27215288/how-do-i-set-spark-akka-framesize-in-spark-shell ~~ May All Your Sequences Converge
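As a quick sketch of how to verify the setting once the shell is up (assuming a Spark version where sc.getConf is available; the value 1 is just the frame size in MB from the original question):

// Start the shell with the setting passed via --conf rather than -D:
//   spark-shell --conf spark.akka.frameSize=1 --executor-memory 4g
// Then check it through the SparkContext's configuration instead of Java system properties:
sc.getConf.get("spark.akka.frameSize")    // res0: String = 1
sc.getConf.get("spark.executor.memory")   // res1: String = 4g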
Is there any Spark implementation for Item-based Collaborative Filtering?
Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab
Re: Is there any Spark implementation for Item-based Collaborative Filtering?
The latest version of MLlib has it built in no? J Sent from my iPhone On Nov 30, 2014, at 9:36 AM, shahab shahab.mok...@gmail.com wrote: Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].
1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400

On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote:
Spark has a known problem where it will do a pass of metadata on a large number of small files serially, in order to find the partition information prior to starting the job. This will probably not be repaired by switching the FS impl. However, you can change the FS being used like so (prior to the first usage):

sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini tomer@gmail.com wrote:
Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer

On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com wrote:
you can try creating hadoop Configuration and set s3 configuration i.e. access keys etc. Now, for reading files from s3 use newAPIHadoopFile and pass the config object here along with key, value classes.
- Lalit Yadav la...@sigmoidanalytics.com
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p19845.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
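For reference, a sketch of the configuration being discussed (the filesystem impl line is from Aaron's message; the credential property names are the standard s3n Hadoop keys, and the bucket path and keys are placeholders):

// Set the s3n filesystem implementation and credentials before the first read.
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// With the credentials in the Hadoop configuration, a plain textFile call works.
val lines = sc.textFile("s3n://your-bucket/path/to/files/*")
println(lines.count())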
Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Note that it does not appear that s3a solves the original problems in this thread, which are on the Spark side or due to the fact that metadata listing in S3 is slow simply due to going over the network. On Sun, Nov 30, 2014 at 10:07 AM, David Blewett da...@dawninglight.net wrote: You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1]. 1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400 On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote: Spark has a known problem where it will do a pass of metadata on a large number of small files serially, in order to find the partition information prior to starting the job. This will probably not be repaired by switching the FS impl. However, you can change the FS being used like so (prior to the first usage): sc.hadoopConfiguration.set(fs.s3n.impl, org.apache.hadoop.fs.s3native.NativeS3FileSystem) On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini tomer@gmail.com wrote: Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com wrote: you can try creating hadoop Configuration and set s3 configuration i.e. access keys etc. Now, for reading files from s3 use newAPIHadoopFile and pass the config object here along with key, value classes. - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p19845.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Is there any Spark implementation for Item-based Collaborative Filtering?
There is an implementation of all-pairs similarity. Have a look at the DIMSUM implementation in RowMatrix. It is an element of what you would need for such a recommender, but not the whole thing. You can also do the model-building part of an ALS-based recommender with ALS in MLlib. So, no, not directly, but there are related pieces.

On Sun, Nov 30, 2014 at 5:36 PM, shahab shahab.mok...@gmail.com wrote:
Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
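For the ALS model-building piece mentioned above, a minimal MLlib sketch (the input path, delimiter, and rank/iteration/lambda values are only illustrative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Each input line is assumed to be "userId,itemId,rating".
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(',')
  Rating(user.toInt, item.toInt, rating.toDouble)
}

// Train a matrix-factorization model: rank 10, 10 iterations, lambda 0.01.
val model = ALS.train(ratings, 10, 10, 0.01)

// Predict how user 1 would rate item 42.
println(model.predict(1, 42))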
Re: Is there any Spark implementation for Item-based Collaborative Filtering?
Actually the spark-itemsimilarity job and related code in the Spark module of Mahout creates all-pairs similarity too. It’s designed to use with a search engine, which provides the query part of the recommender. Integrate the two and you have a near realtime scalable item-based/cooccurrence collaborative filtering type recommender. On Nov 30, 2014, at 12:09 PM, Sean Owen so...@cloudera.com wrote: There is an implementation of all-pairs similarity. Have a look at the DIMSUM implementation in RowMatrix. It is an element of what you would need for such a recommender, but not the whole thing. You can also do the model-building part of an ALS-based recommender with ALS in MLlib. So, no not directly, but there are related pieces. On Sun, Nov 30, 2014 at 5:36 PM, shahab shahab.mok...@gmail.com wrote: Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Publishing a transformed DStream to Kafka
How about writing to a buffer? Then you would flush the buffer to Kafka if and only if the output operation reports successful completion. In the event of a worker failure, that would not happen.
— FG

On Sun, Nov 30, 2014 at 2:28 PM, Josh J joshjd...@gmail.com wrote:
Is there a way to do this that preserves exactly once semantics for the write to Kafka?

On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote:
I'd be interested in finding the answer too. Right now, I do:
val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
kafkaOutMsgs.foreachRDD((rdd, time) => {
  rdd.foreach(rec => {
    writer.output(rec)
  })
})
// where writer.output is a method that takes a string and writer is an instance of a producer class.

On Tue, Sep 2, 2014 at 10:12 AM, Massimiliano Tomassi max.toma...@gmail.com wrote:
Hello all, after having applied several transformations to a DStream I'd like to publish all the elements in all the resulting RDDs to Kafka. What would be the best way to do that? Just using DStream.foreach and then RDD.foreach? Is there any other built-in utility for this use case? Thanks a lot, Max
--
Massimiliano Tomassi
e-mail: max.toma...@gmail.com
Re: Multiple SparkContexts in same Driver JVM
Try setting it on the SparkConf when creating the context: new SparkConf().set("spark.driver.allowMultipleContexts", "true")

On 30 November 2014 at 17:37, lokeshkumar [via Apache Spark User List] ml-node+s1001560n20037...@n3.nabble.com wrote:
Hi Forum,
Is it not possible to run multiple SparkContexts concurrently without stopping the other one in Spark 1.3.0? I have been trying this out and getting the below error.

Caused by: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:

According to this, it's not possible to create one unless we specify the option spark.driver.allowMultipleContexts = true. So is there a way to create multiple concurrently running SparkContexts in the same JVM, or should we trigger Driver processes in different JVMs to do the same? Also, please let me know where the option 'spark.driver.allowMultipleContexts' should be set? I have set it in spark-env.sh SPARK_MASTER_OPTS but no luck.

--
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email: hna...@wynyardgroup.com | Extn: 8019
- --Harihar
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-SparkContexts-in-same-Driver-JVM-tp20037p20055.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
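A minimal sketch of where that flag goes (it only disables the one-SparkContext-per-JVM check; running multiple contexts in one driver is still not officially supported, and the app name and master here are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// The flag must be set on the SparkConf of the context being created,
// not in spark-env.sh on the master.
val conf = new SparkConf()
  .setAppName("second-context")
  .setMaster("local[*]")
  .set("spark.driver.allowMultipleContexts", "true")

val sc2 = new SparkContext(conf)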
Re: RDDs join problem: incorrect result
What do you mean by incorrect? Could you please share some examples from both the input RDDs and the resultant RDD? Also, if you get any exception, paste that too; it helps to debug where the issue is.

On 27 November 2014 at 17:07, liuboya [via Apache Spark User List] ml-node+s1001560n19928...@n3.nabble.com wrote:
Hi, I ran into a problem when doing a join of two RDDs. For example, RDDa: RDD[(String,String)] and RDDb: RDD[(String,Int)]. Then, the result RDDc: RDD[(String,(String,Int))] = RDDa.join(RDDb). But I find the results in RDDc are incorrect compared with RDDb. What's wrong in join?

--
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email: hna...@wynyardgroup.com | Extn: 8019
- --Harihar
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928p20056.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
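For comparison, a small self-contained sketch of what a pair-RDD join returns (made-up data, runnable in spark-shell):

// Two pair RDDs keyed by the same String keys.
val rddA = sc.parallelize(Seq(("a", "alpha"), ("b", "beta"), ("c", "gamma")))
val rddB = sc.parallelize(Seq(("a", 1), ("b", 2), ("d", 4)))

// join keeps only keys present in both RDDs and pairs up their values.
val rddC = rddA.join(rddB)   // RDD[(String, (String, Int))]

// Prints (a,(alpha,1)) and (b,(beta,2)) in some order; "c" and "d" are dropped
// because they appear on only one side.
rddC.collect().foreach(println)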
Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply
Hi,

If you haven't figured it out so far, could you please share some details: how are you running GraphX? Also, before executing the above commands from the shell, import the required GraphX packages.

On 27 November 2014 at 20:49, liuboya [via Apache Spark User List] ml-node+s1001560n19959...@n3.nabble.com wrote:
I'm waiting online. Who can help me, please?

--
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email: hna...@wynyardgroup.com | Extn: 8019
- --Harihar
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-java-lang-NoSuchMethodError-org-apache-spark-graphx-Graph-apply-tp19958p20057.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
How can a function access Executor ID, Function ID and other parameters
I am running on a 15 node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by MAC address, but would prefer to use data known to the Spark environment: Executor ID and Function ID show up in the Spark UI, and Task ID and Attempt ID (assuming these work like Hadoop) would be useful. Does someone know how code running in a function can access these parameters? I think I have asked this group several times about Task ID and Attempt ID without getting a reply. Incidentally, the data I collect suggests that my execution is not at all balanced.
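As a sketch of two hooks that do work inside a task (the partition index from mapPartitionsWithIndex and the executor ID from SparkEnv; the input path is a placeholder, and newer Spark versions additionally expose a TaskContext accessor that is not assumed here):

import org.apache.spark.SparkEnv

// Tag every record with the executor and partition that processed it,
// then count records per (executor, partition) to measure how balanced the work is.
val perExecutorCounts = sc.textFile("input.txt")
  .mapPartitionsWithIndex { (partitionIndex, records) =>
    val executorId = SparkEnv.get.executorId   // evaluated on the executor, not the driver
    records.map(_ => ((executorId, partitionIndex), 1L))
  }
  .reduceByKey(_ + _)

perExecutorCounts.collect().foreach(println)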
Re: reduceByKey and empty output files
How big is your input dataset?

On Thursday, November 27, 2014, Praveen Sripati praveensrip...@gmail.com wrote:
Hi, When I run the below program, I see two files in HDFS because the number of partitions is 2. But one of the files is empty. Why is it so? Is the work not distributed equally to all the tasks?

textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b).repartition(2).saveAsTextFile("hdfs://localhost:9000/user/praveen/output/")

Thanks, Praveen

--
- Rishi
Re: Edge List File in GraphX
GraphLoader.edgeListFile(fileName), where the file must be in "1\t2" form (source and destination vertex IDs separated by a tab). Regarding the NaN result, there might be some issue with the data; I ran it for various combinations of data sets and it works perfectly fine.

On 25 November 2014 at 19:23, pradhandeep [via Apache Spark User List] ml-node+s1001560n1972...@n3.nabble.com wrote:
Hi,
Is it necessary for every vertex to have an attribute when we load a graph to GraphX? In other words, if I have an edge list file containing pairs of vertices, i.e., "1 2" means that there is an edge between node 1 and node 2. Now, when I run PageRank on this data it returns a NaN. Can I use this type of data for any algorithm on GraphX?
Thank You

--
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email: hna...@wynyardgroup.com | Extn: 8019
- --Harihar
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Edge-List-File-in-GraphX-tp19724p20060.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
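A small sketch of loading such an edge list and running PageRank (the path is a placeholder; edgeListFile assigns every vertex a default attribute, so the vertices themselves need no attributes in the file):

import org.apache.spark.graphx.GraphLoader

// edges.txt: one edge per line, "srcId<TAB>dstId", e.g. "1\t2"
val graph = GraphLoader.edgeListFile(sc, "edges.txt")

// Run PageRank until the ranks change by less than the tolerance.
val ranks = graph.pageRank(0.0001).vertices

ranks.collect().foreach { case (vertexId, rank) => println(s"$vertexId -> $rank") }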
Re: kafka pipeline exactly once semantics
Josh, On Sun, Nov 30, 2014 at 10:17 PM, Josh J joshjd...@gmail.com wrote: I would like to setup a Kafka pipeline whereby I write my data to a single topic 1, then I continue to process using spark streaming and write the transformed results to topic2, and finally I read the results from topic 2. Not really related to your question, but you may also want to look into Samza http://samza.incubator.apache.org/ which was built exactly for this kind of processing. Tobias
RE: Unable to compile spark 1.1.0 on windows 8.1
I have found the following to work for me on Win 8.1:
1) run sbt assembly
2) use Maven; you can find the Maven commands for your build in docs\building-spark.md

-Original Message-
From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in]
Sent: Thursday, November 27, 2014 11:31 PM
To: u...@spark.incubator.apache.org
Subject: Unable to compile spark 1.1.0 on windows 8.1

Hi,

I am trying to compile spark 1.1.0 on windows 8.1 but I get the following exception.

[info] Compiling 3 Scala sources to D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes...
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26: object sbt is not a member of package com.typesafe
[error] import com.typesafe.sbt.pom.{PomBuild, SbtPomKeys}
[error] ^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:53: not found: type PomBuild
[error] object SparkBuild extends PomBuild {
[error] ^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:121: not found: value SbtPomKeys
[error] otherResolvers = SbtPomKeys.mvnLocalRepository(dotM2 => Seq(Resolver.file(dotM2, dotM2))),
[error] ^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:165: value projectDefinitions is not a member of AnyRef
[error] super.projectDefinitions(baseDirectory).map { x =>
[error] ^
[error] four errors found
[error] (plugins/compile:compile) Compilation failed

I have also set up Scala 2.10. Need help to resolve this issue.

Regards,
Ishwardeep

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-compile-spark-1-1-0-on-windows-8-1-tp19996.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.akka.frameSize setting problem
I'm hitting the same problem; did you solve it?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.akka.frameSize setting problem
4096 MB, converted to bytes, is greater than Int.MaxValue, so it overflows in Spark. Please set it to less than 4096.

Best Regards,
Shixiong Zhu

2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com:
I meet the same problem, did you solve it ?
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.akka.frameSize setting problem
Sorry, it should be less than 2048; 2047 is the greatest value.

Best Regards,
Shixiong Zhu

2014-12-01 13:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com:
4096MB is greater than Int.MaxValue and it will be overflow in Spark. Please set it less then 4096. Best Regards, Shixiong Zhu

2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com:
I meet the same problem, did you solve it ?
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
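A quick check of the arithmetic behind that limit (a sketch; spark.akka.frameSize is given in MB and converted to bytes internally, where it has to fit in an Int):

val maxBytes  = Int.MaxValue           // 2147483647
val frame2048 = 2048L * 1024 * 1024    // 2147483648, one past Int.MaxValue, overflows
val frame2047 = 2047L * 1024 * 1024    // 2146435072, still fits
println(frame2048 > maxBytes)          // true
println(frame2047 <= maxBytes)         // true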
Re: spark.akka.frameSize setting problem
Created a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-4664 Best Regards, Shixiong Zhu 2014-12-01 13:22 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: Sorry. Should be not greater than 2048. 2047 is the greatest value. Best Regards, Shixiong Zhu 2014-12-01 13:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: 4096MB is greater than Int.MaxValue and it will be overflow in Spark. Please set it less then 4096. Best Regards, Shixiong Zhu 2014-12-01 13:14 GMT+08:00 Ke Wang jkx...@gmail.com: I meet the same problem, did you solve it ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-akka-frameSize-setting-problem-tp3416p20063.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava
Thanks Patrick and Cheng for the suggestions. The issue was Hadoop common jar was added to a classpath. After I removed Hadoop common jar from both master and slave, I was able to bypass the error. This was caused by a local change, so no impact on the 1.2 release. -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Wednesday, November 26, 2014 8:17 AM To: Judy Nash Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Just to double check - I looked at our own assembly jar and I confirmed that our Hadoop configuration class does use the correctly shaded version of Guava. My best guess here is that somehow a separate Hadoop library is ending up on the classpath, possible because Spark put it there somehow. tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar cd org/apache/hadoop/ javap -v Configuration | grep Precond Warning: Binary file Configuration contains org.apache.hadoop.conf.Configuration #497 = Utf8 org/spark-project/guava/common/base/Preconditions #498 = Class #497 // org/spark-project/guava/common/base/Preconditions #502 = Methodref #498.#501// org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V 12: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V 50: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should not do this. - Patrick On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like a config issue. I ran spark-pi job and still failing with the same guava error Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 --executor-memory 1G --num-executors 1 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100 Had used the same build steps on spark 1.1 and had no issue. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Tuesday, November 25, 2014 5:47 PM To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I traced the code and used the following to call: Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal --hiveconf hive.server2.thrift.port=1 The issue ended up to be much more fundamental however. Spark doesn't work at all in configuration below. When open spark-shell, it fails with the same ClassNotFound error. Now I wonder if this is a windows-only issue or the hive/Hadoop configuration that is having this problem. 
From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Tuesday, November 25, 2014 1:50 AM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Oh so you're using Windows. What command are you using to start the Thrift server then? On 11/25/14 4:25 PM, Judy Nash wrote: Made progress but still blocked. After recompiling the code on cmd instead of PowerShell, now I can see all 5 classes as you mentioned. However I am still seeing the same error as before. Anything else I can check for? From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Monday, November 24, 2014 11:50 PM To: Cheng Lian; u...@spark.incubator.apache.org Subject: RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava This is what I got from jar tf: org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class com/clearspring/analytics/util/Preconditions.class parquet/Preconditions.class I seem to have the line that reported missing, but I am missing this file: com/google/inject/internal/util/$Preconditions.class Any suggestion on how to fix this? Very much appreciate the help as I am very new to Spark and open source technologies. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Monday,
Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava
Thanks Judy. While this is not directly caused by a Spark issue, it is likely other users will run into this. This is an unfortunate consequence of the way that we've shaded Guava in this release, we rely on byte code shading of Hadoop itself as well. And if the user has their own Hadoop classes present it can cause issues. On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Thanks Patrick and Cheng for the suggestions. The issue was Hadoop common jar was added to a classpath. After I removed Hadoop common jar from both master and slave, I was able to bypass the error. This was caused by a local change, so no impact on the 1.2 release. -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Wednesday, November 26, 2014 8:17 AM To: Judy Nash Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Just to double check - I looked at our own assembly jar and I confirmed that our Hadoop configuration class does use the correctly shaded version of Guava. My best guess here is that somehow a separate Hadoop library is ending up on the classpath, possible because Spark put it there somehow. tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar cd org/apache/hadoop/ javap -v Configuration | grep Precond Warning: Binary file Configuration contains org.apache.hadoop.conf.Configuration #497 = Utf8 org/spark-project/guava/common/base/Preconditions #498 = Class #497 // org/spark-project/guava/common/base/Preconditions #502 = Methodref #498.#501// org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V 12: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V 50: invokestatic #502// Method org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLjava/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should not do this. - Patrick On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like a config issue. I ran spark-pi job and still failing with the same guava error Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 --executor-memory 1G --num-executors 1 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100 Had used the same build steps on spark 1.1 and had no issue. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Tuesday, November 25, 2014 5:47 PM To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I traced the code and used the following to call: Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal --hiveconf hive.server2.thrift.port=1 The issue ended up to be much more fundamental however. 
Spark doesn't work at all in configuration below. When open spark-shell, it fails with the same ClassNotFound error. Now I wonder if this is a windows-only issue or the hive/Hadoop configuration that is having this problem. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Tuesday, November 25, 2014 1:50 AM To: Judy Nash; u...@spark.incubator.apache.org Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Oh so you're using Windows. What command are you using to start the Thrift server then? On 11/25/14 4:25 PM, Judy Nash wrote: Made progress but still blocked. After recompiling the code on cmd instead of PowerShell, now I can see all 5 classes as you mentioned. However I am still seeing the same error as before. Anything else I can check for? From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Monday, November 24, 2014 11:50 PM To: Cheng Lian; u...@spark.incubator.apache.org Subject: RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava This is what I got from jar tf: org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class