Re: StreamingContext.textFileStream issue
Try putting files with different file names into the directory and check whether the stream detects them.

On 25-Apr-2015 3:02 am, Yang Lei wrote:
> I hit the same issue: the stream behaves as if the directory has no files at all when I run the sample examples/src/main/python/streaming/hdfs_wordcount.py against a local directory and add a file to that directory. I would appreciate comments on how to resolve this.
Re: Spark RDD Lifecycle: whether RDD will be reclaimed out of scope
Hi, yes: Spark automatically evicts old cached RDDs (in LRU fashion) when it needs room for new ones. Calling unpersist() forces immediate removal instead of waiting for eviction.

On Thu, Apr 23, 2015 at 9:28 AM, Jeffery Yuan wrote:
> Dear Spark users/devs: in a method I create a new RDD and cache it. Will Spark unpersist the RDD automatically after the RDD goes out of scope? I believe so, but want to confirm with the experts. Thanks.
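The eviction policy behind this is least-recently-used. A tiny plain-Python sketch of the idea (an illustration of the policy only, not Spark code):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: the least-recently-used entry is evicted when
    capacity is exceeded, mirroring how Spark drops old cached RDD blocks
    to make room for new ones."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)  # mark as recently used
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the oldest entry

    def unpersist(self, key):
        # Explicit removal, like RDD.unpersist(): no waiting for eviction.
        self.data.pop(key, None)

cache = LRUCache(2)
cache.put("rdd1", "blocks1")
cache.put("rdd2", "blocks2")
cache.put("rdd3", "blocks3")   # evicts rdd1, the least recently used
print(sorted(cache.data))      # ['rdd2', 'rdd3']
```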
Re: MEMORY_ONLY vs MEMORY_AND_DISK
It depends. If the dataset is very large, caching with MEMORY_AND_DISK is useful, and especially so when the RDD is expensive to recompute: partitions that do not fit in memory spill to disk instead of being dropped. If the computation is cheap, MEMORY_ONLY can be used even for large datasets, since dropped partitions are simply recalculated. If the dataset is small, MEMORY_ONLY is obviously the best option.

On Thu, Mar 19, 2015 at 2:35 AM, sergunok wrote:
> Which persistence level is better if the RDD to be cached is expensive to recalculate? Am I right that it is MEMORY_AND_DISK?
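The rule of thumb above can be written out as a small decision function (a heuristic sketch of the advice, not a Spark API):

```python
def pick_storage_level(fits_in_memory, recompute_is_expensive):
    """Rule of thumb from the discussion above: spill to disk only when
    the data may not fit in memory AND recomputing dropped partitions
    would be costly; otherwise plain MEMORY_ONLY suffices."""
    if fits_in_memory:
        return "MEMORY_ONLY"
    return "MEMORY_AND_DISK" if recompute_is_expensive else "MEMORY_ONLY"

print(pick_storage_level(True, True))    # MEMORY_ONLY
print(pick_storage_level(False, True))   # MEMORY_AND_DISK
print(pick_storage_level(False, False))  # MEMORY_ONLY
```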
Re: Unable to read files In Yarn Mode of Spark Streaming ?
You can put the files from anywhere; the streaming application only picks up files whose timestamp is later than the batch start time. No, you do not need to set any properties for this to work; it is the default behaviour of Spark Streaming.

On Fri, Mar 13, 2015 at 9:38 AM, CH.KMVPRASAD wrote:
> While running the application we need to put files into the directory, correct? Can I put them directly into the directory, or do I need to move them from some other directory into the required one? From the Spark Streaming application's point of view, do we need to set any properties? Please help. Thanks, Prannoy.
Re: Unable to read files In Yarn Mode of Spark Streaming ?
Streaming only takes new files into consideration. Add the file after starting the job.

On Thu, Mar 12, 2015 at 2:26 PM, CH.KMVPRASAD wrote:
> Yes! For testing purposes I placed a single file in the specified directory.
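The selection rule can be illustrated with a plain-Python sketch (an illustration of the idea, not Spark's actual implementation): only files whose modification time is later than the job's start time are picked up, which is why pre-existing files are ignored.

```python
import os
import tempfile
import time

def new_files(directory, started_at):
    """Return files whose modification time is later than the stream's
    start time: this is why pre-existing files are ignored."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.getmtime(os.path.join(directory, name)) > started_at
    )

d = tempfile.mkdtemp()
started_at = time.time()  # pretend the streaming job started now

for name, offset in [("old.txt", -10), ("new.txt", +10)]:
    path = os.path.join(d, name)
    with open(path, "w") as f:
        f.write("data\n")
    # Set the mtime explicitly so the example is deterministic.
    os.utime(path, (started_at + offset, started_at + offset))

print(new_files(d, started_at))  # ['new.txt']
```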
Re: Unable to read files In Yarn Mode of Spark Streaming ?
Are the files already present in HDFS before you start your application?

On Thu, Mar 12, 2015 at 11:11 AM, CH.KMVPRASAD wrote:
> Hi, I successfully executed the SparkPi example in YARN mode, but I am unable to read files from HDFS in my Java streaming application. I tried the textFileStream and fileStream methods; with both I get null. Please help. Thanks for your help.
Re: Spark streaming - tracking/deleting processed files
Hi, to also process files that are already present, you can use fileStream instead of textFileStream; it has a parameter that specifies whether to look for already-present files. For deleting the processed files, one way is to get the list of files behind each batch using the foreachRDD API of the DStream returned by fileStream (or textFileStream). Suppose the DStream is:

JavaDStream<String> jpDstream = ssc.textFileStream("path/to/your/folder/");
jpDstream.print();
jpDstream.foreachRDD(
    new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd) throws Exception {
            getContentHigh(rdd, ssc);
            return null;
        }
    }
);

public static void getContentHigh(JavaRDD<String> ds, JavaStreamingContext ssc) {
    Partition[] listPartitions = ds.rdd().partitions();
    int lenPartition = listPartitions.length; // number of files the stream picked up
    for (int i = 0; i < lenPartition; i++) {
        UnionPartition upp = (UnionPartition) listPartitions[i];
        NewHadoopPartition npp = (NewHadoopPartition) upp.parentPartition();
        String fPath = npp.serializableHadoopSplit().value().toString();
        String name = fPath.split(":")[0];
        // 'name' is the path of the file picked for processing. The processing
        // logic can go inside this loop; once done, delete the file using 'name'.
    }
}

Thanks.

On Fri, Jan 30, 2015 at 11:37 PM, ganterm wrote:
> We are running a Spark Streaming job that retrieves files from a directory (using textFileStream). One concern we have is the case where the job is down but files are still being added to the directory. Once the job starts up again, those files are not picked up (since they are not new or changed while the job is running), but we would like them to be processed. Is there a solution for that? Is there a way to keep track of which files have been processed, and can we force older files to be picked up? Is there a way to delete the processed files? Thanks!
> Markus
Re: save spark streaming output to single file on hdfs
Hi, you can use the FileUtil.copyMerge API and point it at the folder where saveAsTextFile wrote the part files. Suppose your directory is /a/b/c; use:

FileUtil.copyMerge(srcFS, new Path("/a/b/c"), dstFS, new Path("/a/b/c.txt"), true /* delete the original dir */, conf, null);

Thanks.

On Tue, Jan 13, 2015 at 11:34 PM, jamborta wrote:
> Hi all, is there a way to save DStream RDDs to a single file so that another process can pick it up as a single RDD? It seems that each slice is saved to a separate folder by the saveAsTextFiles method. I'm using Spark 1.2 with PySpark. Thanks.
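The effect of copyMerge can be sketched in plain Python (a local-filesystem stand-in for the Hadoop API, for illustration only): concatenate the part files in name order into one output file, optionally deleting the source directory.

```python
import os
import shutil
import tempfile

def copy_merge(src_dir, dst_file, delete_source=False):
    """Concatenate part files (in name order, as Spark writes part-00000,
    part-00001, ...) into a single output file: a local stand-in for
    Hadoop's FileUtil.copyMerge."""
    with open(dst_file, "w") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name)) as part:
                shutil.copyfileobj(part, out)
    if delete_source:
        shutil.rmtree(src_dir)

d = tempfile.mkdtemp()
for i, text in enumerate(["first\n", "second\n"]):
    with open(os.path.join(d, "part-0000%d" % i), "w") as f:
        f.write(text)

merged = os.path.join(tempfile.mkdtemp(), "merged.txt")
copy_merge(d, merged, delete_source=True)
print(open(merged).read())  # first\nsecond\n
```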
Re: saveAsTextFile
Hi, before saving the RDD, do a collect on it and print its contents. Probably it holds a null value. Thanks.

On Sat, Jan 3, 2015 at 5:37 PM, Pankaj Narang wrote:
> If you can paste the code here I can certainly help. Also confirm the version of Spark you are using. Regards, Pankaj, Infoshore Software, India.
Re: Inserting an element in RDD[String]
Hi, you can put the schema line into another RDD and then take the union of the two RDDs:

List<String> schemaList = new ArrayList<String>();
schemaList.add("xyz"); // where "xyz" is your schema line
JavaRDD<String> schemaRDD = sc.parallelize(schemaList); // sc is your SparkContext
JavaRDD<String> newRDD = schemaRDD.union(yourRDD); // yourRDD is the existing RDD you want the schema line prepended to

The code is in Java; you can translate it to Scala. Thanks.

On Thu, Jan 15, 2015 at 7:46 PM, Hafiz Mujadid wrote:
> Hi experts! I have an RDD[String] and I want to add a schema line at the beginning of this RDD. I know RDDs are immutable, so is there a way to get a new RDD with one schema line followed by the contents of the previous RDD? Thanks.
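This works because union concatenates the left operand's partitions before the right's, so the schema line comes first in the result. A plain-Python sketch of that ordering (lists standing in for RDD partitions, not pyspark code):

```python
# Lists stand in for RDD partition contents; "union" places the left
# operand's partitions before the right's, so the header line leads.
schema_partitions = [["id,name,age"]]            # the one-line schema RDD
data_partitions = [["1,ann,30"], ["2,bob,25"]]   # the existing RDD

unioned = schema_partitions + data_partitions
flattened = [line for part in unioned for line in part]
print(flattened[0])  # id,name,age
```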
Re: java.io.IOException: Mkdirs failed to create file:/some/path/myapp.csv while using rdd.saveAsTextFile(fileAddress) Spark
What path are you giving to saveAsTextFile? Can you show the whole line?

On Tue, Jan 13, 2015 at 11:42 AM, shekhar wrote:
> I am still having this issue with the rdd.saveAsTextFile() method. Thanks, Shekhar Reddy.
Re: Failed to save RDD as text file to local file system
Hi, could you try one thing: make a directory anywhere outside the cloudera home, say testWrite, and try the same write:

r.saveAsTextFile("/home/testWrite/")

I think the cloudera tmp folder does not grant write permission to users other than the cloudera manager itself. Thanks.

On Mon, Jan 12, 2015 at 9:51 PM, NingjunWang wrote:
> Prannoy, I tried r.saveAsTextFile("home/cloudera/tmp/out1"); it returns without error, but where was it saved? The folder /home/cloudera/tmp/out1 was not created. I also tried the following:
>
> cd /home/cloudera/tmp/
> spark-shell
> scala> val r = sc.parallelize(Array("a", "b", "c"))
> scala> r.saveAsTextFile("out1")
>
> It does not return an error, but there is still no out1 folder created under /home/cloudera/tmp/. When I give an absolute path, I get an error:
>
> scala> r.saveAsTextFile("/home/cloudera/tmp/out1")
> org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
>   at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)
>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6286)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6220)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4030)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:787)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
> Very frustrated. Please advise. Regards, Ningjun Wang, Consulting Software Engineer, LexisNexis, 121 Chanlon Road, New Providence, NJ 07974-1541
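The symptom above (no local folder, then a permission error on inode "/") matches how Hadoop resolves paths: a path without a scheme goes to the configured default filesystem (fs.defaultFS, typically HDFS on a cluster), so "/home/cloudera/tmp/out1" targets HDFS, not the local disk, while a "file://" prefix forces the local filesystem. A small Python sketch of that resolution rule (an illustration, not Hadoop code):

```python
def resolve_filesystem(path, default_fs="hdfs"):
    """Mimic Hadoop path resolution: an explicit scheme wins; otherwise
    the path is resolved against the configured default filesystem."""
    if "://" in path:
        return path.split("://", 1)[0]
    return default_fs

print(resolve_filesystem("/home/cloudera/tmp/out1"))         # hdfs
print(resolve_filesystem("file:///home/cloudera/tmp/out1"))  # file
```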
Re: How to set UI port #?
Set the port using

spconf.set("spark.ui.port", "<port>")

where <port> is any free port and spconf is your Spark configuration object. To prevent the UI from launching altogether, you can set spark.ui.enabled to false.

On Sun, Jan 11, 2015 at 2:08 PM, YaoPau wrote:
> I have multiple Spark Streaming jobs running all day, so when I run my hourly batch job I always get java.net.BindException: Address already in use, which starts at 4040 and then tries 4041, 4042, 4043 before settling at 4044. That slows down my hourly job, and time is critical. Is there a way I can set it to 4044 by default, or prevent the UI from launching altogether? Jon
Re: Failed to save RDD as text file to local file system
Have you tried simply giving the path where you want to save the file? For instance, in your case just do:

r.saveAsTextFile("home/cloudera/tmp/out1")

Don't use the "file:" prefix. This will create a folder named out1. saveAsTextFile always writes by making a directory; it does not write the data into a single file. In case you need a single file, you can use the copyMerge API in FileUtil:

FileUtil.copyMerge(fs, new Path("home/cloudera/tmp/out1"), fs, new Path("home/cloudera/tmp/out2"), true, conf, null);

Now out2 will be a single file containing your data; fs is the FileSystem of your local filesystem. Thanks.

On Sat, Jan 10, 2015 at 1:36 AM, NingjunWang wrote:
> No, do you have any idea? Regards, Ningjun Wang, Consulting Software Engineer, LexisNexis, 121 Chanlon Road, New Providence, NJ 07974-1541
>
> On Fri, Jan 9, 2015 at 2:56 PM, firemonk9 wrote:
> > Have you found any resolution for this issue?
Re: [question]Where can I get the log file
Hi, you can access your logs in the /spark_home_directory/logs/ directory. Cat the files there and you will get the logs. Thanks.

On Thu, Dec 4, 2014 at 2:27 PM, FFeng wrote:
> I have written data to the Spark log. I can get it through the web interface, but I really want to know whether I can get these log files on my node. Where are they? Thanks.
Re: How can I read an avro file in HDFS in Java?
Hi, try using

sc.newAPIHadoopFile("hdfs path to your file", AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class, conf)

where conf is your Hadoop Configuration. You will get the Avro-related classes by importing org.apache.avro.*. Thanks.

On Tue, Dec 2, 2014 at 9:23 PM, leaviva wrote:
> How can I read an Avro file from HDFS? I tried to use newAPIHadoopFile, but I don't know how to use it.
Re: object xxx is not a member of package com
Hi, Add the jars to the external libraries of your project. Right click on the package or class - Build Path - Configure Build Path - Java Build Path - select the Libraries tab - Add External JARs - browse to the jar containing com.xxx.yyy.zzz - OK. Clean and build your project; most probably you will then be able to pull the classes from the com.xxx.yyy.zzz package. Thanks. On Wed, Dec 3, 2014 at 4:29 AM, flyson [via Apache Spark User List] wrote: Hello everyone, Could anybody tell me how to import and call 3rd-party Java classes from inside Spark? Here's my case: I have a jar file (the package layout is com.xxx.yyy.zzz) which contains some Java classes, and I need to call some of them in Spark code. I put the statement import com.xxx.yyy.zzz._ at the top of the affected Spark file, set the location of the jar file in the CLASSPATH environment variable, and used ./sbt/sbt assembly to build the project. As a result, I got an error saying object xxx is not a member of package com. I thought that this could be related to the dependencies, but couldn't figure it out. Any suggestion/solution from you would be appreciated! Thanks!
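Since the original build used sbt assembly (which ignores the shell CLASSPATH), the sbt-side equivalent is to make the jar an unmanaged dependency. A sketch for build.sbt — the jar path is a placeholder:

```scala
// build.sbt — sbt automatically treats any jar dropped into the project's
// lib/ directory as an unmanaged dependency; alternatively point at the jar:
unmanagedJars in Compile += file("/path/to/your-classes.jar")
```

After that, sbt assembly bundles the classes and import com.xxx.yyy.zzz._ should resolve.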
Re: How to use FlumeInputDStream in spark cluster?
Hi, A BindException comes when two processes try to use the same port. In your Spark configuration just set spark.ui.port to some other port, say 12345. The BindException will not break your job in either case; to get rid of it, just change the port number. Thanks. On Fri, Nov 28, 2014 at 1:30 PM, pamtang [via Apache Spark User List] wrote: I'm seeing the same issue on CDH 5.2 with Spark 1.1. FlumeEventCount works fine on a standalone cluster but throws a BindException in YARN mode. Is there a solution to this problem, or will FlumeInputDStream not work in a cluster environment?
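A sketch of that setting (the app name and port number are just examples; any free port works):

```scala
import org.apache.spark.SparkConf

// Move the web UI off the default port (4040) to avoid the bind conflict.
val conf = new SparkConf()
  .setAppName("FlumeEventCount")
  .set("spark.ui.port", "12345")
```

When the chosen port is already taken, Spark also probes successive ports on its own (up to spark.port.maxRetries attempts) before failing.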
Re: read both local path and HDFS path
Hi, The configuration you provide is just for accessing HDFS when you give an HDFS path. When you provide a path with the HDFS nameservice, hmaster155:9000 in your case, Spark goes into HDFS to look for the file. To read a local file, just give the local path of the file: go to the file on the local machine and run pwd; that gives you the full path. Pass that as the path and you will be fine. You can also make the scheme explicit: file:///path/to/file for the local filesystem and hdfs://host:port/path for HDFS. Thanks. On Fri, Nov 28, 2014 at 8:57 AM, tuyuri [via Apache Spark User List] wrote: I have set up a Spark cluster configured with HDFS, and I know that by default a path like /ad-cpc/2014-11-28/ will be read by Spark as hdfs://hmaster155:9000/ad-cpc/2014-11-28/. Sometimes I wonder how I can force Spark to read a file locally without reconfiguring my cluster (to not use HDFS). Please help me!
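The scheme rule can be sketched without Spark: a path with an explicit scheme is used as-is, and a bare absolute path falls back to the configured default filesystem. A hypothetical helper (not a Spark API), in pure Scala:

```scala
// Mirror how Hadoop resolves scheme-less paths against the default filesystem.
def resolvePath(path: String, defaultFs: String = "hdfs://hmaster155:9000"): String =
  if (path.startsWith("hdfs://") || path.startsWith("file://")) path
  else if (path.startsWith("/")) defaultFs + path
  else sys.error(s"expected an absolute path, got: $path")

println(resolvePath("/ad-cpc/2014-11-28/"))   // resolved against HDFS
println(resolvePath("file:///tmp/local.txt")) // left alone: local filesystem
```

Note that with file:// paths on a real cluster, the file must exist at the same path on every worker node.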
Re: Persist streams to text files
Hi, You can use the FileUtil.copyMerge API and point it at the folder where saveAsTextFile wrote the part files. Suppose your directory is /a/b/c/; use FileUtil.copyMerge(srcFS, new Path("/a/b/c"), dstFS, new Path("/a/b/c.txt"), true /* delete the source dir */, conf, null). Thanks. On Fri, Nov 21, 2014 at 11:31 AM, Jishnu Prathap [via Apache Spark User List] wrote: Hi, I am also having a similar problem.. any fix suggested? Originally posted by GaganBM: Hi, I am trying to persist the DStreams to text files. When I use the built-in API saveAsTextFiles as: stream.saveAsTextFiles(resultDirectory) this creates a number of subdirectories, one for each batch, and within each subdirectory it creates a bunch of text files for each RDD (I assume). I am wondering if I can have a single text file for each batch. Is there any API for that? Or else, a single output file for the entire stream? I tried to manually write each RDD in the stream to a text file as: stream.foreachRDD(rdd => { rdd.foreach(element => { fileWriter.write(element) }) }) where fileWriter simply makes use of a Java BufferedWriter to write strings to a file. However, this fails with the exception: DStreamCheckpointData.writeObject used java.io.BufferedWriter java.io.NotSerializableException: java.io.BufferedWriter at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) ... Any help on how to proceed with this?
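What copyMerge does can be demonstrated without Hadoop: concatenate a directory's part-* files, in name order, into one output file. A local-filesystem sketch in plain Scala (file names and contents are made up):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Merge part-* files from dir into a single file, mimicking FileUtil.copyMerge
// on the local filesystem (Hadoop's version can also delete the source dir).
def mergeParts(dir: Path, out: Path): Unit = {
  val parts = Files.list(dir).iterator().asScala.toSeq
    .filter(_.getFileName.toString.startsWith("part-"))
    .sortBy(_.getFileName.toString)
  Files.write(out, parts.flatMap(p => Files.readAllBytes(p).toSeq).toArray)
}

val dir = Files.createTempDirectory("batch-0")
Files.write(dir.resolve("part-00000"), "hello\n".getBytes)
Files.write(dir.resolve("part-00001"), "world\n".getBytes)
val merged = dir.resolve("merged.txt")
mergeParts(dir, merged)
print(new String(Files.readAllBytes(merged)))  // the two parts, concatenated in order
```

(The NotSerializableException in the quoted code is a separate problem: the BufferedWriter is captured by the closure and shipped to workers. Creating the writer inside foreachPartition, or collecting the data to the driver first, avoids it.)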
Re: Cores on Master
Hi, You can also set the cores in the Spark application itself: http://spark.apache.org/docs/1.0.1/spark-standalone.html On Wed, Nov 19, 2014 at 6:11 AM, Pat Ferrel-2 [via Apache Spark User List] wrote: OK, hacking start-slave.sh did it. On Nov 18, 2014, at 4:12 PM, Pat Ferrel wrote: This seems to work only on a 'worker', not the master? So I'm back to having no way to control cores on the master? On Nov 18, 2014, at 3:24 PM, Pat Ferrel wrote: Looks like I can do this by not using start-all.sh but starting each worker separately, passing in '--cores n'? No config/env way? On Nov 18, 2014, at 3:14 PM, Pat Ferrel wrote: I see the default and max cores settings, but these seem to control total cores per cluster. My cobbled-together home cluster needs the master to not use all its cores or it may lock up (it does other things). Is there a way to control the max cores used on a particular cluster machine in standalone mode?
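The application-side knob looks like this (values are examples). Note it caps the application's total across the cluster; the per-machine limit still comes from SPARK_WORKER_CORES or the --cores flag when starting each worker:

```scala
import org.apache.spark.SparkConf

// Standalone mode: cap the total cores this application may claim cluster-wide.
val conf = new SparkConf()
  .setAppName("CoreLimitedApp")
  .set("spark.cores.max", "4")
```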
Re: Slow performance in spark streaming
Hi, Spark in local mode runs slower than on a cluster. Cluster machines usually have a higher configuration, and the tasks are distributed across workers in order to get a faster result, so you will always find a difference in speed between local mode and a cluster. Try running the same job on a cluster and evaluate the speed there. Thanks. On Thu, Nov 20, 2014 at 6:52 PM, Blackeye [via Apache Spark User List] wrote: I am using Spark Streaming 1.1.0 locally (not in a cluster). I created a simple app that parses the data (about 10,000 entries), stores it in a stream and then makes some transformations on it. Here is the code:

def main(args: Array[String]) {
  val master = "local[8]"
  val conf = new SparkConf().setAppName("Tester").setMaster(master)
  val sc = new StreamingContext(conf, Milliseconds(110000))
  val stream = sc.receiverStream(new MyReceiver("localhost", /* port elided in the original post */ ???))
  val parsedStream = parse(stream)
  parsedStream.foreachRDD(rdd => println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))
  val result1 = parsedStream.filter(entry => entry.symbol.contains("walking") && entry.symbol.contains("true") && entry.symbol.contains("id0")).map(_.time)
  val result2 = parsedStream.filter(entry => entry.symbol == "disappear" && entry.symbol.contains("id0")).map(_.time)
  val result3 = result1.transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))
  result3.foreachRDD(rdd => println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))
  sc.start()
  sc.awaitTermination()
}

def parse(stream: DStream[String]) = {
  stream.flatMap { line =>
    val entries = line.split("assert").filter(entry => !entry.isEmpty)
    entries.map { tuple =>
      val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|.]\s*""".r
      tuple match {
        case pattern(symbol, time) => new Data(symbol, time.toInt)
      }
    }
  }
}

case class Data(symbol: String, time: Int)

I have a batch duration of 110,000 milliseconds in order to receive all the data in one batch. I believed that, even locally, Spark is very fast. In this case, it takes about 3.5 s to execute the rule (between RULE STARTS and RULE ENDS). Am I doing something wrong, or is this the expected time? Any advice?
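The parse step in that code can be exercised on its own, without Spark, to separate parsing cost from Spark's scheduling overhead. A standalone sketch using a slightly simplified form of the poster's regex (the sample entry string is invented):

```scala
// Entry shape from the post: "( <symbol>, <time>))" terminated by ',' or '.'
val pattern = """\s*[(](.+)[,]\s*([0-9]+)\s*[)]\s*[)]\s*[,.]\s*""".r

final case class Data(symbol: String, time: Int)

// Pattern matching against a Regex requires the whole string to match.
def parseEntry(tuple: String): Option[Data] = tuple match {
  case pattern(symbol, time) => Some(Data(symbol.trim, time.toInt))
  case _                     => None
}

println(parseEntry("( walking(id0, true), 20)),"))  // parses symbol and time
println(parseEntry("garbage"))                       // None
```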
Re: Parsing a large XML file using Spark
Hi, Parallel processing of XML files can be an issue because of the tags in the file. The file has to be intact while parsing, since the parser matches each start entity against its end entity; if the file is distributed in parts to the workers, a worker may or may not find both the start and the end tag within its own part, which will give an exception. Thanks. On Wed, Nov 19, 2014 at 6:26 AM, ssimanta [via Apache Spark User List] wrote: If there is one big XML file (e.g., the Wikipedia dump, 44 GB, or the larger dump with all revision information) stored in HDFS, is it possible to parse it in parallel/faster using Spark? Or do we have to use something like a pull parser or iteratee? My current solution is to read the single XML file in a first pass, write it to HDFS as smaller files, and then read the small files in parallel on the Spark workers. Thanks, -Soumya
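A common workaround is to split on a fixed record tag rather than parse the document as a whole, so each worker only needs complete records. A pure-Scala sketch of the record-extraction idea (the <page> tag and sample document are invented):

```scala
// Pull complete <tag>…</tag> records out of a chunk of XML text.
// Splitting on a known record tag is what lets record-oriented input
// formats process one big XML file in parallel: each worker keeps only
// the records that lie fully inside its slice.
def extractRecords(xml: String, tag: String): List[String] = {
  val (open, close) = (s"<$tag>", s"</$tag>")
  Iterator.unfold(0) { from =>
    val start = xml.indexOf(open, from)
    val end   = if (start < 0) -1 else xml.indexOf(close, start)
    if (start < 0 || end < 0) None
    else Some((xml.substring(start, end + close.length), end + close.length))
  }.toList
}

val doc = "<wiki><page>a</page><page>b</page></wiki>"
println(extractRecords(doc, "page"))  // List(<page>a</page>, <page>b</page>)
```

Records that straddle a split boundary still need special handling (reading past the end of the slice), which is what Hadoop-style XML input formats take care of.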
Re: Execute Spark programs from local machine on Yarn-hadoop cluster
Hi Naveen, I don't think this is possible. If you are setting the master with your cluster details, you cannot execute a job from your local machine; you have to launch the jobs from inside your YARN cluster so that the SparkConf is able to connect with all the provided details. If this is not the case, give a detailed explanation of what exactly you are trying to do :) Thanks. On Fri, Nov 21, 2014 at 8:11 PM, Naveen Kumar Pokala [via Apache Spark User List] wrote: Hi, I am executing my Spark jobs on a YARN cluster by forming the conf object in the following way: SparkConf conf = new SparkConf().setAppName("NewJob").setMaster("yarn-cluster"); Now I want to execute Spark jobs from my local machine. How do I do that? What I mean is: is there a way to give the IP address, port and all the other details needed to connect to a master (YARN) on some other network from my local Spark program? -Naveen
Re: Spark Streaming Application Got killed after 2 hours
Hi Saj, What is the size of the input data that you are putting on the stream? Have you tried running the same application with a different set of data? It's weird that the streaming stops after exactly 2 hours. Try running the application with data of different sizes to see whether it has something to do with a memory issue. Can you also provide a detailed error log? Thanks. On Sun, Nov 16, 2014 at 11:49 AM, SAJ [via Apache Spark User List] wrote: Hi All, I am trying to run a Spark Streaming application 24/7, but after exactly 2 hours it gets killed. I tried again, and again it was killed after 2 hours. Following are the error logs from the worker: 14/11/15 13:53:24 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(ip-x,38863) 14/11/15 13:53:24 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(ip-x,38863) 14/11/15 13:53:24 ERROR network.SendingConnection: Exception while reading SendingConnection to ConnectionManagerI Has anybody faced this same issue? Thanks Regards, SAJ
Re: saveAsTextFile error
Hi Niko, Have you tried running it while keeping the wordCounts.print()? Possibly the import of the package org.apache.spark.streaming._ is missing, so during sbt package it is unable to locate the saveAsTextFile API. Go to https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala to check whether all the needed packages are there. Thanks.