Re: StreamingContext.textFileStream issue

2015-04-24 Thread Prannoy
Try putting in files with different file names and see if the stream is able to
detect them.
On 25-Apr-2015 3:02 am, Yang Lei [via Apache Spark User List] wrote:

 I hit the same issue: it is as if the directory has no files at all when
 running the sample examples/src/main/python/streaming/hdfs_wordcount.py
 with a local directory and adding a file into that directory. I would
 appreciate comments on how to resolve this.


Re: Spark RDD Lifecycle: whether RDD will be reclaimed out of scope

2015-04-23 Thread Prannoy
Hi,

Yes, Spark automatically evicts old cached RDDs (in LRU fashion) when it needs
the memory for new ones. Calling unpersist() forces it to remove them right away.
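
A minimal Scala sketch of that behaviour (hypothetical data, not from the original thread):

val nums = sc.parallelize(1 to 1000000).cache()
println(nums.sum())   // the first action materialises and caches the RDD
nums.unpersist()      // explicitly frees the cached blocks right away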

On Thu, Apr 23, 2015 at 9:28 AM, Jeffery [via Apache Spark User List] wrote:

 Hi, Dear Spark Users/Devs:

 In a method, I create a new RDD and cache it. Will Spark unpersist
 the RDD automatically after the RDD goes out of scope?

 I am thinking so, but want to make sure with you, the experts :)

 Thanks,
 Jeffery Yuan


Re: MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread Prannoy
It depends. If the data size on which the calculation is to be done is very
large, then caching it with MEMORY_AND_DISK is useful. Even in that case,
MEMORY_AND_DISK pays off mainly when the computation on the RDD is expensive.
If the computation is very cheap, then even for large data sets MEMORY_ONLY can
be used. But if the data size is small, then MEMORY_ONLY is obviously the best
option.
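
A minimal Scala sketch of the two choices (paths and the transform are hypothetical):

import org.apache.spark.storage.StorageLevel

def expensiveTransform(line: String): String = line.reverse  // stand-in for a costly computation

// small or cheap-to-recompute data: keep it purely in memory
val small = sc.textFile("hdfs:///data/small").persist(StorageLevel.MEMORY_ONLY)

// large data with an expensive per-record computation: spill what does not fit to disk
val heavy = sc.textFile("hdfs:///data/huge")
  .map(expensiveTransform)
  .persist(StorageLevel.MEMORY_AND_DISK)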

On Thu, Mar 19, 2015 at 2:35 AM, sergunok [via Apache Spark User List] wrote:

 What persistence level is better if the RDD to be cached is heavy to
 recalculate?
 Am I right that it is MEMORY_AND_DISK?


Re: Unable to read files In Yarn Mode of Spark Streaming ?

2015-03-13 Thread Prannoy
You can put the files from anywhere; the streaming application only picks up
files whose timestamp is later than the time the job was started.

No, you don't need to set any properties for this to work. This is the
default behaviour of Spark Streaming.
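
A minimal Scala sketch of that default behaviour (the directory is hypothetical; nothing
extra is configured):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DirWatcher")
val ssc = new StreamingContext(conf, Seconds(30))

// watch a directory; only files that appear after the job starts are picked up
val lines = ssc.textFileStream("hdfs:///user/stream/input")
lines.count().print()

ssc.start()
// from this point on, copy or move new files into hdfs:///user/stream/input
// and they will show up in the next batch
ssc.awaitTermination()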



On Fri, Mar 13, 2015 at 9:38 AM, CH.KMVPRASAD [via Apache Spark User List] wrote:

 While running the application we need to put files into the directory,
 correct? Then can I put them directly into the directory, or do I need to
 move them from some other directory to the required directory?

 From the Spark Streaming application's point of view, do we need to set any
 properties? Please help me.


 Thanks Prannoy..



Re: Unable to read files In Yarn Mode of Spark Streaming ?

2015-03-12 Thread Prannoy
Streaming takes only new files into consideration. Add the file after
starting the job.

On Thu, Mar 12, 2015 at 2:26 PM, CH.KMVPRASAD [via Apache Spark User List] wrote:

 Yes! For testing purposes I put a single file in the specified directory.




Re: Unable to read files In Yarn Mode of Spark Streaming ?

2015-03-12 Thread Prannoy
Are the files already present in HDFS before you start your application?

On Thu, Mar 12, 2015 at 11:11 AM, CH.KMVPRASAD [via Apache Spark User List] wrote:

 Hi, I successfully executed the SparkPi example in YARN mode, but I am not
 able to read files from HDFS in my streaming application using Java.
 I tried the 'textFileStream' and 'fileStream' methods; for both methods I am
 getting null.

 Please help me, and thanks for your help.


Re: Spark streaming - tracking/deleting processed files

2015-02-03 Thread Prannoy
Hi,

To also process files that are already present in the directory, you can use
fileStream instead of textFileStream. It has a parameter (newFilesOnly) that
specifies whether to look only for new files or also for already present files.
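
A minimal Scala sketch of that fileStream variant (assuming ssc is an existing
StreamingContext; the directory is hypothetical):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///user/stream/input",
    (path: Path) => true,   // accept every file name
    newFilesOnly = false    // also process files already sitting in the directory
  ).map(_._2.toString)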

For deleting the processed files, one way is to get the list of all files in
the DStream. This can be done by using the foreachRDD API of the DStream
received from fileStream (or textFileStream).

Suppose the dStream is

JavaDStream<String> jpDstream = ssc.textFileStream("path/to/your/folder/");

jpDstream.print();

jpDstream.foreachRDD(
  new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> arg0) throws Exception {
      getContentHigh(arg0, ssc);
      return null;
    }
  }
);

public static void getContentHigh(JavaRDD<String> ds, JavaStreamingContext ssc) {

  // this gives the number of files the stream picked up
  int lenPartition = ds.rdd().partitions().length;
  Partition[] listPartitions = ds.rdd().partitions();

  for (int i = 0; i < lenPartition; i++) {

    UnionPartition upp = (UnionPartition) listPartitions[i];

    NewHadoopPartition npp = (NewHadoopPartition) upp.parentPartition();

    String fPath = npp.serializableHadoopSplit().value().toString();

    String[] nT = fPath.split(":");

    // name is the path of the file picked for processing; the processing logic
    // can go inside this loop, and once done you can delete the file using the
    // path in the variable name
    String name = nT[0];
  }
}


Thanks.

On Fri, Jan 30, 2015 at 11:37 PM, ganterm [via Apache Spark User List] wrote:

 We are running a Spark streaming job that retrieves files from a directory
 (using textFileStream).
 One concern we are having is the case where the job is down but files are
 still being added to the directory.
 Once the job starts up again, those files are not being picked up (since
 they are not new or changed while the job is running) but we would like
 them to be processed.
 Is there a solution for that? Is there a way to keep track what files have
 been processed and can we force older files to be picked up? Is there a
 way to delete the processed files?

 Thanks!
 Markus


Re: save spark streaming output to single file on hdfs

2015-01-15 Thread Prannoy
Hi,

You can use the FileUtil.copyMerge API and give it the path to the folder where
saveAsTextFile has saved the part text files.

Suppose your directory is /a/b/c/

Use FileUtil.copyMerge(FileSystem of source, /a/b/c, FileSystem of
destination, path to the merged file, say /a/b/c.txt, true (to delete the
original dir), conf, null)
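
A minimal Scala sketch of that call (hypothetical paths, assuming the part files
were written under /a/b/c on the same filesystem):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(fs, new Path("/a/b/c"), fs, new Path("/a/b/c.txt"),
  true,       // delete the original part-file directory after merging
  conf, null)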

Thanks.

On Tue, Jan 13, 2015 at 11:34 PM, jamborta [via Apache Spark User List] wrote:

 Hi all,

 Is there a way to save dstream RDDs to a single file so that another
 process can pick it up as a single RDD?
 It seems that each slice is saved to a separate folder, using
 saveAsTextFiles method.

 I'm using spark 1.2 with pyspark

 thanks,






Re: saveAsTextFile

2015-01-15 Thread Prannoy
Hi,

Before saving the RDD, do a collect on it and print its contents. Probably it
contains a null value.
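
A quick way to do that check (a sketch, assuming rdd is the RDD being saved; the
output path is hypothetical):

val data = rdd.collect()          // pull the RDD to the driver (only for small data)
data.foreach(println)             // look for null or empty elements
rdd.saveAsTextFile("hdfs:///tmp/debug-output")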

Thanks.

On Sat, Jan 3, 2015 at 5:37 PM, Pankaj Narang [via Apache Spark User List] wrote:

 If you can paste the code here I can certainly help.

 Also confirm the version of spark you are using

 Regards
 Pankaj
 Infoshore Software
 India


Re: Inserting an element in RDD[String]

2015-01-15 Thread Prannoy
Hi,

You can put the schema line in another RDD and then do a union of the two RDDs.

List<String> schemaList = new ArrayList<String>();
schemaList.add("xyz");

// where "xyz" is your schema line

JavaRDD<String> schemaRDD = sc.parallelize(schemaList);

// where sc is your JavaSparkContext

JavaRDD<String> newRDD = schemaRDD.union(yourRDD);

// where yourRDD is the other RDD at the beginning of which you want to add the
// schema line

The code is in Java; you can change it to Scala.
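
A rough Scala equivalent of the sketch above (assuming sc is the SparkContext and
yourRDD is the existing RDD[String]):

val schemaRDD = sc.parallelize(Seq("xyz"))   // "xyz" is your schema line
val newRDD = schemaRDD.union(yourRDD)        // union keeps schemaRDD's partitions before yourRDD's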

Thanks.





On Thu, Jan 15, 2015 at 7:46 PM, Hafiz Mujadid [via Apache Spark User List] wrote:

 hi experts!

 I have an RDD[String] and I want to add a schema line at the beginning of this
 RDD. I know RDDs are immutable. So is there any way to have a new RDD with one
 schema line followed by the contents of the previous RDD?


 Thanks


Re: java.io.IOException: Mkdirs failed to create file:/some/path/myapp.csv while using rdd.saveAsTextFile(fileAddress) Spark

2015-01-13 Thread Prannoy
What path are you giving in saveAsTextFile? Can you show the whole line?

On Tue, Jan 13, 2015 at 11:42 AM, shekhar [via Apache Spark User List] wrote:

 I am still having this issue with the rdd.saveAsTextFile() method.


 thanks,
 Shekhar reddy


Re: Failed to save RDD as text file to local file system

2015-01-13 Thread Prannoy
Hi,

Could you just try one thing: make a directory anywhere outside cloudera and
then try the same write.

Suppose the directory made is testWrite.

Do r.saveAsTextFile("/home/testWrite/")

I think the cloudera/tmp folder does not have write permission for users other
than the cloudera manager itself.

Thanks.

On Mon, Jan 12, 2015 at 9:51 PM, NingjunWang [via Apache Spark User List] wrote:

  Prannoy



 I tried r.saveAsTextFile("home/cloudera/tmp/out1"); it returns
 without error. But where is it saved to? The folder
 “/home/cloudera/tmp/out1” is not created.



 I also tried the following:

 cd /home/cloudera/tmp/
 spark-shell

 scala> val r = sc.parallelize(Array("a", "b", "c"))
 scala> r.saveAsTextFile("out1")

 It does not return an error. But still there is no “out1” folder created
 under /home/cloudera/tmp/



 I tried to give an absolute path but then I get an error:

 scala> r.saveAsTextFile("/home/cloudera/tmp/out1")

 org.apache.hadoop.security.AccessControlException: Permission denied:
 user=cloudera, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
     at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)
     at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:238)
     at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:216)
     at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:145)
     at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6286)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6268)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:6220)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4087)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4057)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4030)
     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:787)
     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.mkdirs(AuthorizationProviderProxyClientProtocol.java:297)
     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:594)
     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:415)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)



 Very frustrated. Please advise.





 Regards,



 *Ningjun Wang*

 Consulting Software Engineer

 LexisNexis

 121 Chanlon Road

 New Providence, NJ 07974-1541



 From: Prannoy [via Apache Spark User List]
 Sent: Monday, January 12, 2015 4:18 AM
 To: Wang, Ningjun (LNG-NPV)
 Subject: Re: Failed to save RDD as text file to local file system



 Have you tried simply giving the path where you want to save the file?

 For instance, in your case just do

 r.saveAsTextFile("home/cloudera/tmp/out1")

 Don't use the "file://" prefix.

 This will create a folder with the name out1. saveAsTextFile always writes by
 making a directory; it does not write the data into a single file.

 In case you need a single file you can use the copyMerge API in FileUtil.

 FileUtil.copyMerge(fs, "home/cloudera/tmp/out1", fs, "home/cloudera/tmp/out2",
 true, conf, null);

 Now out2 will be a single file containing your data.

 fs is the FileSystem of your local file system.

 Thanks





 On Sat, Jan 10, 2015 at 1:36 AM, NingjunWang [via Apache Spark User List] wrote:

 No, do you have any idea?



 Regards,



 *Ningjun Wang*

 Consulting Software Engineer

 LexisNexis

 121 Chanlon Road

 New Providence, NJ 07974-1541




Re: How to set UI port #?

2015-01-12 Thread Prannoy
Set the port using

spconf.set("spark.ui.port", "xxxx");

where xxxx is any free port, and

spconf is your Spark configuration (SparkConf) object.
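
A slightly fuller Scala sketch (the app name and port are just examples): pin the UI
port, or skip launching the UI entirely with spark.ui.enabled.

import org.apache.spark.SparkConf

val spconf = new SparkConf()
  .setAppName("HourlyBatch")            // hypothetical app name
  .set("spark.ui.port", "4050")         // any free port
  // .set("spark.ui.enabled", "false")  // alternative: do not launch the UI at all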

On Sun, Jan 11, 2015 at 2:08 PM, YaoPau [via Apache Spark User List] wrote:

 I have multiple Spark Streaming jobs running all day, and so when I run my
 hourly batch job, I always get a java.net.BindException: Address already
 in use which starts at 4040 then goes to 4041, 4042, 4043 before settling
 at 4044.

 That slows down my hourly job, and time is critical.  Is there a way I can
 set it to 4044 by default, or prevent the UI from launching altogether?

 Jon


Re: Failed to save RDD as text file to local file system

2015-01-12 Thread Prannoy
Have you tried simply giving the path where you want to save the file?

For instance, in your case just do

r.saveAsTextFile("home/cloudera/tmp/out1")

Don't use the "file://" prefix.

This will create a folder with the name out1. saveAsTextFile always writes by
making a directory; it does not write the data into a single file.

In case you need a single file you can use the copyMerge API in FileUtil.

FileUtil.copyMerge(fs, "home/cloudera/tmp/out1", fs, "home/cloudera/tmp/out2",
true, conf, null);

Now out2 will be a single file containing your data.

fs is the FileSystem of your local file system.

Thanks



On Sat, Jan 10, 2015 at 1:36 AM, NingjunWang [via Apache Spark User List] wrote:

  No, do you have any idea?



 Regards,



 *Ningjun Wang*

 Consulting Software Engineer

 LexisNexis

 121 Chanlon Road

 New Providence, NJ 07974-1541



 From: firemonk9 [via Apache Spark User List]
 Sent: Friday, January 09, 2015 2:56 PM
 To: Wang, Ningjun (LNG-NPV)
 Subject: Re: Failed to save RDD as text file to local file system



 Have you found any resolution for this issue ?



Re: [question]Where can I get the log file

2014-12-04 Thread Prannoy
Hi,

You can access your logs in the /spark_home_directory/logs/ directory.

cat the files there and you will see the logs.

Thanks.

On Thu, Dec 4, 2014 at 2:27 PM, FFeng [via Apache Spark User List] wrote:

 I have written data to the Spark log.
 I can get it through the web interface, but I really want to know if I can get
 these log files on my node.
 Where are they?
 Thx.


Re: How can I read an avro file in HDFS in Java?

2014-12-03 Thread Prannoy
Hi,
Try using

sc.newAPIHadoopFile("hdfs path to your file",
    AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
    yourConfiguration)

You will get the Avro-related classes by importing org.apache.avro.*
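
The call above targets Avro sequence files. For a plain .avro container file, a
commonly used variant (a sketch, not from the original answer; the path is
hypothetical) is AvroKeyInputFormat:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]]("hdfs:///data/events.avro")
  .map(_._1.datum())   // unwrap the AvroKey to get the GenericRecord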

Thanks.

On Tue, Dec 2, 2014 at 9:23 PM, leaviva [via Apache Spark User List] wrote:

 How can I read an Avro file in HDFS?
 I tried to use newAPIHadoopFile but I don't know how to use it.


Re: object xxx is not a member of package com

2014-12-03 Thread Prannoy
Hi,

Add the jars to the external libraries of your related project.

Right click on the package or class -> Build Path -> Configure Build Path ->
Java Build Path -> select the Libraries tab -> Add External JARs ->
browse to the jar containing com.xxx.yyy.zzz._ -> OK.
Clean and build your project; most probably you will then be able to pull the
classes from the com.xxx.yyy.zzz._ package.
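
Since the question builds with sbt assembly, a minimal sbt-side sketch (an
assumption, not part of the answer above): sbt treats any jar placed under the
project's lib/ directory as an unmanaged dependency, so copying the third-party
jar there usually makes the import resolve; if the jar must live elsewhere,
point sbt at that directory in build.sbt:

// build.sbt -- "custom_lib" is a hypothetical directory name
unmanagedBase := baseDirectory.value / "custom_lib"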

Thanks.

On Wed, Dec 3, 2014 at 4:29 AM, flyson [via Apache Spark User List] wrote:

 Hello everyone,

 Could anybody tell me how to import and call the 3rd party java classes
 from inside spark?
 Here's my case:
 I have a jar file (the directory layout is com.xxx.yyy.zzz) which contains
 some java classes, and I need to call some of them in spark code.
 I used the statement import com.xxx.yyy.zzz._ on top of the impacted
 spark file and set the location of the jar file in the CLASSPATH
 environment, and use .sbt/sbt assembly to build the project. As a result,
 I got an error saying object xxx is not a member of package com.

 I thought that this could be related to the dependencies, but couldn't
 figure it out. Any suggestion/solution from you would be appreciated!

 Thanks!


Re: How to use FlumeInputDStream in spark cluster?

2014-11-28 Thread Prannoy
Hi,

A BindException comes when two processes are using the same port. In your
Spark configuration just set ("spark.ui.port", "x")
to some other port; x can be any free port number, say 12345. The BindException
will not break your job in either case. Just change the port number to fix it.

Thanks.

On Fri, Nov 28, 2014 at 1:30 PM, pamtang [via Apache Spark User List] wrote:

 I'm seeing the same issue on CDH 5.2 with Spark 1.1. FlumeEventCount works
 fine on a Standalone cluster but throw BindException on YARN mode. Is there
 a solution to this problem or FlumeInputDStream will not be working in a
 cluster environment?


Re: read both local path and HDFS path

2014-11-27 Thread Prannoy
Hi,

The configuration you provide is just for accessing HDFS when you give an
HDFS path. When you provide an HDFS path with the HDFS nameservice, as in
your case hmaster155:9000, it goes into HDFS to look for the file. For
accessing a local file, just give the local path of the file: go to the file
on the local filesystem and do a pwd, which will give you the full path of the
file. Give that path as your local path for the file and you will be fine.
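
A minimal sketch of the two cases (the HDFS URI is the one from the question; the
local path is hypothetical). Spelling out the URI scheme makes the choice explicit:
a file:// path always goes to the local filesystem, even when HDFS is the default.

val fromHdfs  = sc.textFile("hdfs://hmaster155:9000/ad-cpc/2014-11-28/")
val fromLocal = sc.textFile("file:///home/user/ad-cpc/2014-11-28/")  // forces the local filesystem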

Thanks.

On Fri, Nov 28, 2014 at 8:57 AM, tuyuri [via Apache Spark User List] wrote:


 I have set up a Spark cluster configured with HDFS, and I know that by default
 file paths are read by Spark from HDFS. For example /ad-cpc/2014-11-28/ is read
 as hdfs://hmaster155:9000/ad-cpc/2014-11-28/.

 Sometimes I wonder how I can force Spark to read a file from the local
 filesystem without reconfiguring my cluster (to not use HDFS).

 Please help me!


Re: Persist streams to text files

2014-11-21 Thread Prannoy
Hi ,

You can use the FileUtil.copyMerge API and give it the path to the folder where
saveAsTextFile has saved the part text files.

Suppose your directory is /a/b/c/

Use FileUtil.copyMerge(FileSystem of source, /a/b/c, FileSystem of
destination, path to the merged file, say /a/b/c.txt, true (to delete the
original dir), conf, null)

Thanks.

On Fri, Nov 21, 2014 at 11:31 AM, Jishnu Prathap [via Apache Spark User List] wrote:

  Hi, I am also having a similar problem. Any fix suggested?



 *Originally Posted by GaganBM*

 Hi,

 I am trying to persist the DStreams to text files. When I use the inbuilt
 API 'saveAsTextFiles' as :

 stream.saveAsTextFiles(resultDirectory)

 this creates a number of subdirectories, for each batch, and within each
 sub directory, it creates bunch of text files for each RDD (I assume).

 I am wondering if I can have single text files for each batch. Is there
 any API for that ? Or else, a single output file for the entire stream ?

 I tried to manually write from each RDD stream to a text file as :

 stream.foreachRDD(rdd => {
   rdd.foreach(element => {
     fileWriter.write(element)
   })
 })

 where 'fileWriter' simply makes use of a Java BufferedWriter to write
 strings to a file. However, this fails with exception :

 DStreamCheckpointData.writeObject used
 java.io.BufferedWriter
 java.io.NotSerializableException: java.io.BufferedWriter
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 .

 Any help on how to proceed with this ?




Re: Cores on Master

2014-11-21 Thread Prannoy
Hi,

You can also set the cores in the Spark application itself:

http://spark.apache.org/docs/1.0.1/spark-standalone.html
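
A small sketch of the relevant knobs (the values are examples, not from the thread):
SPARK_WORKER_CORES limits what the worker on a given machine offers, and
spark.cores.max caps what one application takes across the cluster.

// conf/spark-env.sh on the master machine: limit what its worker offers
//   SPARK_WORKER_CORES=4

// or cap what a single application may take across the whole cluster
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.cores.max", "8")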

On Wed, Nov 19, 2014 at 6:11 AM, Pat Ferrel-2 [via Apache Spark User List] wrote:

 OK hacking the start-slave.sh did it

 On Nov 18, 2014, at 4:12 PM, Pat Ferrel wrote:

 This seems to work only on a ‘worker’ not the master? So I’m back to
 having no way to control cores on the master?

 On Nov 18, 2014, at 3:24 PM, Pat Ferrel wrote:

 Looks like I can do this by not using start-all.sh but starting each
 worker separately passing in a '--cores n' to the master? No config/env
 way?

 On Nov 18, 2014, at 3:14 PM, Pat Ferrel wrote:

 I see the default and max cores settings but these seem to control total
 cores per cluster.

 My cobbled together home cluster needs the Master to not use all its cores
 or it may lock up (it does other things). Is there a way to control max
 cores used for a particular cluster machine in standalone mode?




Re: Slow performance in spark streaming

2014-11-21 Thread Prannoy
Hi,

Spark in local mode runs slower than in a cluster. Cluster machines
usually have a higher configuration, and the tasks are distributed across
workers in order to get a faster result, so you will always find a
difference in speed between running locally and running in a cluster. Try
running the same job in a cluster and evaluate the speed there.

Thanks

On Thu, Nov 20, 2014 at 6:52 PM, Blackeye [via Apache Spark User List] wrote:

 I am using spark streaming 1.1.0 locally (not in a cluster). I created a
 simple app that parses the data (about 10,000 entries), stores it in a
 stream and then makes some transformations on it. Here is the code:

 def main(args: Array[String]) {
   val master = "local[8]"
   val conf = new SparkConf().setAppName("Tester").setMaster(master)
   val sc = new StreamingContext(conf, Milliseconds(110000))
   val stream = sc.receiverStream(new MyReceiver("localhost", ))
   val parsedStream = parse(stream)
   parsedStream.foreachRDD(rdd =>
     println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))
   val result1 = parsedStream.filter(entry =>
     entry.symbol.contains("walking") && entry.symbol.contains("true") &&
     entry.symbol.contains("id0")).map(_.time)
   val result2 = parsedStream.filter(entry =>
     entry.symbol == "disappear" && entry.symbol.contains("id0")).map(_.time)
   val result3 = result1
     .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))
   result3.foreachRDD(rdd =>
     println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))
   sc.start()
   sc.awaitTermination()
 }

 def parse(stream: DStream[String]) = {
   stream.flatMap { line =>
     val entries = line.split("assert").filter(entry => !entry.isEmpty)
     entries.map { tuple =>
       val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
       tuple match {
         case pattern(symbol, time) => new Data(symbol, time.toInt)
       }
     }
   }
 }

 case class Data(symbol: String, time: Int)

 I have a batch duration of 110,000 milliseconds in order to receive all
 the data in one batch. I believed that, even locally, Spark is very
 fast. In this case, it takes about 3.5 sec to execute the rule (between
 RULE STARTS and RULE ENDS). Am I doing something wrong or is this the
 expected time? Any advice?


Re: Parsing a large XML file using Spark

2014-11-21 Thread Prannoy
Hi,

Parallel processing of XML files can be an issue because of the tags in the XML
file. The XML file has to be intact: while parsing, the parser matches start
and end entities, and if the file is distributed in parts to the workers, a
worker may or may not find the matching start and end tags within its own part,
which will give an exception.

Thanks.

On Wed, Nov 19, 2014 at 6:26 AM, ssimanta [via Apache Spark User List] wrote:

 If there is one big XML file (e.g., the 44GB Wikipedia dump, or the larger dump
 that also includes all revision information) stored in HDFS, is it possible
 to parse it in parallel/faster using Spark? Or do we have to use something
 like a PullParser or Iteratee?

 My current solution is to read the single XML file in the first pass -
 write it to HDFS and then read the small files in parallel on the Spark
 workers.

 Thanks
 -Soumya






Re: Execute Spark programs from local machine on Yarn-hadoop cluster

2014-11-21 Thread Prannoy
Hi Naveen,

I don't think this is possible. If you are setting the master with your
cluster details, you cannot execute a job from your local machine. You
have to execute the jobs inside your YARN machine so that SparkConf is able
to connect with all the provided details.

If this is not the case, please give a detailed explanation of what exactly
you are trying to do :)

Thanks.

On Fri, Nov 21, 2014 at 8:11 PM, Naveen Kumar Pokala [via Apache Spark User List] wrote:

 Hi,



 I am executing my spark jobs on yarn cluster by forming conf object in the
 following way.



 SparkConf conf = new SparkConf().setAppName("NewJob").setMaster("yarn-cluster");



 Now I want to execute spark jobs from my local machine how to do that.



 What I mean is there a way to give IP address, port all the details to
 connect a master(YARN) on some other network from my local spark Program.



 -Naveen



Re: Spark Streaming Application Got killed after 2 hours

2014-11-16 Thread Prannoy
Hi Saj,

What is the size of the input data that you are putting on the stream?
Have you tried running the same application with a different set of data?
It is weird that the streaming stops after exactly 2 hours. Try running the
same application with data of different sizes to see if it has
something to do with a memory issue. Can you also provide a detailed error
log?

Thanks.

On Sun, Nov 16, 2014 at 11:49 AM, SAJ [via Apache Spark User List] wrote:

 Hi All,
 I am trying to run a Spark Streaming application 24/7, but exactly after 2
 hours it gets killed. I tried again, but again it got killed after 2 hours.
 Following is the error log from the worker:

 14/11/15 13:53:24 INFO network.ConnectionManager: Removing
 ReceivingConnection to ConnectionManagerId(ip-x,38863)
 14/11/15 13:53:24 INFO network.ConnectionManager: Removing
 SendingConnection to ConnectionManagerId(ip-x,38863)
 14/11/15 13:53:24 ERROR network.SendingConnection: Exception while reading
 SendingConnection to ConnectionManagerI

 Does anybody faced these same issue.
 Thanks  Regards,
 SAJ


Re: saveAsTextFile error

2014-11-15 Thread Prannoy
Hi Niko, 

Have you tried running it while keeping the wordCounts.print()? Possibly the
import of the package org.apache.spark.streaming._ is missing, so during
sbt package it is unable to locate the saveAsTextFile API.

Go to
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
to check if all the needed imports are there.

Thanks.


