Re: Sparse Vector ArrayIndexOutOfBoundsException

2015-12-04 Thread Yanbo Liang
Could you also print the length of featureSet? I suspect it is less than 62. The first argument of Vectors.sparse() is the length of this sparse vector, not the number of non-zero elements. Yanbo 2015-12-03 22:30 GMT+08:00 nabegh : > I'm trying to run a SVM classifier on unlabeled
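For illustration (the indices and values below are made up, not from the original thread), a vector of size 62 with only three non-zero entries would be built like this:

    import org.apache.spark.mllib.linalg.Vectors

    // First argument = dimension of the whole vector (62),
    // then the indices and values of the non-zero entries only.
    val v = Vectors.sparse(62, Array(0, 10, 61), Array(1.0, 3.5, 2.0))

Every index passed has to be strictly smaller than that first argument, which is why declaring the vector with too small a size can fail later with exactly this kind of ArrayIndexOutOfBoundsException.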

Predictive Modeling

2015-12-04 Thread Chintan Bhatt
Hi, I'm very much interested in building a predictive model using crime data (from 2001 to present; a big .csv file of about 1.5 GB) in Spark on Hortonworks. Can anyone tell me how to start? -- CHINTAN BHATT Assistant Professor, U & P U Patel

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
Yes, it results in a shuffle. > On Dec 4, 2015, at 6:04 PM, Stephen Boesch wrote: > > @Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, > does it not? > > 2015-12-04 2:02 GMT-08:00 Fengdong Yu

Questions about Spark Shuffle and Heap

2015-12-04 Thread Jianneng Li
Hi, On the Spark Configuration page ( http://spark.apache.org/docs/1.5.2/configuration.html), the documentation for spark.shuffle.memoryFraction mentions that the fraction is taken from the Java heap. However, the documentation for spark.shuffle.io.preferDirectBufs implies that off-heap memory

Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Abhishek Shivkumar
Hi, I am using Spark with Python and I have a filter constraint as follows: my_rdd.filter(my_func) where my_func is a method I wrote to filter the RDD items based on my own logic. I have defined my_func as follows: def my_func(my_item): { ... } Now, I want to pass another

Avoid Shuffling on Partitioned Data

2015-12-04 Thread Yiannis Gkoufas
Hi there, I have my data stored in HDFS, partitioned by month, in Parquet format. The directory looks like this: month=201411, month=201412, month=201501, ... I want to compute some aggregates for every timestamp. How is it possible to achieve that by taking advantage of the existing

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Stephen Boesch
@Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, does it not? 2015-12-04 2:02 GMT-08:00 Fengdong Yu : > There are many ways, one simple is: > > such as: you want to know how many rows for each month: > > >

Getting error when trying to start master node after building spark 1.3

2015-12-04 Thread Mich Talebzadeh
Hi, I am trying to make Hive work with Spark. I have been told that I need to use Spark 1.3 and build it from source code WITHOUT HIVE libraries. I have built it as follows: ./make-distribution.sh --name "hadoop2-without-hive" --tgz

Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-04 Thread Darin McBeath
ok, a new capability has been added to spark-xml-utils (1.3.0) to address this request. Essentially, the capability to specify 'processor' features has been added (through a new getInstance function). Here is a list of the features that can be set

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
There are many ways; one simple one, for example if you want to know how many rows there are for each month:

sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count

The output looks like:

month   count
201411  100
201412  200

Hope this helps. > On Dec 4, 2015, at 5:53 PM, Yiannis

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Sean Owen
There is no way to upgrade a running cluster here. You can stop a cluster, and simply start a new cluster in the same way you started the original cluster. That ought to be simple; the only issue I suppose is that you have down-time since you have to shut the whole thing down, but maybe that's

Re: SparkR in Spark 1.5.2 jsonFile Bug Found

2015-12-04 Thread Yanbo Liang
I have created SPARK-12146 to track this issue. 2015-12-04 9:16 GMT+08:00 Felix Cheung : > It looks like this has been broken around Spark 1.5. > > Please see JIRA SPARK-10185. This has been fixed in pyspark but > unfortunately SparkR was missed. I have confirmed this

Re: Spark Streaming from S3

2015-12-04 Thread Steve Loughran
On 3 Dec 2015, at 19:31, Michele Freschi wrote: > Hi Steve, I'm on hadoop 2.7.1 using the s3n
Switch to s3a. It's got better performance on big files (including a better forward seek that doesn't close connections; a faster close() on
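For anyone following along, a rough sketch of the switch (the property names are the standard s3a ones in Hadoop 2.7.x; the bucket path and key placeholders are illustrative, not from this thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc: SparkContext = ???                    // your existing context
    val ssc = new StreamingContext(sc, Seconds(30))

    // s3a credentials (fs.s3a.* instead of the old fs.s3n.* properties)
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")

    // Only the URI scheme changes in the job itself.
    val lines = ssc.textFileStream("s3a://my-bucket/incoming/")

This assumes the hadoop-aws jar (and its AWS SDK dependency) is on the classpath.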

anyone who can help me out with this error please

2015-12-04 Thread Mich Talebzadeh
Hi, I am trying to make Hive work with Spark. I have been told that I need to use Spark 1.3 and build it from source code WITHOUT HIVE libraries. I have built it as follows: ./make-distribution.sh --name "hadoop2-without-hive" --tgz

has someone seen this error please?

2015-12-04 Thread Mich Talebzadeh
Hi, I am trying to make Hive work with Spark. I have been told that I need to use Spark 1.3 and build it from source code WITHOUT HIVE libraries. I have built it as follows: ./make-distribution.sh --name "hadoop2-without-hive" --tgz

Re: Spark Streaming Specify Kafka Partition

2015-12-04 Thread Cody Koeninger
So createDirectStream will give you a JavaInputDStream of R, where R is the return type you chose for your message handler. If you want a JavaPairInputDStream, you may have to call .mapToPair in order to convert the stream, even if the type you chose for R was already Tuple2 (note that I try to
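As a rough Scala sketch of the same pattern (the thread itself is about the Java API; the topic, offsets and handler below are illustrative): createDirectStream takes explicit per-partition starting offsets plus a message handler, and the element type of the resulting stream is whatever the handler returns.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc: StreamingContext = ???   // your existing streaming context
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Read only partitions 0 and 1 of "mytopic", starting at offset 0.
    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 0L,
      TopicAndPartition("mytopic", 1) -> 0L)

    // The handler's return type becomes the stream's element type,
    // here a (key, value) tuple.
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)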

RDD functions

2015-12-04 Thread Sateesh Karuturi
Hello Spark experts... I am new to Apache Spark. Can anyone send me the proper documentation to learn RDD functions? Thanks in advance...

How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
Hello All - In spark-shell, when we press Tab after the RDD name and a dot, we can see a list of the possible transformations and actions, but not the full list. Is there any other way to get the rest of the list? I'm mainly looking for sortByKey(). val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")

Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread ayan guha
sortByKey() is an operation on pair RDDs, as it requires key-value pairs to work. I think in Scala there are transformations such as .toPairRDD(). On Sat, Dec 5, 2015 at 12:01 AM, Gokula Krishnan D wrote: > Hello All - > > In spark-shell when we press tab after . ; we could see
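A small illustration (the file layout is assumed, not taken from the thread): once the RDD holds (key, value) tuples, sortByKey and the other pair-RDD operations become available through Spark's implicit conversions, which is also why they don't show up when tab-completing on a plain RDD[String].

    // In older Spark versions the pair-RDD implicits live in SparkContext._
    import org.apache.spark.SparkContext._

    val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")

    // Assume each line is "brand,sales"; mapping to (brand, sales) pairs
    // makes PairRDDFunctions such as sortByKey available.
    val sales_pairs = sales_RDD.map { line =>
      val Array(brand, sales) = line.split(",")
      (brand, sales.toInt)
    }
    val sorted = sales_pairs.sortByKey()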

Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
Thanks Ayan for the updates. But in my example, I believe "sales_map" is a pair RDD, isn't it? Thanks & Regards, Gokula Krishnan (Gokul) On Fri, Dec 4, 2015 at 8:16 AM, ayan guha wrote: > sortByKey() is a property of pairRDD as it requires key value pair to > work. I

Re: Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Praveen Chundi
Passing a lambda function should work. my_rdd.filter(lambda x: myfunc(x, newparam)) Best regards, Praveen Chundi On 04.12.2015 13:19, Abhishek Shivkumar wrote: Hi, I am using spark with python and I have a filter constraint as follows: my_rdd.filter(my_func) where my_func is a method I

Re: Python API Documentation Mismatch

2015-12-04 Thread Roberto Pagliari
Hi Yanbo, You mean pyspark.mllib.recommendation, right? That is the one used in the official tutorial. Thank you, From: Yanbo Liang Date: Friday, 4 December 2015 03:17 To: Felix Cheung

[no subject]

2015-12-04 Thread Sateesh Karuturi
user-sc.1449231970.fbaoamghkloiongfhbbg-sateesh.karuturi9= gmail@spark.apache.org

Re: Is it possible to pass additional parameters to a python function when used inside RDD.filter method?

2015-12-04 Thread Abhishek Shivkumar
Excellent. that did work - thanks. On 4 December 2015 at 12:35, Praveen Chundi wrote: > Passing a lambda function should work. > > my_rrd.filter(lambda x: myfunc(x,newparam)) > > Best regards, > Praveen Chundi > > > On 04.12.2015 13:19, Abhishek Shivkumar wrote: > > Hi, >

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Michal Klos
If you are running on AWS I would recommend using s3 instead of hdfs as a general practice if you are maintaining state or data there. This way you can treat your spark clusters as ephemeral compute resources that you can swap out easily -- eg if something breaks just spin up a fresh cluster

Re: RDD functions

2015-12-04 Thread Ndjido Ardo BAR
Hi Michal, I think the following link could interest you. You'll find a lot of examples there! http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html cheers, Ardo On Fri, Dec 4, 2015 at 2:31 PM, Michal Klos wrote: >

Re: RDD functions

2015-12-04 Thread Michal Klos
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations M > On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi > wrote: > > Hello Spark experts... > Iam new to Apache Spark..Can anyone send me the proper Documentation to learn > RDD functions. >

Spark UI - Streaming Tab

2015-12-04 Thread patcharee
Hi, We tried to get the streaming tab interface on the Spark UI - https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-spark-streaming-applications.html Tested on versions 1.5.1 and 1.6.0-snapshot, but there is no such interface for streaming applications at all. Any suggestions? Do we

Spark applications metrics

2015-12-04 Thread patcharee
Hi How can I see the summary of data read / write, shuffle read / write, etc of an Application, not per stage? Thanks, Patcharee

Re: Spark UI - Streaming Tab

2015-12-04 Thread PhuDuc Nguyen
I believe the "Streaming" tab is dynamic - it appears once you have a streaming job running, not when the cluster is simply up. It does not depend on 1.6 and has been in there since at least 1.0. HTH, Duc On Fri, Dec 4, 2015 at 7:28 AM, patcharee wrote: > Hi, > > We

Spark Streaming Shuffle to Disk

2015-12-04 Thread spearson23
I'm running a Spark Streaming job on 1.3.1 which contains an updateStateByKey. The job works perfectly fine, but at some point (after a few runs), it starts shuffling to disk no matter how much memory I give the executors. I have tried changing --executor-memory on spark-submit,

Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Ted Yu
Did a quick test: rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[1] at map at <console>:29 I think sales_map is a MapPartitionsRDD. FYI On Fri, Dec 4, 2015 at 6:18 AM, Gokula Krishnan D wrote: > Thanks Ayan for the updates. > > But in my example, I hope "sales_map" is

ROW_TIMESTAMP support with UNSIGNED_LONG

2015-12-04 Thread pierre lacave
Hi I am trying to use the ROW_TIMESTAMP mapping featured in 4.6 as described in https://phoenix.apache.org/rowtimestamp.html However when inserting a timestamp in nanosecond I get the following exception saying the value cannot be less than zero? Inserting micros,micros or sec result in same

Fwd: Can't run Spark Streaming Kinesis example

2015-12-04 Thread Brian London
On my local system (8 core MBP) the Kinesis ASL example isn't working out of the box on a fresh build (Spark 1.5.2). I can see records going into the kinesis stream but the receiver is returning empty DStreams. The behavior is similar to an issue that's been discussed previously:

How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-04 Thread Abhishek Shivkumar
Hi, I have RDD1 that is broadcasted. I have a user defined method for the filter functionality of RDD2, written as follows: RDD2.filter(my_func) I want to access the values of RDD1 inside my_func. Is that possible? Should I pass RDD1 as a parameter into my_func? Thanks Abhishek S --

Regarding Join between two graphs

2015-12-04 Thread hastimal
Hello, I have two graph RDDs: one is a property graph and the other one is a connected-component graph, like: var propGraph = Graph(vertexArray,edgeArray).cache() with triplets: ((0,),(14,null),)

Re: Spark UI - Streaming Tab

2015-12-04 Thread patcharee
I ran streaming jobs, but no streaming tab appeared for those jobs. Patcharee On 04. des. 2015 18:12, PhuDuc Nguyen wrote: I believe the "Streaming" tab is dynamic - it appears once you have a streaming job running, not when the cluster is simply up. It does not depend on 1.6 and has been in

is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread prateek arora
Hi I want to create multiple SparkContexts in my application. I read many articles, and they suggest "usage of multiple contexts is discouraged, since SPARK-2243 is still not resolved." I want to know whether Spark 1.5.0 supports creating multiple contexts without error, and if supported, then

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Ted Yu
See Josh's response in this thread: http://search-hadoop.com/m/q3RTt1z1hUw4TiG1=Re+Question+about+yarn+cluster+mode+and+spark+driver+allowMultipleContexts Cheers On Fri, Dec 4, 2015 at 9:46 AM, prateek arora wrote: > Hi > > I want to create multiple sparkContext in

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Sabarish Sasidharan
#2: If using HDFS, it's on the disks. You can use the HDFS command line to browse your data, and then use s3distcp or simply distcp to copy data from HDFS to S3, or even use hdfs get commands to copy to local disk and then use the S3 CLI to copy to S3. #3: Cost of accessing data in S3 from EC2 nodes,

Re: understanding and disambiguating CPU-core related properties

2015-12-04 Thread Leonidas Patouchas
Regarding your 2nd question, there is a great article from Cloudera regarding this: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2. They focus on a YARN setup but the big picture applies everywhere. In general, I believe that you have to know your data in order to

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Michael Armbrust
On Fri, Dec 4, 2015 at 11:24 AM, Anfernee Xu wrote: > If multiple users are looking at the same data set, then it's good choice > to share the SparkContext. > > But my usercases are different, users are looking at different data(I use > custom Hadoop InputFormat to load

Higher Processing times in Spark Streaming with kafka Direct

2015-12-04 Thread SRK
Hi, Our processing times in Spark Streaming with the Kafka direct approach seem to have increased considerably with the increase in site traffic. Would increasing the number of Kafka partitions decrease the processing times? Any suggestions on tuning to reduce the processing times would be of

Re: JMXSink for YARN deployment

2015-12-04 Thread spearson23
We use a metrics.properties file on YARN by submitting applications like this: spark-submit --conf spark.metrics.conf=metrics.properties --class CLASS_NAME --master yarn-cluster --files /PATH/TO/metrics.properties /PATH/TO/CODE.JAR /PATH/TO/CONFIG.FILE APP_NAME

Re: Spark SQL IN Clause

2015-12-04 Thread Xiao Li
https://github.com/apache/spark/pull/9055 This PR explains how to convert IN to joins. Thanks, Xiao Li 2015-12-04 11:27 GMT-08:00 Michael Armbrust : > The best way to run this today is probably to manually convert the query > into a join. I.e. create a dataframe

Re: JMXSink for YARN deployment

2015-12-04 Thread spearson23
Run "spark-submit --help" to see all available options. To get JMX to work you need to: spark-submit --driver-java-options "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=JMX_PORT"

Exception in thread "main" java.lang.IncompatibleClassChangeError:

2015-12-04 Thread Prem Sure
Getting the below exception while executing the below program in Eclipse. Any clue on what's wrong here would be helpful. public class WordCount { private static final FlatMapFunction WORDS_EXTRACTOR = new FlatMapFunction() { @Override public Iterable

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Thanks all for your reply! I tested both approaches: registering the temp table then executing SQL vs. saving to HDFS filepath directly. The problem with the second approach is that I am inserting data into a Hive table, so if I create a new partition with this method, Hive metadata is not

RE: Broadcasting a parquet file using spark and python

2015-12-04 Thread Shuai Zheng
Hi all, Sorry to re-open this thread. I have a similar issue: one big parquet file left outer joined with quite a few smaller parquet files. But the run is extremely slow and sometimes even OOMs (with 300M , I have two questions here: 1. If I use an outer join, will Spark SQL automatically use

Re: Spark ML Random Forest output.

2015-12-04 Thread Vishnu Viswanath
Hi, As per my understanding, the probability matrix gives the probability that a particular item belongs to each class, so the one with the highest probability is your predicted class. Since you have converted your labels to indexed labels, according to the model the classes are 0.0 to 9.0 and I
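A tiny, made-up illustration of that reading of the output (the values are invented, not from the model in this thread):

    import org.apache.spark.mllib.linalg.Vectors

    // One row of the probability column for a 4-class model.
    val probability = Vectors.dense(0.01, 0.02, 0.85, 0.12)

    // The predicted (indexed) label is the position of the largest entry, here 2.
    val predictedIndex = probability.toArray.zipWithIndex.maxBy(_._1)._2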

Spark ML Random Forest output.

2015-12-04 Thread Eugene Morozov
Hello, I've got an input dataset of handwritten digits and working java code that uses random forest classification algorithm to determine the numbers. My test set is just some lines from the same input dataset - just to be sure I'm doing the right thing. My understanding is that having correct

Re: Higher Processing times in Spark Streaming with kafka Direct

2015-12-04 Thread u...@moosheimer.com
Hi, processing time depends on what you are doing with the events. Increasing the number of partitions could be an idea if you write more messages to the topic than you currently read via Spark. Can you give more details? Kind regards / best regards Kay-Uwe Moosheimer > Am

Re: Spark SQL IN Clause

2015-12-04 Thread Ted Yu
Thanks for the pointer, Xiao. I found that leftanti join type is no longer in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala FYI On Fri, Dec 4, 2015 at 12:04 PM, Xiao Li wrote: > https://github.com/apache/spark/pull/9055 > > This JIRA

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Kyohey Hamaguchi
Andy, Thank you for replying. I am specifying exactly that for --master; I had just missed it when writing that email. Sat, Dec 5, 2015, 9:27 Andy Davidson : > Hi Kyohey > > I think you need to pass the argument --master $MASTER_URL \ > > > master_URL is

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Andy Davidson
Hi Kyohey, I think you need to pass the argument --master $MASTER_URL \ where master_URL is something like spark://ec2-54-215-112-121.us-west-1.compute.amazonaws.com:7077 It's the public URL of your master. Andy From: Kyohey Hamaguchi Date: Friday, December 4, 2015

spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread prasadreddy
Hi All, I am running Spark on YARN and trying to enable authentication by setting spark.authenticate=true. After enabling authentication I am not able to run the Spark word count or any other programs. Any help will be appreciated. Thanks

Re: spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread Ted Yu
Which release are you using ? Please take a look at https://spark.apache.org/docs/latest/running-on-yarn.html There're several config parameters related to security: spark.yarn.keytab spark.yarn.principal ... FYI On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy wrote: > Hi

Re: spark-ec2 vs. EMR

2015-12-04 Thread Jonathan Kelly
Sending this to the list again because I'm pretty sure it didn't work the first time. A colleague just realized he was having the same problem with the list not accepting his posts, but unsubscribing and re-subscribing seemed to fix the issue for him. I've just unsubscribed and re-subscribed too,

Re: Improve saveAsTextFile performance

2015-12-04 Thread Ram VISWANADHA
That didn't work :( Any help? I have documented some steps here: http://stackoverflow.com/questions/34048340/spark-saveastextfile-last-stage-almost-never-finishes Best Regards, Ram From: Sahil Sareen Date: Wednesday, December 2, 2015 at 10:18 PM

Re: spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread Ted Yu
Did you try setting "spark.authenticate.secret" ? Cheers On Fri, Dec 4, 2015 at 7:07 PM, Prasad Reddy wrote: > Hi Ted, > > Thank you for the reply. > > I am using 1.5.2. > > I am implementing SASL encryption. Authentication is required to implement > SASL Encryption. > >

the way to compare any two adjacent elements in one rdd

2015-12-04 Thread Zhiliang Zhu
Hi All, I would like to compare any two adjacent elements in one given RDD, just as in this single-machine code: int a[N] = {...}; for (int i = 0; i < N - 1; ++i) { compareFun(a[i], a[i+1]); } ... mapPartitions may work for some situations; however, it could not compare elements in different

Re: the way to compare any two adjacent elements in one rdd

2015-12-04 Thread Zhiliang Zhu
Hi DB Tsai, Thanks very much for your kind reply! Sorry, one more issue: as tested, it seems that filter can only return JavaRDD but not any JavaRDD , is that right? Then it is not very convenient to do a general filter on an RDD; mapPartitions could work to some extent, but if some partition will be left and

Re: the way to compare any two adjacent elements in one rdd

2015-12-04 Thread DB Tsai
This is tricky. You need to shuffle the ending and beginning elements using mapPartitionsWithIndex. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu
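A rough sketch of that idea (collecting the first element of each partition to the driver and broadcasting it back, rather than doing a full shuffle; it assumes every partition is non-empty, and the element type and comparison function are placeholders):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def mapAdjacent[T: ClassTag, R: ClassTag](rdd: RDD[T])(f: (T, T) => R): RDD[R] = {
      // 1. Grab the first element of every partition and ship it to the driver.
      val heads = rdd.mapPartitionsWithIndex { (idx, it) =>
        if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
      }.collect().toMap
      val bcHeads = rdd.sparkContext.broadcast(heads)

      // 2. Append the head of the *next* partition to each partition, then
      //    compare neighbours locally with a sliding window of size 2.
      rdd.mapPartitionsWithIndex { (idx, it) =>
        val extended = bcHeads.value.get(idx + 1) match {
          case Some(nextHead) => it ++ Iterator(nextHead)
          case None           => it
        }
        extended.sliding(2).collect { case Seq(a, b) => f(a, b) }
      }
    }

    // e.g. mapAdjacent(sc.parallelize(1 to 10, 3))((a, b) => b - a).collect()

I believe MLlib also ships a sliding() helper in org.apache.spark.mllib.rdd.RDDFunctions that does something very similar for fixed-size windows.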

MLlib training time question

2015-12-04 Thread Haoyue Wang
Hi all, I'm doing some experiment with Spark MLlib (version 1.5.0). I train LogisticRegressionModel on a 2.06GB dataset (# of data: 2396130, # of features: 3231961, # of classes: 2, format: LibSVM). I deployed Spark to a 4 nodes cluster, each node's spec: CPU: Intel(R) Xeon(R) CPU E5-2650 0 @

Re: spark.authenticate=true YARN mode doesn't work

2015-12-04 Thread Prasad Reddy
I did try. Same problem. As you said earlier, spark.yarn.keytab and spark.yarn.principal are required. On Fri, Dec 4, 2015 at 7:25 PM, Ted Yu wrote: > Did you try setting "spark.authenticate.secret" ? > > Cheers > > On Fri, Dec 4, 2015 at 7:07 PM, Prasad Reddy

Any role for volunteering

2015-12-04 Thread Deepak Sharma
Hi All, Sorry for spamming your inbox. I am really keen to work on a big data project full time (preferably remote, from India); if not, I am open to volunteering as well. Please do let me know if there is any such opportunity available. -- Thanks Deepak

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread prateek arora
Hi Ted, Thanks for the information. Is there any way that two different Spark applications can share their data? Regards Prateek On Fri, Dec 4, 2015 at 9:54 AM, Ted Yu wrote: > See Josh's response in this thread: > > >

Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Lin, Hao
Hi, Does anyone know whether Spark running in AWS supports using a temporary access credential (AccessKeyId, SecretAccessKey + SecurityToken) to access S3? I only see references to specifying fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, without any mention of a security token. Apparently this is only

Oozie SparkAction not able to use spark conf values

2015-12-04 Thread Rajadayalan Perumalsamy
Hi We are trying to change our existing oozie workflows to use SparkAction instead of ShellAction. We are passing spark configuration in spark-opts with --conf, but these values are not accessible in Spark and it is throwing error. Please note we are able to use SparkAction successfully in

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread prateek arora
Thanks ... Is there any way my second application can run in parallel and wait to fetch data from HBase or any other data storage system? Regards Prateek On Fri, Dec 4, 2015 at 10:24 AM, Ted Yu wrote: > How about using NoSQL data store such as HBase :-) > > On Fri, Dec

Spark SQL IN Clause

2015-12-04 Thread Madabhattula Rajesh Kumar
Hi, What are the best practices for using an "IN" clause in Spark SQL? Use case: read the table based on a number, where I have a list of numbers - for example, 1 million of them. Regards, Rajesh

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Ted Yu
How about using NoSQL data store such as HBase :-) On Fri, Dec 4, 2015 at 10:17 AM, prateek arora wrote: > Hi Ted > Thanks for the information . > is there any way that two different spark application share there data ? > > Regards > Prateek > > On Fri, Dec 4, 2015

Re: ROW_TIMESTAMP support with UNSIGNED_LONG

2015-12-04 Thread pierre lacave
Sorry wrong list, please ignore On 4 Dec 2015 5:51 p.m., "pierre lacave" wrote: > > Hi I am trying to use the ROW_TIMESTAMP mapping featured in 4.6 as > described in https://phoenix.apache.org/rowtimestamp.html > > However when inserting a timestamp in nanosecond I get the

Re: Spark SQL IN Clause

2015-12-04 Thread Ted Yu
Have you seen this JIRA? [SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of children From the numbers Michael published, 1 million numbers would still need 250 seconds to parse. On Fri, Dec 4, 2015 at 10:14 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, >

Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Michal Klos
We were looking into this as well --- the answer looks like "no" Here's the ticket: https://issues.apache.org/jira/browse/HADOOP-9680 m On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao wrote: > Hi, > > > > Does anyone knows if Spark run in AWS is supported by temporary access >

RE: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Lin, Hao
Thanks, I will keep an eye on it. From: Michal Klos [mailto:michal.klo...@gmail.com] Sent: Friday, December 04, 2015 1:50 PM To: Lin, Hao Cc: user Subject: Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark? We were looking into this as well ---

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Michael Armbrust
To be clear, I don't think there is ever a compelling reason to create more than one SparkContext in a single application. The context is threadsafe and can launch many jobs in parallel from multiple threads. Even if there wasn't global state that made it unsafe to do so, creating more than one
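For example, a minimal sketch of that pattern (the app name and jobs are illustrative): one SparkContext shared by several threads, each submitting its own job concurrently.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shared-context"))

    // Each Future runs its own job, in parallel, on the one shared context.
    val jobs = (1 to 4).map { i =>
      Future { sc.parallelize(1 to 1000000).map(_ * i).sum() }
    }
    val results = Await.result(Future.sequence(jobs), 10.minutes)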

Re: Spark UI - Streaming Tab

2015-12-04 Thread Josh Rosen
The Streaming tab is only supported in the live UI, not in the History Server. On Fri, Dec 4, 2015 at 9:31 AM, patcharee wrote: > I ran streaming jobs, but no streaming tab appeared for those jobs. > > Patcharee > > > > On 04. des. 2015 18:12, PhuDuc Nguyen wrote: > >

Re: Spark SQL IN Clause

2015-12-04 Thread Michael Armbrust
The best way to run this today is probably to manually convert the query into a join. I.e. create a dataframe that has all the numbers in it, and join/outer join it with the other table. This way you avoid parsing a gigantic string. On Fri, Dec 4, 2015 at 10:36 AM, Ted Yu
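A hedged sketch of that rewrite (the table and column names here are hypothetical): build a DataFrame from the list of values and join on it, instead of generating a huge IN (...) string.

    val sqlContext: org.apache.spark.sql.SQLContext = ???   // your existing context
    import sqlContext.implicits._

    // The values that would have gone into the IN (...) clause.
    val ids = (1 to 1000000).map(Tuple1(_))
    val idsDF = sqlContext.sparkContext.parallelize(ids).toDF("id")

    // Inner join keeps only the rows whose id appears in the list; use an
    // outer join if you also need the non-matching rows.
    val events = sqlContext.table("events")
    val filtered = events.join(idsDF, "id")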

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Nicholas Chammas
Quick question: Are you processing gzipped files by any chance? It's a common stumbling block people hit. See: http://stackoverflow.com/q/27531816/877069 Nick On Fri, Dec 4, 2015 at 2:28 PM Kyohey Hamaguchi wrote: > Hi, > > I have setup a Spark standalone-cluster, which

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Mark Hamstra
Where it could start to make some sense is if you wanted a single application to be able to work with more than one Spark cluster -- but that's a pretty weird or unusual thing to do, and I'm pretty sure it wouldn't work correctly at present. On Fri, Dec 4, 2015 at 11:10 AM, Michael Armbrust

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Anfernee Xu
If multiple users are looking at the same data set, then it's a good choice to share the SparkContext. But my use cases are different: users are looking at different data (I use a custom Hadoop InputFormat to load data from my data source based on the user input), and the data might not have any overlap.

Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Kyohey Hamaguchi
Hi, I have set up a standalone Spark cluster with 5 workers, using the spark-ec2 script. After submitting my Spark application, I noticed that just one worker seemed to run the application and the other 4 workers were doing nothing. I confirmed this by checking CPU and memory usage on

How to modularize Spark Streaming Jobs?

2015-12-04 Thread SRK
Hi, What is the way to modularize Spark Streaming jobs, something along the lines of what Spring XD does? Thanks, Swetha