S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Hello, I'm building a Spark app required to read large amounts of log files from S3. I'm doing so in the code by constructing the file list and passing it to the context as follows: val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2,...,s3n://mybucket/fileN"). When running

Re: configure to run multiple tasks on a core

2014-11-26 Thread Sean Owen
What about running, say, 2 executors per machine, each of which thinks it should use all cores? You can also multi-thread your map function manually, directly, within your code, with careful use of a java.util.concurrent.Executor. On Wed, Nov 26, 2014 at 6:57 AM, yotto yotto.k...@autodesk.com
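
A minimal sketch of the manual multi-threading option, assuming an RDD[String] named rdd and a stand-in per-record function (both hypothetical):

import java.util.concurrent.{Callable, Executors}

def process(s: String): String = s.toUpperCase // stand-in for real per-record work

// Each partition gets its own fixed-size pool; records are submitted as
// Callables, all futures are forced, and the pool is shut down.
val result = rdd.mapPartitions { iter =>
  val pool = Executors.newFixedThreadPool(4)
  val futures = iter.map { record =>
    pool.submit(new Callable[String] { def call(): String = process(record) })
  }.toList // force submission before collecting results (iterators are lazy)
  val out = futures.map(_.get())
  pool.shutdown()
  out.iterator
}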

Spark SQL performance and data size constraints

2014-11-26 Thread SK
Hi, I use the following code to read in data and extract the unique users using Spark SQL. The data is 1.2 TB and I am running this on a cluster with 3 TB memory. It appears that there is enough memory, but the program just freezes after some time when it maps the RDD to the case class Play. (If

Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Jianshi Huang
Hi, I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails to optimize the auto broadcast join in my query. I have a query that joins a huge fact table with 15 tiny dimension tables. I'm currently using an older version of Spark which was built on Oct. 12. Has anyone else met

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread lalit1303
You can try creating a Hadoop Configuration and setting the S3 configuration, i.e. access keys etc. Then, for reading files from S3, use newAPIHadoopFile and pass the config object along with the key and value classes. - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context:
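
A minimal sketch of this suggestion, assuming the standard Hadoop s3n property names (bucket and credentials are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Build a Hadoop Configuration carrying the S3 credentials...
val hadoopConf = new Configuration()
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// ...and pass it to newAPIHadoopFile along with the key/value classes.
val lines = sc.newAPIHadoopFile("s3n://mybucket/file1",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map(_._2.toString) // keep only the line text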

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread lalit1303
Hi Mukesh, Once you create a streaming job, a DAG is created which contains your job plan, i.e. all map transformations and all action operations to be performed on each batch of the streaming application. So, once your job is started, the input DStream takes the data input from the specified source and all

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-26 Thread Sean Owen
You can call Scala code from Java, even when it involves overloaded operators, since they are also just methods with names like $plus and $times. In this case, it's not quite feasible since the Scala API is complex and would end up forcing you to manually supply some other implementation details

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer On Wed, Nov 26, 2014 at 11:08 AM, lalit1303

Starting the thrift server

2014-11-26 Thread Daniel Haviv
Hi, I'm trying to start the thrift server but failing: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353) at

Re: Remapping columns from a schemaRDD

2014-11-26 Thread Daniel Haviv
Is there some place I can read more about it? I can't find any reference. I actually want to flatten these structures and not return them from the UDF. Thanks, Daniel On Tue, Nov 25, 2014 at 8:44 PM, Michael Armbrust mich...@databricks.com wrote: Maps should just be scala maps, structs are

Re: why MatrixFactorizationModel private?

2014-11-26 Thread jamborta
many thanks for adding this so quickly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763p19855.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark Job submit

2014-11-26 Thread Naveen Kumar Pokala
Hi. Is there a way to submit a Spark job on a Hadoop-YARN cluster from Java code? -Naveen

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-26 Thread jatinpreet
Hi Sean, The values brzPi and brzTheta are of the form breeze.linalg.DenseVector[Double]. So would I have to convert them back to simple vectors and use a library to perform addition/multiplication? If yes, can you please point me to the conversion logic and vector operation library for Java?

Re: Issue with Spark latest 1.2.0 build - ClassCastException from [B to SerializableWritable

2014-11-26 Thread lokeshkumar
The above issue happens while trying to do the below activity on JavaRDD (calling take() on the RDD): JavaRDD<String> loadedRDD = sc.textFile(...); String[] tokens = loadedRDD.take(1).get(0).split(","); -- View this message in context:

RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Kostas Kloudas
Hi all, I am getting familiarized with MLlib and a thing I noticed is that running the MovieLensALS example on the MovieLens dataset for an increasing number of iterations does not decrease the RMSE. The results for a 0.6 training / 0.4 test split are below. For a training set of 0.8, the

Re: Issue with Spark latest 1.2.0 build - ClassCastException from [B to SerializableWritable

2014-11-26 Thread Sean Owen
I'll take a wild guess that you have mismatching versions of Spark at play. Your cluster has one build and you're accidentally including another version. I think this code path has changed recently (

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Nick Pentreath
copying user group - I keep replying directly vs reply all :) On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath nick.pentre...@gmail.com wrote: ALS will be guaranteed to decrease the squared error (therefore RMSE) in each iteration, on the *training* set. This does not hold for the *test* set

Having problem with Spark streaming with Kinesis

2014-11-26 Thread A.K.M. Ashrafuzzaman
Hi guys, When we are using Kinesis with 1 shard then it works fine. But when we use more than 1, it falls into an infinite loop and no data is processed by Spark Streaming. In the Kinesis DynamoDB, I can see that it keeps increasing the leaseCounter, but it does not start processing. I am

Re: Issue with Spark latest 1.2.0 build - ClassCastException from [B to SerializableWritable

2014-11-26 Thread lokeshkumar
Hi Sean, Thanks for the reply. We upgraded our spark cluster from 1.1.0 to 1.2.0. We also thought that this issue might be due to mismatched Spark jar versions, but we double checked and re-installed our app completely on a new system with the spark-1.2.0 distro, still with no result. Facing the same

Number of executors and tasks

2014-11-26 Thread Praveen Sripati
Hi, I am running Spark in standalone mode. 1) I have a file of 286MB in HDFS (block size is 64MB) and so it is split into 5 blocks. When I have the file in HDFS, 5 tasks are generated and so 5 files in the output. My understanding is that there will be a separate partition for each block and

Re: Running spark in standlone alone mode, saveAsTextFile() runs for forever

2014-11-26 Thread lalit1303
Try repartitioning the RDD to, say, 2x the number of available cores before saveAsTextFile. - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context:
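
In code, the suggestion amounts to something like this sketch (the partition count is a placeholder for 2x your core count):

// With, say, 8 cores available, repartition to ~16 so no single task
// has to write one enormous partition.
rdd.repartition(16).saveAsTextFile("hdfs:///output/path")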

SchemaRDD compute function

2014-11-26 Thread Jörg Schad
Hi, I have a short question regarding the compute() of a SchemaRDD. For a SchemaRDD the actual queryExecution seems to be triggered via collect(), while compute() only triggers the compute() of the parent and copies the data (please correct me if I am wrong!). Is this compute() triggered at all

ExternalAppendOnlyMap: Thread spilling in-memory map of to disk many times slowly

2014-11-26 Thread Romi Kuntsman
Hello, I have a large data calculation in Spark, distributed across several nodes. In the end, I want to write to a single output file. For this I do: output.coalesce(1, false).saveAsTextFile(filename). What happens is all the data from the workers flows to a single worker, and that one

how to force graphx to execute transformation

2014-11-26 Thread Hlib Mykhailenko
Hello, I work with GraphX. When I call graph.partitionBy(..) nothing happens because, as I understood, all transformations are lazy and partitionBy is built using transformations. Is there a way to force Spark to actually execute this transformation without using any action? --

Re: how to force graphx to execute transformation

2014-11-26 Thread Jörg Schad
Hi, can't you just use graph.partitionBy(..).collect()? Cheers, Joerg On Wed, Nov 26, 2014 at 2:25 PM, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote: Hello, I work with GraphX. When I call graph.partitionBy(..) nothing happens because, as I understood, all transformations are lazy

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Kostas Kloudas
Once again, the error increases even with the training dataset. The results are: Running 1 iterations For 1 iter.: Test RMSE = 1.2447121194304893 Training RMSE = 1.2394166987104076 (34.751317636 s). Running 5 iterations For 5 iter.: Test RMSE = 1.3253957117600659 Training RMSE =

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Sean Owen
How are you computing RMSE? And how are you training the model -- not with trainImplicit, right? I wonder if you are somehow optimizing something besides RMSE. On Wed, Nov 26, 2014 at 2:36 PM, Kostas Kloudas kklou...@gmail.com wrote: Once again, the error increases even with the training dataset.

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Kostas Kloudas
For the training I am using the code in the MovieLensALS example with trainImplicit set to false, and for the training RMSE I use val rmseTr = computeRmse(model, training, params.implicitPrefs). The computeRmse() method is provided in the MovieLensALS class. Thanks a lot, Kostas On
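
For reference, a sketch of what such an RMSE computation typically looks like in MLlib (reconstructed from memory, not the exact example code):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Predict a rating for every (user, product) pair in `data`, join the
// predictions with the actual ratings, and take the root mean squared error.
def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model.predict(data.map(r => (r.user, r.product)))
  val predAndActual = predictions.map(p => ((p.user, p.product), p.rating))
    .join(data.map(r => ((r.user, r.product), r.rating)))
    .values
  math.sqrt(predAndActual.map { case (p, a) => (p - a) * (p - a) }.mean())
}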

Spark Mesos integration bug?

2014-11-26 Thread contractor
Hi, We have been running Spark 1.0.2 with Mesos 0.20.1 in fine-grained mode and for the most part it has been working well. We have been using mesos://zk://server1:2181,server2:2181,server3:2181/mesos as the spark master URL and this works great to get the Mesos leader. Unfortunately, this

SparkContext.textFile() cannot load file using UNC path on Windows

2014-11-26 Thread Wang, Ningjun (LNG-NPV)
SparkContext.textFile() cannot load a file using a UNC path on Windows. I run the following on Windows XP: val conf = new SparkConf().setAppName("testproj1.ClassificationEngine").setMaster("local") val sc = new SparkContext(conf)

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
Hi Judy, Are you somehow modifying Spark's classpath to include jars from Hadoop and Hive that you have running on the machine? The issue seems to be that you are somehow including a version of Hadoop that references the original guava package. The Hadoop that is bundled in the Spark jars should

Jetty as spark streaming input

2014-11-26 Thread Guy Doulberg
Hi guys, I started playing with spark streaming, and I came up with an idea that I wonder is a valid idea: building a Jetty input stream, which is basically a Jetty server that streams each HTTP request it gets. What do you think of this idea? Thanks, Guy

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
Just to double check - I looked at our own assembly jar and I confirmed that our Hadoop configuration class does use the correctly shaded version of Guava. My best guess here is that somehow a separate Hadoop library is ending up on the classpath, possibly because Spark put it there somehow. tar

Re: Spark Job submit

2014-11-26 Thread Akhil Das
How about: - Create a SparkConf - setMaster as *yarn-cluster* - Create a JavaSparkContext with the above SparkConf. And that will submit it to the yarn cluster. Thanks Best Regards On Wed, Nov 26, 2014 at 4:20 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi. Is there a

RE: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-26 Thread Bui, Tri
Liang, Can you do me a favor and run predictOnValues on some sample test data, and see if it is working on your end? It is not working for me; it keeps predicting 0. My code: val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingLinearRegression") val ssc = new

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Akhil Das
I have it working without any issues (tried with 5 shards), except my java version was 1.7. Here's the piece of code that I used: System.setProperty("AWS_ACCESS_KEY_ID", this.kConf.getOrElse("access_key", "")) System.setProperty("AWS_SECRET_KEY", this.kConf.getOrElse("secret", "")) val streamName

Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread vdiwakar.malladi
Hi, When I'm trying to build the spark assembly to include the dependencies related to the thrift server, the build fails with the following error. Could anyone help me with this? [ERROR] Failed to execute goal on project spark-assembly_2.10: Could not resolve dependencies for project

Re: Number of executors and tasks

2014-11-26 Thread Akhil Das
1. On HDFS, files are treated as ~64MB in block size. When you put the same file on a local file system (ext3/ext4) it will be treated differently (in your case it looks like ~32MB) and that's why you are seeing 9 output files. 2. You could set *num-executors* to increase the number of executor

RE: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-26 Thread Bui, Tri
Thanks Yanbo! Modified code below: val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingLinearRegression") val ssc = new StreamingContext(conf, Seconds(args(2).toLong)) val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse) val testData =

Re: Number of executors and tasks

2014-11-26 Thread Akhil Das
This one would give you a better understanding http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors Thanks Best Regards On Wed, Nov 26, 2014 at 10:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote: 1. On HDFS files are treated as ~64mb in

Executor failover

2014-11-26 Thread Akshat Aranya
Hi, I have a question regarding failure of executors: how does Spark reassign partitions or tasks when executors fail? Is it necessary that new executors have the same executor IDs as the ones that were lost, or are these IDs irrelevant for failover?

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Aaron Davidson
Spark has a known problem where it will do a pass of metadata on a large number of small files serially, in order to find the partition information prior to starting the job. This will probably not be repaired by switching the FS impl. However, you can change the FS being used like so (prior to
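
The snippet is truncated above; the idea is presumably to point the s3n:// scheme at a different FileSystem class through the Hadoop configuration before the first job runs. A sketch, where the implementation class is purely hypothetical:

// Swap the FileSystem implementation backing s3n:// URLs. The class name
// below is a placeholder; substitute the implementation you actually have.
sc.hadoopConfiguration.set("fs.s3n.impl", "com.example.FasterS3FileSystem")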

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
What's your cluster size? For streaming to work, it needs shards + 1 executors. On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman ashrafuzzaman...@gmail.com wrote: Hi guys, When we are using Kinesis with 1 shard then it works fine. But when we use more than 1 then it falls into an infinite

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-26 Thread Takuya UESHIN
Hi, I guess this is fixed by https://github.com/apache/spark/pull/3110, which is not for complex type casting but makes inserting into a Hive table able to handle complex types, ignoring nullability. I also sent a pull-request (https://github.com/apache/spark/pull/3150) for complex type casting

This is just a test

2014-11-26 Thread NingjunWang
Test message -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/This-is-just-a-test-tp19895.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe,

Re: SchemaRDD compute function

2014-11-26 Thread Michael Armbrust
Exactly how the query is executed actually depends on a couple of factors as we do a bunch of optimizations based on the top physical operator and the final RDD operation that is performed. In general the compute function is only used when you are doing SQL followed by other RDD operations (map,
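
An illustration of the "SQL followed by other RDD operations" case Michael mentions (the table name is hypothetical):

// The SQL produces a SchemaRDD; the map that follows is a plain RDD
// operation, which is where SchemaRDD.compute() enters the picture.
val names = sqlContext.sql("SELECT name FROM people")
  .map(row => row.getString(0))
names.take(10)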

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Sean Owen
I also modified the example to try 1, 5, 9, ... iterations as you did, and also ran with the same default parameters. I used the sample_movielens_data.txt file. Is that what you're using? My result is: Iteration 1 Test RMSE = 1.426079653593016 Train RMSE = 1.5013155094216357 Iteration 5 Test

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread tian zhang
I have found this paper, which seems to answer most of the questions about RDD lifetimes: https://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf Tian On Tuesday, November 25, 2014 4:02 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hey Experts, I wanted to understand in

RDD saveAsObjectFile write to local file and HDFS

2014-11-26 Thread firemonk9
When I am running Spark locally, RDD saveAsObjectFile writes the file to the local file system (ex: path /data/temp.txt), and when I am running Spark on a YARN cluster, RDD saveAsObjectFile writes the file to HDFS (ex: path /data/temp.txt). Is there a way to explicitly specify the local file system

Re: RDD saveAsObjectFile write to local file and HDFS

2014-11-26 Thread Daniel Haviv
Prepend file:// to the path. Daniel On 26 Nov 2014, at 20:15, firemonk9 dhiraj.peech...@gmail.com wrote: When I am running spark locally, RDD saveAsObjectFile writes the file to local file system (ex : path /data/temp.txt) and when I am running spark on YARN cluster, RDD

Re: RDD saveAsObjectFile write to local file and HDFS

2014-11-26 Thread Du Li
Add "file://" in front of your path. On 11/26/14, 10:15 AM, firemonk9 dhiraj.peech...@gmail.com wrote: When I am running spark locally, RDD saveAsObjectFile writes the file to local file system (ex : path /data/temp.txt) and when I am running spark on YARN cluster, RDD saveAsObjectFile
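
In other words, a quick sketch of the fix:

// Explicitly target the local file system (note the absolute path):
rdd.saveAsObjectFile("file:///data/temp.txt")
// ...whereas a bare path resolves against the default FS (HDFS on a YARN cluster):
rdd.saveAsObjectFile("/data/temp.txt")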

Re: Jetty as spark straming input

2014-11-26 Thread rektide
On Wed, Nov 26, 2014 at 04:06:40PM +, Guy Doulberg wrote: Hi guys, I started playing with spark streaming, and I came up with an idea that I wonder is a valid idea: building a Jetty input stream, which is basically a Jetty server that streams each HTTP request it gets.

How can a function access Executor ID, Function ID and other parameters known to the Spark Environment

2014-11-26 Thread Steve Lewis
I am running on a 15-node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by MAC address but would prefer to use data known to the Spark environment - Executor ID and Function ID show up in the Spark UI, and Task ID and

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Xiangrui Meng
The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the regularization, it should be non-increasing. -Xiangrui On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen so...@cloudera.com wrote: I also modified

Re: how to force graphx to execute transformation

2014-11-26 Thread Ankur Dave
At 2014-11-26 05:25:10 -0800, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote: I work with GraphX. When I call graph.partitionBy(..) nothing happens because, as I understood, all transformations are lazy and partitionBy is built using transformations. Is there a way to force spark
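
Ankur's answer is truncated here, but a common way to force a lazy repartitioning, sketched under the assumption that caching the result is acceptable, is to cache and then run a cheap action:

import org.apache.spark.graphx.PartitionStrategy

// partitionBy is lazy, so cache the repartitioned graph and run a small
// action (counting edges) to make the shuffle actually happen.
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D).cache()
partitioned.edges.count() // forces execution of the repartitioning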

Re: RDD Cache Cleanup

2014-11-26 Thread sranga
Just to close out this one, I noticed that the cache partition size was quite low for each of the RDDs (1 - 14). Increasing the number of partitions (~400) resolved this for me. -- View this message in context:

Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-26 Thread Harihar Nahak
Hi Guys, has anyone experienced the same thing as above? - --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p19909.html Sent from the Apache Spark User List

Re: Spark Job submit

2014-11-26 Thread Sandy Ryza
I think that actually would not work - yarn-cluster mode expects a specific deployment path that uses SparkSubmit. Setting master as yarn-client should work. -Sandy On Wed, Nov 26, 2014 at 8:32 AM, Akhil Das ak...@sigmoidanalytics.com wrote: How about: - Create a SparkConf - setMaster as
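
A sketch along the lines Sandy suggests, assuming HADOOP_CONF_DIR on the submitting machine points at the cluster's configuration (the app name is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

// yarn-client mode runs the driver in this JVM and asks YARN for executors;
// it relies on HADOOP_CONF_DIR/YARN_CONF_DIR to locate the ResourceManager.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("MyJob")
val jsc = new JavaSparkContext(conf)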

streaming tasks unevenly distributed among the executors?

2014-11-26 Thread yuz1988
Hello, I am new to Spark. After I ran the spark streaming example 'StatefulNetworkWordCount', I was confused by what was shown on the web UI as follows: http://apache-spark-user-list.1001560.n3.nabble.com/file/n19911/uneven.jpg It seems like almost all the tasks were assigned to the

RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Cheng, Hao
Are all of your join keys the same? And I guess the join types are all “Left” joins; https://github.com/apache/spark/pull/3362 is probably what you need. Also, SparkSQL doesn’t support multiway-join (and multiway-broadcast join) currently; https://github.com/apache/spark/pull/3270 should be

RE: Spark SQL performance and data size constraints

2014-11-26 Thread Cheng, Hao
Spark SQL doesn't support DISTINCT well currently; in particular, in the case you described, it will lead all of the data to fall onto a single node and be kept in memory only. The dev community actually has solutions for this; it probably will be solved after the release of Spark 1.2.

RE: configure to run multiple tasks on a core

2014-11-26 Thread Yotto Koga
Thanks Sean. That worked out well. For anyone who happens onto this post and wants to do the same, these are the steps I took to do as Sean suggested... (Note this is for a standalone cluster.) Log in to the master; run ~/spark/sbin/stop-all.sh; edit ~/spark/conf/spark-env.sh; modify the line

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Just to add one more point: if Spark Streaming knows when an RDD will not be used any more, I believe Spark will not try to retrieve data it will not use any more. However, in practice, I often encounter the error "cannot compute split". Based on my understanding, this is because Spark cleared out

SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Kelly, Jonathan
I've noticed some strange behavior when I try to use SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file that contains elements with nested arrays. For example, with a file test.json that contains the single line: {"values":[1,2,3]} and with code like the following:
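
The code itself is truncated in this digest; it was presumably along these lines (the table name is a placeholder, and sqlContext must be a HiveContext since saveAsTable needs Hive support):

// Reproduce the report: auto-detect the schema from JSON, then save as a table.
val json = sqlContext.jsonFile("test.json") // schema auto-detection
json.printSchema()                          // shows values: array (containsNull = false)
json.saveAsTable("test_table")              // where the trouble starts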

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Tathagata Das
Can you elaborate on the usage pattern that led to "cannot compute split"? Are you using the RDDs generated by DStreams outside the DStream logic? Something like running interactive Spark jobs (independent of the Spark Streaming ones) on RDDs generated by DStreams? If that is the case, what is

Re: Kryo NPE with Array

2014-11-26 Thread Simone Franzini
I guess I already have the answer to what I have to do here, which is to configure the kryo object with the strategy as above. Now the question becomes: how can I pass this custom kryo configuration to the spark kryo serializer / kryo registrator? I've had a look at the code but I am still fairly
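
For the second question, the usual hook in Spark 1.x is a custom KryoRegistrator, which receives every Kryo instance Spark creates and can configure it arbitrarily. A sketch (class and package names are placeholders):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Spark hands each Kryo instance it creates to this registrator, so any
// instantiator-strategy or registration tweaks can be applied here.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[String]])
    // ...apply the custom configuration discussed in this thread here.
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")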

Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Kelly, Jonathan
After playing around with this a little more, I discovered that: 1. If test.json contains something like {"values":[null,1,2,3]}, the schema auto-determined by SchemaRDD.jsonFile() will have element: integer (containsNull = true), and then SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work

Re: configure to run multiple tasks on a core

2014-11-26 Thread Matei Zaharia
Instead of SPARK_WORKER_INSTANCES you can also set SPARK_WORKER_CORES, to have one worker that thinks it has more cores. Matei On Nov 26, 2014, at 5:01 PM, Yotto Koga yotto.k...@autodesk.com wrote: Thanks Sean. That worked out well. For anyone who happens onto this post and wants to do
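
In spark-env.sh terms, the two alternatives discussed in this thread come down to roughly the following (values are illustrative):

# Option A (Sean): two workers per machine, each claiming the full core count
export SPARK_WORKER_INSTANCES=2
# Option B (Matei): one worker that simply claims more cores than exist,
# e.g. 2x the physical count so two tasks share each core
export SPARK_WORKER_CORES=16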

Re: IDF model error

2014-11-26 Thread Shivani Rao
Thanks Yanbo, I wonder why SSV does not complain when I create it using new SSV(4, Array(1, 3, 5, 7))? Is there no error check for this even in the breeze sparse vector's constructor? That is very strange. Shivani On Tue, Nov 25, 2014 at 7:25 PM, Yanbo Liang yanboha...@gmail.com wrote: Hi

can't get smallint field from hive on spark

2014-11-26 Thread 诺铁
Hi, I don't know whether this question should be asked here; if not, please point me elsewhere, thanks. We are currently using hive on spark; when reading a smallint field, it reports the error: "Cannot get field 'i16Val' because union is currently set to i32Val". I googled and found only the source code of

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Michael Armbrust
In the past I have worked around this problem by avoiding sc.textFile(). Instead I read the data directly inside of a Spark job. Basically, you start with an RDD where each entry is a file in S3 and then flatMap that with something that reads the files and returns the lines. Here's an example:
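
Michael's example link is cut off above; a hedged sketch of the pattern he describes (the AWS SDK usage is illustrative, and bucket/keys are placeholders):

import com.amazonaws.services.s3.AmazonS3Client
import scala.io.Source

// Parallelize the file list itself, then read each object inside the job,
// so the driver never does a serial metadata pass over S3.
val keys = Seq("logs/file1", "logs/file2")
val lines = sc.parallelize(keys).flatMap { key =>
  val s3 = new AmazonS3Client() // credentials from the default provider chain
  val in = s3.getObject("mybucket", key).getObjectContent
  Source.fromInputStream(in).getLines().toList // materialize before the stream closes
}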

Re: can't get smallint field from hive on spark

2014-11-26 Thread Michael Armbrust
This has been fixed in Spark 1.1.1 and Spark 1.2 https://issues.apache.org/jira/browse/SPARK-3704 On Wed, Nov 26, 2014 at 7:10 PM, 诺铁 noty...@gmail.com wrote: hi, don't know whether this question should be asked here, if not, please point me out, thanks. we are currently using hive on

RE: configure to run multiple tasks on a core

2014-11-26 Thread Yotto Koga
Indeed. That's nice. Thanks! yotto From: Matei Zaharia [matei.zaha...@gmail.com] Sent: Wednesday, November 26, 2014 6:11 PM To: Yotto Koga Cc: Sean Owen; user@spark.apache.org Subject: Re: configure to run multiple tasks on a core Instead of

RDDs join problem: incorrect result

2014-11-26 Thread liuboya
Hi, I ran into a problem when doing a join operation on two RDDs. For example, RDDa: RDD[(String,String)] and RDDb: RDD[(String,Int)]. Then, the result RDDc: RDD[(String,(String,Int))] = RDDa.join(RDDb). But I find the results in RDDc are incorrect compared with RDDb. What's wrong with the join? -- View this

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
Did you set the spark master as local[*]? If so, then it means that the number of executors is equal to the number of cores of the machine. Perhaps your Mac machine has more cores (certainly more than the number of kinesis shards + 1). Try explicitly setting the master as local[N] where N is the number of kinesis shards
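
For example, with 2 shards, a sketch:

// 2 Kinesis shards occupy 2 receiver slots, so at least one more core is
// needed for processing; local[3] is the minimum here.
val conf = new SparkConf().setMaster("local[3]").setAppName("KinesisTest")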

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-26 Thread Cheng Lian
Thanks Takuya! Will take a look into it later. And sorry for not being able to review all the PRs in time recently (mostly because of rushing Spark 1.2 release and Thanksgiving :) ). On 11/27/14 1:35 AM, Takuya UESHIN wrote: Hi, I guess this is fixed by

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
What’s the command line you used to build Spark? Notice that you need to add -Phive-thriftserver to build the JDBC Thrift server. This profile was once removed in v1.1.0, but added back in v1.2.0 because of a dependency issue introduced by Scala 2.11 support. On 11/27/14 12:53 AM,

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread vdiwakar.malladi
Thanks for your response. I'm using the following command. mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package Regards. -- View this message in context:

Re: Submiting Spark application through code

2014-11-26 Thread sivarani
I am trying to submit a spark streaming program. When I submit a batch process it's working, but when I do the same with spark streaming it throws the following. Anyone please help. 14/11/26 17:42:25 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:50016 14/11/26 17:42:25 INFO server.Server:

Undirected Graphs in GraphX-Pregel

2014-11-26 Thread Deep Pradhan
Hi, I was going through the paper on Pregel titled "Pregel: A System for Large-Scale Graph Processing". In the second section, "Model of Computation", it says that the input to a Pregel computation is a directed graph. Is it the same in the Pregel abstraction of GraphX too? Do we always

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
What version are you trying to build? I was at first assuming you're using the most recent master, but from your first mail it seems that you were trying to build Spark v1.1.0? On 11/27/14 12:57 PM, vdiwakar.malladi wrote: Thanks for your response. I'm using the following command. mvn

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread vdiwakar.malladi
Yes, I'm building it from Spark 1.1.0 Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-generate-assembly-jar-which-includes-jdbc-thrift-server-tp19887p19937.html Sent from the Apache Spark User List mailing list archive at

Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Yin Huai
Hello Jonathan, There was a bug regarding casting data types before inserting into a Hive table. Hive does not have the notion of containsNull for array values. So, for a Hive table, containsNull will always be true for an array and we should ignore this field for Hive. This issue has been

Re: can't get smallint field from hive on spark

2014-11-26 Thread Yin Huai
For hive on spark, did you mean the thrift server of Spark SQL or https://issues.apache.org/jira/browse/HIVE-7292? If you meant the latter one, I think Hive's mailing list will be a good place to ask (see https://hive.apache.org/mailing_lists.html). Thanks, Yin On Wed, Nov 26, 2014 at 10:49 PM,

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
Hm, then the command line you used should be fine. Actually I just tried it locally and it’s fine. Make sure to run it in the root directory of the Spark source tree (don’t cd into assembly). On 11/27/14 1:35 PM, vdiwakar.malladi wrote: Yes, I'm building it from Spark 1.1.0 Thanks in advance.

Opening Spark on IntelliJ IDEA

2014-11-26 Thread Taeyun Kim
Hi, I'm trying to open the Spark source code with IntelliJ IDEA. I opened pom.xml in the Spark source code root directory. The project tree is displayed in the Project tool window. But when I open a source file, say org.apache.spark.deploy.yarn.ClientBase.scala, a lot of red marks show on the

Re: Spark Streaming with Python

2014-11-26 Thread Nicholas Chammas
What version of Spark are you running? A Python API for Spark Streaming is only available via GitHub at the moment and has not been released in any version of Spark. On Tue, Nov 25, 2014 at 10:23 AM, Venkat, Ankam ankam.ven...@centurylink.com wrote: Any idea how to resolve this? Regards,

[no subject]

2014-11-26 Thread rapelly kartheek
Hi, I've been fiddling with the spark/*/storage/blockManagerMasterActor.getPeers() definition in the context of blockManagerMaster.askDriverWithReply() sending a GetPeers() request. 1) I couldn't understand what 'selfIndex' is used for. 2) Also, I tried modifying the 'peers' array by just

RE: Opening Spark on IntelliJ IDEA

2014-11-26 Thread Taeyun Kim
Hi, Some information about the error. In the File | Project Structure window, the following error message is displayed with a pink background: Library 'Maven: org.scala-lang:scala-compiler-bundle:2.10.4' is not used Can it be a hint? From: Taeyun Kim [mailto:taeyun@innowireless.com]

Re: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-26 Thread Yanbo Liang
Hi Tri, Maybe my latest response to your problem was lost; in any case, the following code snippet runs correctly. val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(args(3).toInt)) model.algorithm.setIntercept(true) Because all the setXXX() functions in

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
I see. As the exception stated, Maven can’t find unzip to help build PySpark. So you need a Windows version of unzip (probably from MinGW or Cygwin?). On 11/27/14 2:10 PM, vdiwakar.malladi wrote: Thanks for your prompt responses. I'm generating assembly jar file from windows 7

Re: can't get smallint field from hive on spark

2014-11-26 Thread 诺铁
I meant the latter... thanks On Thu, Nov 27, 2014 at 1:42 PM, Yin Huai huaiyin@gmail.com wrote: For hive on spark, did you mean the thrift server of Spark SQL or https://issues.apache.org/jira/browse/HIVE-7292? If you meant the latter one, I think Hive's mailing list will be a good place to

updateStateByKey

2014-11-26 Thread Sunil Yarram
I have a use case that requires a huge number of keys' state to be stored and updated with the latest values from the stream. I am planning to use updateStateByKey with checkpointing. I would like to know the performance implications of updateStateByKey as the number of keys stored in the state grows
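
For context, a minimal sketch of the pattern in question, assuming a DStream[(String, Int)] named pairs (a running count per key, with checkpointing enabled as updateStateByKey requires):

// State for every key ever seen is carried forward on each batch,
// which is exactly where the scaling concern lies.
def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

ssc.checkpoint("hdfs:///checkpoints") // required by updateStateByKey
val counts = pairs.updateStateByKey(updateFunc)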

Re: Spark Job submit

2014-11-26 Thread Akhil Das
Try to add your cluster's core-site.xml, yarn-site.xml, and hdfs-site.xml to the CLASSPATH (and to SPARK_CLASSPATH) and submit the job. Thanks Best Regards On Thu, Nov 27, 2014 at 12:24 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Code is in my windows machine and cluster is in some

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Hi TD, I am using Spark Streaming to consume data from Kafka and do some aggregation and ingest the results into RDS. I do use foreachRDD in the program. I am planning to use Spark streaming in our production pipeline and it performs well in generating the results. Unfortunately, we plan to have

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2014-11-26 Thread liuboya
I'm waiting online. Who can help me, please? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-java-lang-NoSuchMethodError-org-apache-spark-graphx-Graph-apply-tp19958p19959.html Sent from the Apache Spark User List mailing list archive at Nabble.com.