Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
Is this map creation happening on the client side? But how does it know which RS will contain that row key in the put operation without first asking the .META. table? Does the HBase client first get the key ranges of each RegionServer and then group Put objects by region server? On Fri, Jul 17,

RE: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32

2015-07-17 Thread Wang, Ningjun (LNG-NPV)
Does anybody have any idea what causes this problem? Thanks. Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Wednesday, July 15, 2015 11:09 AM To: user@spark.apache.org Subject: java.lang.NoClassDefFoundError: Could not initialize class org.fusesource.jansi.internal.Kernel32 I just installed spark

Re: spark streaming job to hbase write

2015-07-17 Thread Ted Yu
It resorts to the following method for finding region location: private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, boolean useCache, boolean retry, int replicaId) throws IOException { Note: useCache value is true in this call path. Meaning the client
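A minimal sketch of that lookup from the client side, assuming the HBase 1.x API (the table name is hypothetical); after the first hbase:meta round trip the client answers from its region-location cache:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory

    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    // RegionLocator consults the client-side cache; only a miss goes to hbase:meta
    val locator = conn.getRegionLocator(TableName.valueOf("my_table"))
    val loc = locator.getRegionLocation("some-row-key".getBytes("UTF-8"))
    println(loc.getServerName)  // region server currently hosting this key's region
    locator.close(); conn.close()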

Re: Spark APIs memory usage?

2015-07-17 Thread Akhil Das
Can you paste the code? How much memory does your system have and how big is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)? Thanks Best Regards On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma harit.vishwaka...@gmail.com wrote: Thanks, Code is running on a single
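For reference, a minimal sketch of the suggested persistence call (Scala shown; PySpark's persist takes a StorageLevel the same way):

    import org.apache.spark.storage.StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions to disk rather than recompute or fail
    df.count()                                // the first action materializes the cache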

Re: Spark APIs memory usage?

2015-07-17 Thread Harit Vishwakarma
1. load 3 matrices of size ~ 1 X 1 using numpy. 2. rdd2 = rdd1.values().flatMap( fun ) # rdd1 has roughly 10^7 tuples 3. df = sqlCtx.createDataFrame(rdd2) 4. df.save() # in parquet format It throws an exception in the createDataFrame() call. I don't know what exactly it is creating; everything

Re: Problem in Understanding concept of Physical Cores

2015-07-17 Thread Aniruddh Sharma
Dear Community, Requesting help on the queries below; they are unanswered. Thanks and Regards Aniruddh On Wed, Jul 15, 2015 at 12:37 PM, Aniruddh Sharma asharma...@gmail.com wrote: Hi TD, Request your guidance on the 5 queries below. Following is the context of them that I would use to evaluate

Re: Spark APIs memory usage?

2015-07-17 Thread Harit Vishwakarma
Thanks, Code is running on a single machine. And it still doesn't answer my question. On Fri, Jul 17, 2015 at 4:52 PM, ayan guha guha.a...@gmail.com wrote: You can bump up number of partitions while creating the rdd you are using for df On 17 Jul 2015 21:03, Harit Vishwakarma

Is it possible to set the number of cores per executor on standalone cluster?

2015-07-17 Thread Zheng, Xudong
Is it possible to set the number of cores per executor on a standalone cluster? We find that core distribution across executors can become very skewed at times, so the workload is skewed as well, which makes our job slow. Thanks! -- 郑旭东 Zheng, Xudong
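A sketch of the settings involved, assuming a release whose standalone master honors spark.executor.cores (verify against your version; the values are illustrative):

    # spark-defaults.conf
    spark.cores.max        32   # total cores the application may claim on the cluster
    spark.executor.cores   4    # cap per executor, which evens out the spread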

Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
Thanks! My key is random (hexadecimal), so a hot spot should not be created. Is there any concept of a bulk put? Say I want to issue one put request for a batch of 1000 that hits a region server, instead of an individual put for each key. HTable.put(List<Put>) Does this handle batching of put
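A sketch of one way to batch puts per partition from a Spark job, assuming the HBase 1.x client API and an RDD of string pairs (table name, column family and batch size are hypothetical); Table.put(List<Put>) does group the batch by region server internally:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    rdd.foreachPartition { iter =>
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("my_table"))
      val buffer = new java.util.ArrayList[Put]()
      iter.foreach { case (key, value) =>
        val p = new Put(Bytes.toBytes(key))
        p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value))
        buffer.add(p)
        if (buffer.size >= 1000) { table.put(buffer); buffer.clear() }  // flush in batches of 1000
      }
      if (!buffer.isEmpty) table.put(buffer)  // flush the remainder
      table.close(); conn.close()
    }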

Re: spark streaming job to hbase write

2015-07-17 Thread Ted Yu
Internally AsyncProcess uses a Map which is keyed by server name: Map<ServerName, MultiAction<Row>> actionsByServer = new HashMap<ServerName, MultiAction<Row>>(); Here MultiAction would group the Puts in your example which are destined for the same server. Cheers On Fri, Jul 17, 2015 at 5:15

Re: Spark APIs memory usage?

2015-07-17 Thread Akhil Das
I suspect it's numpy filling up memory. Thanks Best Regards On Fri, Jul 17, 2015 at 5:46 PM, Harit Vishwakarma harit.vishwaka...@gmail.com wrote: 1. load 3 matrices of size ~ 1 X 1 using numpy. 2. rdd2 = rdd1.values().flatMap( fun ) # rdd1 has roughly 10^7 tuples 3. df =

RE: Select all columns except some

2015-07-17 Thread Saif.A.Ellafi
Hello, thank you for your time. Seq[String] works perfectly fine. I also tried running a for loop through all the elements to see if access to any value was broken, but no, they are alright. For now, I solved it by calling this. Sadly, it takes a lot of time, but it works: var data_sas =

Adding meetup groups to Community page - Moscow, Slovenia, Zagreb

2015-07-17 Thread Andrew Vykhodtsev
Dear all, The page https://spark.apache.org/community.html says: If you'd like your meetup added, email user@spark.apache.org. So here I am emailing; could someone please add three new groups to the page? Moscow : http://www.meetup.com/Apache-Spark-in-Moscow/ Slovenija (Ljubljana)

Nullpointer when saving as table with a timestamp column type

2015-07-17 Thread Brandon White
So I have a very simple dataframe that looks like df: [name:String, Place:String, time: time:timestamp] I build this java.sql.Timestamp from a string and it works really well except when I call saveAsTable(tableName) on this df. Without the timestamp, it saves fine, but with the timestamp, it

Re: Spark streaming Processing time keeps increasing

2015-07-17 Thread Tathagata Das
Responses inline. On Thu, Jul 16, 2015 at 9:27 PM, N B nb.nos...@gmail.com wrote: Hi TD, Yes, we do have the invertible function provided. However, I am not sure I understood how to use the filterFunction. Is there an example somewhere showing its usage? The header comment on the function
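For later readers, a sketch of where filterFunc plugs into the invertible reduceByKeyAndWindow (the key/value types and the zero-count eviction condition are assumptions for illustration):

    import org.apache.spark.streaming.Seconds

    // pairs: DStream[(String, Long)], hypothetical counts per key
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // add values entering the window
      (a: Long, b: Long) => a - b,   // subtract values leaving the window
      Seconds(300),                  // window duration
      Seconds(10),                   // slide duration
      filterFunc = (kv: (String, Long)) => kv._2 != 0  // evict keys whose count fell to zero
    )

Without the filterFunc, keys whose counts have dropped to zero stay in the window state forever, which is one way processing time keeps creeping up.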

use S3-Compatible Storage with spark

2015-07-17 Thread Schmirr Wurst
Hi, I wonder how to use S3-compatible storage in Spark. If I'm using the s3n:// URL scheme, then it will point to Amazon; is there a way I can specify the host somewhere?

Re: Java 8 vs Scala

2015-07-17 Thread Marius Danciu
If you take the time to actually learn Scala starting from its fundamental concepts AND, quite importantly, get familiar with general functional programming concepts, you'd immediately realize what you'd really miss going back to Java (8). On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła

Spark Master HA on YARN

2015-07-17 Thread Bhaskar Dutta
Hi, Is Spark master high availability supported on YARN (yarn-client mode) analogous to https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability? Thanks Bhaskie

Unread block data error

2015-07-17 Thread Jem Tucker
Hi, I have been running a batch of data through my application for the last couple of days and this morning discovered it had fallen over with the following error. java.lang.IllegalStateException: unread block data at

Re: it seem like the exactly once feature not work on spark1.4

2015-07-17 Thread JoneZhang
I see now. There are three steps in Spark Streaming + Kafka data processing: 1. Receiving the data 2. Transforming the data 3. Pushing out the data Spark Streaming + Kafka only provides an exactly-once guarantee on steps 1 and 2. We need to ensure exactly-once on step 3 ourselves. More details see base
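One common way to get step 3 to effectively-exactly-once with the direct Kafka stream is to make the writes idempotent by keying them on the batch's offset ranges, so a replayed batch overwrites instead of duplicating. A sketch (the sink write itself is left abstract):

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directKafkaStream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { records =>
        // the direct stream's partitions line up 1:1 with Kafka offset ranges
        val range = offsetRanges(TaskContext.get.partitionId)
        // (range.topic, range.partition, range.fromOffset) is deterministic across retries:
        // write the partition's records under that key so a replay is a harmless overwrite
        records.foreach { record => /* idempotent write keyed by the range */ }
      }
    }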

Re: Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread spark user
Hi Roberto, I have a question regarding HiveContext. When you create a HiveContext, where do you define the Hive connection properties? Suppose Hive is not on the local machine and I need to connect; how will HiveContext know the database info like URL, username and password? String username = ; String

Re: ALS run method versus ALS train versus ALS fit and transform

2015-07-17 Thread Carol McDonald
the new ALS()...run() form is underneath both of the first two. I am not sure what you mean by underneath; so basically, does the mllib ALS()...run() do the same thing as the mllib ALS.train()? On Wed, Jul 15, 2015 at 2:02 PM, Sean Owen so...@cloudera.com wrote: The first two examples are from

Re: ALS run method versus ALS train versus ALS fit and transform

2015-07-17 Thread Sean Owen
Yes, just have a look at the method in the source code. It calls new ALS()...run(). It's a convenience wrapper only. On Fri, Jul 17, 2015 at 4:59 PM, Carol McDonald cmcdon...@maprtech.com wrote: the new ALS()...run() form is underneath both of the first two. I am not sure what you mean by

RE: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Lee
I have encountered the same problem after following the document. Here's my spark-defaults.conf: spark.shuffle.service.enabled true spark.dynamicAllocation.enabled true spark.dynamicAllocation.executorIdleTimeout 60 spark.dynamicAllocation.cachedExecutorIdleTimeout 120

spark-shell with Yarn failed

2015-07-17 Thread Amjad ALSHABANI
Hello, First of all, I'm a newbie in Spark. I'm trying to start the spark-shell with a YARN cluster by running: $ spark-shell --master yarn-client Sometimes it goes well, but most of the time I get an error: Container exited with a non-zero exit code 10 Failing this attempt. Failing the

Re: use S3-Compatible Storage with spark

2015-07-17 Thread Sujit Pal
Hi Schmirr, The part after the s3n:// is your bucket name and folder name, ie s3n://${bucket_name}/${folder_name}[/${subfolder_name}]*. Bucket names are unique across S3, so the resulting path is also unique. There is no concept of hostname in s3 urls as far as I know. -sujit On Fri, Jul 17,

Re: Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread Michael Armbrust
Using a hive-site.xml file on the classpath. On Fri, Jul 17, 2015 at 8:37 AM, spark user spark_u...@yahoo.com.invalid wrote: Hi Roberto I have question regarding HiveContext . when you create HiveContext where you define Hive connection properties ? Suppose Hive is not in local machine i

streaming and piping to R, sending all data in window to pipe()

2015-07-17 Thread PAULI, KEVIN CHRISTIAN [AG-Contractor/1000]
Spark newbie here, using Spark 1.3.1. I’m consuming a stream and trying to pipe the data from the entire window to R for analysis. The R algorithm needs the entire dataset from the stream (everything in the window) in order to function properly; it can’t be broken up. So I tried doing a
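One hedged approach: coalesce the windowed RDD to a single partition so pipe() launches exactly one R process per batch and feeds it the whole window on stdin (the script path is hypothetical):

    import org.apache.spark.streaming.Seconds

    val windowed = stream.window(Seconds(60), Seconds(60))
    windowed.foreachRDD { rdd =>
      // one partition => one Rscript invocation that sees the entire window
      val results = rdd.coalesce(1).pipe("Rscript /path/to/analyze.R").collect()
      results.foreach(println)
    }

The obvious trade-off is that the whole window must fit through one task on one executor.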

exception raised during large spark job against cassandra ring

2015-07-17 Thread Bosung Seo
Hello, I have been having trouble getting large Spark jobs to complete against my Cassandra ring. I’m finding that the CPU goes to 100% on one of the nodes, and then, after many hours, the job fails. Here are my Spark settings: .set("spark.cassandra.input.split.size_in_mb", "128")

Re: Nullpointer when saving as table with a timestamp column type

2015-07-17 Thread ??
df: [name:String, Place:String, time: time:timestamp] why not df: [name:String, Place:String, time:timestamp] -- From: Brandon White <bwwintheho...@gmail.com> Date: 2015-07-17 (Fri) 2:18 To: user <user@spark.apache.org> Subject:

Re: it seem like the exactly once feature not work on spark1.4

2015-07-17 Thread Tathagata Das
Yes. More information in my talk - https://www.youtube.com/watch?v=d5UJonrruHk On Fri, Jul 17, 2015 at 1:15 AM, JoneZhang joyoungzh...@gmail.com wrote: I see now. There are three steps in SparkStreaming + Kafka date processing 1.Receiving the data 2.Transforming the data 3.Pushing out the

Spark 1.3.1 + Hive: write output to CSV with header on S3

2015-07-17 Thread Roberto Coluccio
Hello community, I'm currently using Spark 1.3.1 with Hive support for outputting processed data to an external Hive table backed by S3. I'm using a manual specification of the delimiter, but I'd like to know if there is any clean way to write in CSV format: val sparkConf = new SparkConf()
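If pulling in the spark-csv package (com.databricks:spark-csv) is an option, a sketch against the Spark 1.3 save API; the output path and options are illustrative:

    import org.apache.spark.sql.SaveMode

    df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map(
      "path"      -> "s3n://my-bucket/processed/",
      "header"    -> "true",   // emit a header row
      "delimiter" -> ","))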

Re: Spark streaming Processing time keeps increasing

2015-07-17 Thread N B
Hi TD, Thanks for the response. I do believe I understand the concept and the need for the filterfunction now. I made the requisite code changes and keeping it running overnight to see the effect of it. Hopefully this should fix our issue. However, there was one place where I encountered a

Re: what is : ParquetFileReader: reading summary file ?

2015-07-17 Thread Cheng Lian
Yeah, Spark SQL Parquet support needs to do some metadata discovery when first importing a folder containing Parquet files, and the discovered metadata is cached. Cheng On 7/17/15 1:56 PM, shsh...@tsmc.com wrote: Hi all, our scenario is to generate lots of folders containing parquet files and

Re: pyspark 1.4 udf change date values

2015-07-17 Thread Luis Guerra
Sure, I have created JIRA SPARK-9131 - UDF change data values https://issues.apache.org/jira/browse/SPARK-9131 On Thu, Jul 16, 2015 at 7:09 PM, Davies Liu dav...@databricks.com wrote: Thanks for reporting this, could you file a JIRA for it? On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra

Spark APIs memory usage?

2015-07-17 Thread Harit Vishwakarma
Hi, I used the createDataFrame API of SQLContext in Python and am getting an OutOfMemoryException. I am wondering if it is creating the whole DataFrame in memory? I did not find any documentation describing the memory usage of Spark APIs. The documentation given is nice, but a little more detail (especially on memory

Re: Spark APIs memory usage?

2015-07-17 Thread ayan guha
You can bump up the number of partitions while creating the RDD you are using for the df On 17 Jul 2015 21:03, Harit Vishwakarma harit.vishwaka...@gmail.com wrote: Hi, I used createDataFrame API of SqlContext in python. and getting OutOfMemoryException. I am wondering if it is creating whole

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Arun Ahuja
Hi Sean, Thanks for the reply! I did double-check that the jar is the one I think I am running: jar tf /hpc/users/ahujaa01/src/spark/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar | grep netlib | grep Native

Re: use S3-Compatible Storage with spark

2015-07-17 Thread Ankur Chauhan
The endpoint is the property you want to set. I would look at the source for that. Sent from my iPhone On Jul 17, 2015, at 08:55, Sujit Pal sujitatgt...@gmail.com wrote: Hi Schmirr, The part after the s3n:// is your bucket name and folder name, ie
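For the archives, a sketch of overriding the endpoint with the s3a connector (requires the hadoop-aws module, Hadoop 2.6+; the host and credentials are placeholders):

    // point s3a at an S3-compatible store instead of AWS
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "storage.example.com")
    sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")
    val lines = sc.textFile("s3a://my-bucket/data/*.txt")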

Spark and SQL Server

2015-07-17 Thread Young, Matthew T
Hello, I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s 4.2 JDBC Driver. Reading from the database works ok, but I have encountered a couple of issues writing back. In Scala 2.10 I can write back to the database except for a couple of types. 1. When I read a

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Sean Owen
Make sure /usr/lib64 contains libgfortran.so.3; that's really the issue. I'm pretty sure the answer is 'yes', but, make sure the assembly has jniloader too. I don't see why it wouldn't, but, that's needed. What is your env like -- local, standalone, YARN? how are you running? Just want to make

Data frames select and where clause dependency

2015-07-17 Thread Mike Trienis
I'd like to understand why the where field must exist in the select clause. For example, the following select statement works fine - df.select(field1, filter_field).filter(df(filter_field) === value).show() However, the next one fails with the error in operator !Filter (filter_field#60 =

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Sandy Ryza
Can you try setting the spark.yarn.jar property to make sure it points to the jar you're thinking of? -Sandy On Fri, Jul 17, 2015 at 11:32 AM, Arun Ahuja aahuj...@gmail.com wrote: Yes, it's a YARN cluster and using spark-submit to run. I have SPARK_HOME set to the directory above and using

MapType vs StructType

2015-07-17 Thread Corey Nolet
I notice JSON objects are all parsed as Map[String,Any] in Jackson but for some reason, the inferSchema tools in Spark SQL extracts the schema of nested JSON objects as StructTypes. This makes it really confusing when trying to rectify the object hierarchy when I have maps because the Catalyst

Re: Store DStreams into Hive using Hive Streaming

2015-07-17 Thread unk1102
Hi, I have a similar use case. Did you find a solution to this problem of loading DStreams into Hive using Spark Streaming? Please guide. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Store-DStreams-into-Hive-using-Hive-Streaming-tp18307p23885.html Sent

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Arun Ahuja
Yes, it's a YARN cluster and using spark-submit to run. I have SPARK_HOME set to the directory above and using the spark-submit script from there. bin/spark-submit --master yarn-client --executor-memory 10g --driver-memory 8g --num-executors 400 --executor-cores 1 --class

Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules (i.e., numpy) to be shipped with)

2015-07-17 Thread Elkhan Dadashov
Hi all, After the SPARK-5479 https://issues.apache.org/jira/browse/SPARK-5479 fix (thanks to Marcelo Vanzin), pyspark now handles the addition of several python files (or a zip folder with __init__.py) to PYTHONPATH correctly in yarn-cluster mode. But adding a python module as a zip folder still fails

Model Save function (ML-Lib)

2015-07-17 Thread Guillaume Guy
Hi: I'm using pyspark 1.3 and it seems that model.save is not implemented for every model. Here is what I have so far:

Model Name            Model Class                  save available
Logistic Regression   LogisticRegressionModel      NO
Random Forest         TreeEnsembleModel            OK
GBM                   GradientBoostedTreesModel    OK
SVM

Re: Command builder problem when running worker in Windows

2015-07-17 Thread Elkhan Dadashov
Run Spark with the --verbose flag to see what it read for that path. I guess in Windows if you are using backslashes, you need 2 of them (\\), or just use forward slashes everywhere. On Fri, Jul 17, 2015 at 2:40 PM, Julien Beaudan jbeau...@stottlerhenke.com wrote: Hi, I running a stand-alone

Re: Command builder problem when running worker in Windows

2015-07-17 Thread Elkhan Dadashov
Are you running it from command line (CLI) or through SparkLauncher ? If you can share the command (./bin/spark-submit ...) or the code snippet you are running, then it can give some clue. On Fri, Jul 17, 2015 at 3:30 PM, Julien Beaudan jbeau...@stottlerhenke.com wrote: Hi Elkhan, I ran

Re: BroadCast on Interval ( eg every 10 min )

2015-07-17 Thread ayan guha
The simple answer is you should not update broadcast variable. If you can post the problem you are handling, people here should be able to provide better suggestions. On 18 Jul 2015 13:53, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Broadcasted variables are immutable. Anyway, how

Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
I'll add there is a JIRA to override the default past some threshold of # of unique keys: https://issues.apache.org/jira/browse/SPARK-4476 On Fri, Jul 17, 2015 at 1:32 PM, Michael Armbrust mich...@databricks.com wrote: The difference between a

Re: MapType vs StructType

2015-07-17 Thread Michael Armbrust
The difference between a map and a struct here is that in a struct all possible keys are defined as part of the schema and each can have a different type (and we don't support union types). JSON doesn't have differentiated data structures so we go with the one that gives you more information
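A sketch of the distinction in schema terms (field names hypothetical):

    import org.apache.spark.sql.types._

    // struct: the keys are part of the schema, and each can have its own type
    val asStruct = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // map: keys are data, so every value must share a single type
    val asMap = MapType(StringType, StringType, valueContainsNull = true)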

Re: MapType vs StructType

2015-07-17 Thread Corey Nolet
This helps immensely. Thanks Michael! On Fri, Jul 17, 2015 at 4:33 PM, Michael Armbrust mich...@databricks.com wrote: I'll add there is a JIRA to override the default past some threshold of # of unique keys: https://issues.apache.org/jira/browse/SPARK-4476

Cleanup when tasks generate errors

2015-07-17 Thread sim
I've observed a number of cases where Spark does not clean up HDFS side-effects on errors, especially out-of-memory conditions. Here is an example from the following code snippet executed in spark-shell: import org.apache.spark.sql.hive.HiveContext; import org.apache.spark.sql.SaveMode; val ctx =

Re: Command builder problem when running worker in Windows

2015-07-17 Thread Julien Beaudan
Hi Elkhan, I ran Spark with --verbose, but the output looked the same to me - what should I be looking for? At the beginning, the system properties which are set are: System properties: SPARK_SUBMIT - true spark.app.name - tests.testFileReader spark.jars -

Re: What is java.sql.SQLException: Unsupported type -101?

2015-07-17 Thread Ted Yu
Looking at getCatalystType(): "Maps a JDBC type to a Catalyst type. This function is called only when the JdbcDialect class corresponding to your database driver returns null." sqlType was carrying the value -101. However, I couldn't find -101 in
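So a dialect that recognizes -101 (reportedly Oracle's TIMESTAMP WITH TIME ZONE; treat that mapping as an assumption) could be registered along these lines, per the Spark 1.4 jdbc dialect developer API:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    import org.apache.spark.sql.types._

    object OracleTzDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
      override def getCatalystType(sqlType: Int, typeName: String, size: Int,
          md: MetadataBuilder): Option[DataType] =
        if (sqlType == -101) Some(TimestampType) else None  // map the vendor-specific code
    }

    JdbcDialects.registerDialect(OracleTzDialect)  // takes effect for subsequent jdbc loads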

Re: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Or
Hi all, Did you forget to restart the node managers after editing yarn-site.xml by any chance? -Andrew 2015-07-17 8:32 GMT-07:00 Andrew Lee alee...@hotmail.com: I have encountered the same problem after following the document. Here's my spark-defaults.conf spark.shuffle.service.enabled
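For reference, the yarn-site.xml entries the dynamic-allocation setup calls for; the spark-<version>-yarn-shuffle.jar must also be on every node manager's classpath, and the node managers restarted:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>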

Re: Command builder problem when running worker in Windows

2015-07-17 Thread Julien Beaudan
Oh, yeah of course. I'm writing from the command line (I haven't tried the SparkLauncher), using bin/spark-submit --class tests.testFileReader --master spark://192.168.194.128:7077 --verbose ./sparkTest1.jar All that the testFileReader class does is create an RDD from a few text files -

What is java.sql.SQLException: Unsupported type -101?

2015-07-17 Thread Sambit Tripathy (RBEI/EDS1)
Hi, I was trying to get an Oracle table using the JDBC data source val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:USER/p...@host.com:1517:sid", "dbtable" -> "USER.TABLE", "driver" -> "oracle.jdbc.OracleDriver")) and got the error below java.sql.SQLException: Unsupported type -101 at

Command builder problem when running worker in Windows

2015-07-17 Thread Julien Beaudan
Hi, I am running a stand-alone cluster in Windows 7, and when I try to run any worker on the machine, I get the following error: 15/07/17 14:14:43 ERROR ExecutorRunner: Error running executor java.io.IOException: Cannot run program

Re: Data frames select and where clause dependency

2015-07-17 Thread Michael Armbrust
Each operation on a dataframe is completely independent and doesn't know what operations happened before it. When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on. On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis
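Which suggests the usual fix: filter first, while the column is still present, then project it away:

    // filter while filter_field still exists, then drop it in the projection
    df.filter(df("filter_field") === value)
      .select("field1")
      .show()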

RE: What is java.sql.SQLException: Unsupported type -101?

2015-07-17 Thread Sambit Tripathy (RBEI/EDS1)
Does this mean there is a possible mismatch between the JDBC driver and Oracle? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, July 17, 2015 2:09 PM To: Sambit Tripathy (RBEI/EDS1) Cc: user@spark.apache.org Subject: Re: What is java.sql.SQLException: Unsupported type -101? Looking at