Is this map creation happening on the client side?
But how does it know which RS will contain that row key in a put operation
without asking the .META. table?
Does the HBase client first get the key ranges of each RegionServer
and then group Put objects by region server?
On Fri, Jul 17,
Does anybody have any idea what causes this problem? Thanks.
Ningjun
From: Wang, Ningjun (LNG-NPV)
Sent: Wednesday, July 15, 2015 11:09 AM
To: user@spark.apache.org
Subject: java.lang.NoClassDefFoundError: Could not initialize class
org.fusesource.jansi.internal.Kernel32
I just installed spark
It resorts to the following method for finding the region location:
private RegionLocations locateRegionInMeta(TableName tableName, byte[] row,
    boolean useCache, boolean retry, int replicaId) throws IOException {
Note: useCache value is true in this call path.
Meaning the client
Can you paste the code? How much memory does your system have and how big
is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
Thanks
Best Regards
On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma
harit.vishwaka...@gmail.com wrote:
Thanks,
Code is running on a single machine.
1. load 3 matrices of size ~ 1 X 1 using numpy.
2. rdd2 = rdd1.values().flatMap( fun ) # rdd1 has roughly 10^7 tuples
3. df = sqlCtx.createDataFrame(rdd2)
4. df.save() # in parquet format
It throws an exception in the createDataFrame() call. I don't know what exactly it
is creating; everything
Dear Community
Requesting help on the queries below; they are unanswered.
Thanks and Regards
Aniruddh
On Wed, Jul 15, 2015 at 12:37 PM, Aniruddh Sharma asharma...@gmail.com
wrote:
Hi TD,
Requesting your guidance on the 5 queries below. Following is the context in which
I would use them to evaluate
Thanks,
Code is running on a single machine.
And it still doesn't answer my question.
On Fri, Jul 17, 2015 at 4:52 PM, ayan guha guha.a...@gmail.com wrote:
You can bump up the number of partitions while creating the RDD you are using
for the df.
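For instance, a minimal sketch (Scala shown; sc.parallelize and repartition go by
the same names in PySpark, and the counts here are illustrative):

// Set the partition count when the RDD is created...
val rdd1 = sc.parallelize(1 to 10000000, 1000)
// ...or raise it afterwards, before building the DataFrame.
val rdd2 = rdd1.repartition(2000)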
On 17 Jul 2015 21:03, Harit Vishwakarma
Is it possible to set the number of cores per executor on standalone
cluster?
We find that the distribution of cores across executors can become very skewed
at times, so the workload is skewed, which makes our job slow.
Thanks!
--
郑旭东
Zheng, Xudong
Thanks !
My key is random (hexadecimal), so a hot spot should not be created.
Is there any concept of a bulk put? Say I want to issue one put request for
a batch of 1000 that hits a region server, instead of an individual put
for each key.
HTable.put(List<Put>): does this handle batching of puts?
Internally AsyncProcess uses a Map which is keyed by server name:
Map<ServerName, MultiAction<Row>> actionsByServer =
    new HashMap<ServerName, MultiAction<Row>>();
Here MultiAction would group the Puts in your example which are destined for
the same server.
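To make the batching concrete, here is a hedged sketch in Scala against the
classic HTable API (table, family and qualifier names are hypothetical; newer
HBase versions use Table/addColumn instead of HTable/add):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "my_table") // hypothetical table

// Build 1000 Puts.
val puts = (1 to 1000).map { i =>
  val p = new Put(Bytes.toBytes(f"row-$i%04d"))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(s"value-$i"))
  p
}

// One call submits the whole batch; the client looks up region locations
// (cached from meta) and sends one grouped RPC per region server.
table.put(puts.asJava)
table.close()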
Cheers
On Fri, Jul 17, 2015 at 5:15
I suspect it's numpy filling up memory.
Thanks
Best Regards
On Fri, Jul 17, 2015 at 5:46 PM, Harit Vishwakarma
harit.vishwaka...@gmail.com wrote:
1. load 3 matrices of size ~ 1 X 1 using numpy.
2. rdd2 = rdd1.values().flatMap( fun ) # rdd1 has roughly 10^7 tuples
3. df =
Hello, thank you for your time.
Seq[String] works perfectly fine. I also tried running a for loop through all
elements to see if any access to a value was broken, but no, they are alright.
For now, I solved it by calling this. Sadly, it takes a lot of time, but it
works:
var data_sas =
Dear all,
The page
https://spark.apache.org/community.html
says: If you'd like your meetup added, email user@spark.apache.org.
So here I am emailing: could someone please add three new groups to the page?
Moscow : http://www.meetup.com/Apache-Spark-in-Moscow/
Slovenia (Ljubljana)
So I have a very simple dataframe that looks like
df: [name:String, Place:String, time: time:timestamp]
I build this java.sql.Timestamp from a string, and it works really well
except when I call saveAsTable(tableName) on this df. Without the
timestamp it saves fine, but with the timestamp it
Responses inline.
On Thu, Jul 16, 2015 at 9:27 PM, N B nb.nos...@gmail.com wrote:
Hi TD,
Yes, we do have the invertible function provided. However, I am not sure I
understood how to use the filterFunction. Is there an example somewhere
showing its usage?
The header comment on the function
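For reference, a hedged sketch of that usage (the DStream pairs, the durations
and the partition count are illustrative; the filter prunes keys whose windowed
count has dropped to zero so they don't linger in the checkpointed state):

import org.apache.spark.streaming.Seconds

// pairs: DStream[(String, Int)], e.g. words mapped to 1s
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,        // reduce new data entering the window
  (a: Int, b: Int) => a - b,        // inverse-reduce data leaving the window
  Seconds(300),                     // window duration
  Seconds(10),                      // slide duration
  4,                                // number of partitions
  (kv: (String, Int)) => kv._2 > 0  // filterFunc: drop keys with empty counts
)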
Hi,
I wonder how to use S3-compatible storage in Spark.
If I'm using the s3n:// URL scheme, it will point to Amazon; is there
a way I can specify the host somewhere?
If you take the time to actually learn Scala, starting from its fundamental
concepts AND, quite importantly, get familiar with general functional
programming concepts, you'd immediately realize the things that you'd
really miss going back to Java (8).
On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła
Hi,
Is Spark master high availability supported on YARN (yarn-client mode)
analogous to
https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability?
Thanks
Bhaskie
Hi,
I have been running a batch of data through my application for the last
couple of days and this morning discovered it had fallen over with the
following error.
java.lang.IllegalStateException: unread block data
at
I see now.
There are three steps in Spark Streaming + Kafka data processing:
1. Receiving the data
2. Transforming the data
3. Pushing out the data
Spark Streaming + Kafka only provide an exactly-once guarantee on steps 1 and 2.
We need to ensure exactly-once on step 3 ourselves.
More details see base
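As a hedged sketch of what step 3 can look like with the direct stream
(Spark 1.3+ Kafka API; the broker, topic and the upsert target are
hypothetical), keying writes by offset range makes replays idempotent:

import kafka.serializer.StringDecoder
import org.apache.spark.TaskContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // hypothetical broker
val topics = Set("events")                                     // hypothetical topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { records =>
    val range = offsets(TaskContext.get.partitionId)
    // Idempotent write stub: key the batch by (range.topic,
    // range.partition, range.fromOffset) so a replayed partition
    // overwrites instead of duplicating.
    records.foreach { case (_, value) => () /* upsert(value, range) here */ }
  }
}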
Hi Roberto
I have a question regarding HiveContext.
When you create a HiveContext, where do you define the Hive connection properties?
Suppose Hive is not on the local machine and I need to connect; how will HiveContext
know the database info like URL, username and password?
String username = ;
String
the new ALS()...run() form is underneath both of the first two.
I am not sure what you mean by underneath; so basically the MLlib new ALS()...run()
does the same thing as the MLlib ALS.train()?
On Wed, Jul 15, 2015 at 2:02 PM, Sean Owen so...@cloudera.com wrote:
The first two examples are from
Yes, just have a look at the method in the source code. It calls new
ALS()...run(). It's a convenience wrapper only.
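As a hedged illustration of that equivalence (the rank, iteration and lambda
values are arbitrary):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 3.0)))

// The static convenience method...
val model1 = ALS.train(ratings, 10, 10, 0.01)

// ...is just a wrapper around the builder form it calls underneath.
val model2 = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .run(ratings)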
On Fri, Jul 17, 2015 at 4:59 PM, Carol McDonald cmcdon...@maprtech.com wrote:
the new ALS()...run() form is underneath both of the first two.
I am not sure what you mean by
I have encountered the same problem after following the document.
Here's my spark-defaults.conf:
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
Hello,
First of all, I'm a newbie in Spark.
I'm trying to start the spark-shell with a YARN cluster by running:
$ spark-shell --master yarn-client
Sometimes it goes well, but most of the time I get an error:
Container exited with a non-zero exit code 10
Failing this attempt. Failing the
Hi Schmirr,
The part after the s3n:// is your bucket name and folder name, i.e.
s3n://${bucket_name}/${folder_name}[/${subfolder_name}]*. Bucket names are
unique across S3, so the resulting path is also unique. There is no concept
of hostname in s3 urls as far as I know.
-sujit
On Fri, Jul 17,
Using a hive-site.xml file on the classpath.
On Fri, Jul 17, 2015 at 8:37 AM, spark user spark_u...@yahoo.com.invalid
wrote:
Hi Roberto
I have a question regarding HiveContext.
When you create a HiveContext, where do you define the Hive connection properties?
Suppose Hive is not on the local machine and I
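A minimal sketch of what that looks like: the code itself carries no
credentials, and the metastore connection details (URL, username, password,
or a remote metastore URI) live in hive-site.xml on the classpath, under keys
such as javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionUserName,
javax.jdo.option.ConnectionPassword, or hive.metastore.uris.

import org.apache.spark.sql.hive.HiveContext

// HiveContext picks up hive-site.xml from the classpath
// (e.g. $SPARK_HOME/conf) and connects with whatever it finds there.
val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW TABLES").show()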
Spark newbie here, using Spark 1.3.1.
I’m consuming a stream and trying to pipe the data from the entire window to R
for analysis. The R algorithm needs the entire dataset from the stream
(everything in the window) in order to function properly; it can’t be broken up.
So I tried doing a
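A hedged sketch of one approach under those constraints (window lengths and the
R script path are hypothetical): coalesce each windowed RDD to a single
partition so one R process sees the whole dataset, then hand it over with pipe():

import org.apache.spark.streaming.Minutes

// stream: an existing DStream[String]
val windowed = stream.window(Minutes(10), Minutes(10))
windowed.foreachRDD { rdd =>
  // One partition => the script receives every record in the window.
  val results = rdd.coalesce(1).pipe("/path/to/analyze.R").collect()
  results.foreach(println)
}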
Hello,
I have been having trouble getting large Spark jobs to complete against my
Cassandra ring.
I’m finding that the CPU goes to 100% on one of the nodes, and then, after
many hours, the job fails.
Here are my Spark settings:
.set("spark.cassandra.input.split.size_in_mb", "128")
df: [name:String, Place:String, time: time:timestamp]
why not df: [name:String, Place:String, time:timestamp]?

From: Brandon White bwwintheho...@gmail.com
Sent: Friday, July 17, 2015, 2:18
To: user user@spark.apache.org
Subject:
Yes. More information in my talk -
https://www.youtube.com/watch?v=d5UJonrruHk
On Fri, Jul 17, 2015 at 1:15 AM, JoneZhang joyoungzh...@gmail.com wrote:
I see now.
There are three steps in Spark Streaming + Kafka data processing:
1. Receiving the data
2. Transforming the data
3. Pushing out the
Hello community,
I'm currently using Spark 1.3.1 with Hive support for outputting processed
data to an external Hive table backed by S3. I'm using a manual
specification of the delimiter, but I'd like to know if there is any
clean way to write in CSV format:
val sparkConf = new SparkConf()
Hi TD,
Thanks for the response. I do believe I understand the concept and the need
for the filter function now. I made the requisite code changes and am keeping
it running overnight to see its effect. Hopefully this should fix our
issue.
However, there was one place where I encountered a
Yeah, Spark SQL's Parquet support needs to do some metadata discovery when
first importing a folder containing Parquet files, and the discovered
metadata is cached.
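A minimal sketch of that behaviour (the path is hypothetical; sqlContext.read
is the Spark 1.4+ API, older versions use parquetFile):

// The first read of the folder scans Parquet footers to discover the schema.
val df = sqlContext.read.parquet("/data/parquet_folders")
df.printSchema()
// Subsequent reads of the same path reuse the cached metadata.
val df2 = sqlContext.read.parquet("/data/parquet_folders")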
Cheng
On 7/17/15 1:56 PM, shsh...@tsmc.com wrote:
Hi all,
our scenario is to generate lots of folders containing Parquet files, and
Sure, I have created JIRA SPARK-9131 - UDF change data values
https://issues.apache.org/jira/browse/SPARK-9131
On Thu, Jul 16, 2015 at 7:09 PM, Davies Liu dav...@databricks.com wrote:
Thanks for reporting this, could you file a JIRA for it?
On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra
Hi,
I used the createDataFrame API of SQLContext in Python and am getting an
OutOfMemoryException. I am wondering if it is creating the whole DataFrame in
memory?
I did not find any documentation describing the memory usage of Spark APIs.
The documentation given is nice, but a little more detail (especially on memory
You can bump up the number of partitions while creating the RDD you are using
for the df.
On 17 Jul 2015 21:03, Harit Vishwakarma harit.vishwaka...@gmail.com
wrote:
Hi,
I used the createDataFrame API of SQLContext in Python and am getting an
OutOfMemoryException. I am wondering if it is creating the whole
Hi Sean,
Thanks for the reply! I did double-check that the jar is the one I think I am
running:
[image: Inline image 2]
jar tf
/hpc/users/ahujaa01/src/spark/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar
| grep netlib | grep Native
The endpoint is the property you want to set. I would look at the source for
that.
Sent from my iPhone
On Jul 17, 2015, at 08:55, Sujit Pal sujitatgt...@gmail.com wrote:
Hi Schmirr,
The part after the s3n:// is your bucket name and folder name, i.e.
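As a hedged sketch, the properties below are the s3a ones from Hadoop 2.6+;
whether s3n honours a comparable endpoint setting depends on your Hadoop/jets3t
version, so verify the names against your Hadoop docs (host, bucket and keys
are hypothetical):

// Point the connector at an S3-compatible service instead of AWS.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "storage.example.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", "MY_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "MY_SECRET_KEY")
val rdd = sc.textFile("s3a://my-bucket/path/")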
Hello,
I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s 4.2
JDBC Driver. Reading from the database works ok, but I have encountered a
couple of issues writing back. In Scala 2.10 I can write back to the database
except for a couple of types.
1. When I read a
Make sure /usr/lib64 contains libgfortran.so.3; that's really the issue.
I'm pretty sure the answer is 'yes', but make sure the assembly has
jniloader too. I don't see why it wouldn't, but that's needed.
What is your env like -- local, standalone, YARN? How are you running?
Just want to make
I'd like to understand why the where field must exist in the select clause.
For example, the following select statement works fine:
- df.select("field1", "filter_field").filter(df("filter_field") === "value").show()
However, the next one fails with the error in operator !Filter
(filter_field#60 =
Can you try setting the spark.yarn.jar property to make sure it points to
the jar you're thinking of?
-Sandy
On Fri, Jul 17, 2015 at 11:32 AM, Arun Ahuja aahuj...@gmail.com wrote:
Yes, it's a YARN cluster, and I am using spark-submit to run. I have SPARK_HOME
set to the directory above and am using
I notice JSON objects are all parsed as Map[String,Any] in Jackson, but for
some reason the inferSchema tools in Spark SQL extract the schema of
nested JSON objects as StructTypes.
This makes it really confusing when trying to rectify the object hierarchy
when I have maps because the Catalyst
Hi, I have a similar use case. Did you find a solution to this problem of loading
DStreams into Hive using Spark Streaming? Please guide. Thanks.
Yes, it's a YARN cluster, and I am using spark-submit to run. I have SPARK_HOME
set to the directory above and am using the spark-submit script from there.
bin/spark-submit --master yarn-client --executor-memory 10g
--driver-memory 8g --num-executors 400 --executor-cores 1 --class
Hi all,
After the SPARK-5479 https://issues.apache.org/jira/browse/SPARK-5479 issue
fix (thanks to Marcelo Vanzin), pyspark now correctly handles adding several
Python files (or a zip folder with __init__.py) to PYTHONPATH in
yarn-cluster mode.
But adding a Python module as a zip folder still fails
Hi:
I'm using pyspark 1.3 and it seems that model.save is not implemented
for every model.
Here is what I have so far:

Model Name          | Model Class               | save available
Logistic Regression | LogisticRegressionModel   | NO
Random Forest       | TreeEnsembleModel         | OK
GBM                 | GradientBoostedTreesModel | OK
SVM
Run Spark with the --verbose flag to see what it read for that path.
I guess on Windows, if you are using backslashes, you need two of them (\\),
or just use forward slashes everywhere.
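A minimal sketch of the two forms (the path is illustrative):

val rdd1 = sc.textFile("C:\\data\\input.txt") // backslashes must be doubled
val rdd2 = sc.textFile("C:/data/input.txt")   // forward slashes also work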
On Fri, Jul 17, 2015 at 2:40 PM, Julien Beaudan jbeau...@stottlerhenke.com
wrote:
Hi,
I'm running a stand-alone
Are you running it from the command line (CLI) or through SparkLauncher?
If you can share the command (./bin/spark-submit ...) or the code snippet
you are running, it could give some clue.
On Fri, Jul 17, 2015 at 3:30 PM, Julien Beaudan jbeau...@stottlerhenke.com
wrote:
Hi Elkhan,
I ran
The simple answer is you should not update a broadcast variable. If you can
post the problem you are handling, people here should be able to provide
better suggestions.
On 18 Jul 2015 13:53, Raghavendra Pandey raghavendra.pan...@gmail.com
wrote:
Broadcast variables are immutable. Anyway, how
I'll add that there is a JIRA to override the default past some threshold of #
of unique keys: https://issues.apache.org/jira/browse/SPARK-4476
On Fri, Jul 17, 2015 at 1:32 PM, Michael Armbrust mich...@databricks.com
wrote:
The difference between a
The difference between a map and a struct here is that in a struct all
possible keys are defined as part of the schema and each can have a
different type (and we don't support union types). JSON doesn't have
differentiated data structures, so we go with the one that gives you more
information
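A minimal sketch of that behaviour (sqlContext.read.json is the Spark 1.4+ API;
the sample record is illustrative):

val rdd = sc.parallelize(Seq("""{"user": {"name": "a", "age": 1}}"""))
val df = sqlContext.read.json(rdd)
df.printSchema()
// root
//  |-- user: struct (nullable = true)
//  |    |-- age: long (nullable = true)
//  |    |-- name: string (nullable = true)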
This helps immensely. Thanks Michael!
On Fri, Jul 17, 2015 at 4:33 PM, Michael Armbrust mich...@databricks.com
wrote:
I'll add that there is a JIRA to override the default past some threshold of #
of unique keys: https://issues.apache.org/jira/browse/SPARK-4476
I've observed a number of cases where Spark does not clean HDFS side-effects
on errors, especially out-of-memory conditions. Here is an example from the
following code snippet executed in spark-shell:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx =
Hi Elkhan,
I ran Spark with --verbose, but the output looked the same to me - what
should I be looking for? At the beginning, the system properties which
are set are:
System properties:
SPARK_SUBMIT - true
spark.app.name - tests.testFileReader
spark.jars -
Looking at getCatalystType():
* Maps a JDBC type to a Catalyst type. This function is called only when
* the JdbcDialect class corresponding to your database driver returns null.
sqlType was carrying a value of -101.
However, I couldn't find -101 in
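For what it's worth, a hedged sketch of working around an unsupported vendor
type with a custom dialect (Spark 1.4+ JdbcDialects API; mapping -101 to
StringType is an assumption for illustration, not a verified fix):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

object MyOracleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  // Called when the built-in mapping gives up; -101 is the unsupported
  // vendor-specific type from the error above.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == -101) Some(StringType) else None
}

JdbcDialects.registerDialect(MyOracleDialect)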
Hi all,
Did you forget to restart the node managers after editing yarn-site.xml by
any chance?
-Andrew
2015-07-17 8:32 GMT-07:00 Andrew Lee alee...@hotmail.com:
I have encountered the same problem after following the document.
Here's my spark-defaults.conf
spark.shuffle.service.enabled
Oh, yeah of course. I'm writing from the command line (I haven't tried
the SparkLauncher), using
bin/spark-submit --class tests.testFileReader --master
spark://192.168.194.128:7077 --verbose ./sparkTest1.jar
All that the testFileReader class does is create an RDD from a few text
files -
Hi,
I was trying to get an Oracle table using the JDBC data source:
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:USER/p...@host.com:1517:sid",
  "dbtable" -> "USER.TABLE",
  "driver" -> "oracle.jdbc.OracleDriver"))
and got the error below
java.sql.SQLException: Unsupported type -101
at
Hi,
I'm running a stand-alone cluster in Windows 7, and when I try to run any
worker on the machine, I get the following error:
15/07/17 14:14:43 ERROR ExecutorRunner: Error running executor
java.io.IOException: Cannot run program
Each operation on a dataframe is completely independent and doesn't know
what operations happened before it. When you do a selection, you are
removing other columns from the dataframe and so the filter has nothing to
operate on.
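Concretely, a minimal sketch (column names taken from the example earlier in
the thread): filter while filter_field is still present, then select:

df.filter(df("filter_field") === "value")
  .select("field1")
  .show()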
On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis
Does this mean there is a possible mismatch of the JDBC driver with Oracle?
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, July 17, 2015 2:09 PM
To: Sambit Tripathy (RBEI/EDS1)
Cc: user@spark.apache.org
Subject: Re: What is java.sql.SQLException: Unsupported type -101?
Looking at