Hi Dmitry,
I am not familiar with all of the details you have just described, but I
think Tachyon should be able to help you.
If you store all of your resource files in HDFS or S3 or both, you can run
Tachyon to use those storage systems as the under storage (
For #1, yes it is possible.
You can find some examples in the hbase-spark module of HBase, where HBase as
a DataSource is provided.
e.g.
https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseRDDFunctions.scala
Cheers
On Thu, Jan 14, 2016 at 5:04 AM,
Tried using the 1.6 version of Spark, which takes numberOfFeatures as a fifth
argument in the API, but still getting featureImportance as null.
RandomForestClassifier rfc = getRandomForestClassifier( numTrees, maxBinSize,
maxTreeDepth, seed, impurity);
RandomForestClassificationModel rfm =
Hi Bryan,
I ran "$> python --version" on every node on the cluster, and it is Python
2.7.8 for every single one.
When I try to submit the Python example in client mode
* ./bin/spark-submit --master yarn --deploy-mode client
--driver-memory 4g --executor-memory 2g
I personally support this. I had suggested drawing the line at Hadoop
2.6, but that's minor. More info:
Hadoop 2.7: April 2015
Hadoop 2.6: Nov 2014
Hadoop 2.5: Aug 2014
Hadoop 2.4: April 2014
Hadoop 2.3: Feb 2014
Hadoop 2.2: Oct 2013
CDH 5.0/5.1 = Hadoop 2.3 + backports
CDH 5.2/5.3 = Hadoop 2.5 +
Hi Daniel, Andrew
Thank you for your answers. So it's not possible to read the accumulator
value until the action that manipulates it finishes. That's unfortunate; I'll
think of something else. However, the most important thing in my application
is the ability to launch 2 (or more) actions in parallel and
Hi
We have an RDD that needs to be mapped with information from
HBase, where the exact key is the user id.
What are the alternatives for doing this?
- Is it possible to do HBase.get() requests from a map function in Spark? (a
rough sketch of what I mean is below)
- Or should we join the RDD with a full HBase table scan?
I ask
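A rough sketch of what I mean by the first option (the table, column family
and column names are made up, I'm assuming the HBase 1.x client API, and
userIdRdd stands in for our RDD of user ids):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val enriched = userIdRdd.mapPartitions { iter =>
  // one connection per partition rather than one per record
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("users"))
  val rows = iter.map { userId =>
    val result = table.get(new Get(Bytes.toBytes(userId)))
    val name = Bytes.toString(
      result.getValue(Bytes.toBytes("d"), Bytes.toBytes("name")))
    (userId, name)
  }.toList  // materialise before closing the connection
  table.close()
  conn.close()
  rows.iterator
}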
OK so it looks like Tachyon is a cluster memory plugin marked as
"experimental" in Spark.
In any case, we've got a few requirements for the system we're working on
which may drive the decision for how to implement large resource file
management.
The system is a framework of N data analyzers
Hi
After executing the SQL
sqlContext.sql("select day_time from my_table limit 10").show()
my output looks like:
+--------------------+
|            day_time|
+--------------------+
|2015/12/15 15:52:...|
|2015/12/15 15:53:...|
|2015/12/15 15:52:...|
|2015/12/15 15:52:...|
|2015/12/15 15:52:...|
Hello,
I was wondering if somebody is able to help me get to the bottom of a null
pointer exception I'm seeing in my code. I've managed to narrow down a problem
in a larger class to my use of Joda's DateTime functions. I've successfully run
my code in Scala, but I've hit a few problems when
The other thing from some folks' recommendations on this list was Apache
Ignite. Their In-Memory File System (
https://ignite.apache.org/features/igfs.html) looks quite interesting.
On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg wrote:
> OK so it looks like
Hi list,
I ran into an issue which I think could be a bug.
I have a Hive table stored as parquet files. Let's say it's called
testtable. I found the code below stuck forever in spark-shell with a local
master or driver/executor:
sqlContext.sql("select * from
Hi,
I tried take(1500) and test.collect, and these both work on the "single" map
statement.
I'm very new to Kryo serialisation; I managed to find some code, copied and
pasted it, and that's what originally made the single map statement work:
class MyRegistrator extends KryoRegistrator {
Hi,
Try …..show(false)
public void show(int numRows,
boolean truncate)
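For example, assuming your column is wider than the default 20-character
truncation:

sqlContext.sql("select day_time from my_table limit 10").show(10, false)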
Kind Regards,
Alex.
From: Eli Super [mailto:eli.su...@gmail.com]
Sent: 14 January 2016 13:09
To: user@spark.apache.org
Subject: Spark SQL . How to enlarge output rows ?
Hi
After executing sql
It does look somehow like the state of the DateTime object isn't being
recreated properly on deserialization somehow, given where the NPE
occurs (look at the Joda source code). However the object is
java.io.Serializable. Are you sure the Kryo serialization is correct?
It doesn't quite explain why
Thanks Ted!
On Thu, Jan 14, 2016 at 4:49 PM, Ted Yu wrote:
> For #1, yes it is possible.
>
> You can find some example in hbase-spark module of hbase where hbase as
> DataSource is provided.
> e.g.
>
>
That's right, though it's possible the default way Kryo chooses to
serialize the object doesn't work. I'd debug a little more and print
out as much as you can about the DateTime object at the point it
appears to not work. I think there's a real problem and it only
happens to not turn up for the
I appreciate this – thank you.
I’m not an admin on the box I’m using spark-shell on – so I’m not sure I can
add them to that namespace. I’m hoping if I declare the JodaDateTimeSerializer
class in my REPL that I can still get this to work. I think the INTERVAL part
below may be key, I haven’t
Hi Arkadiusz,
partitionBy was not designed to handle many distinct values, at least the last
time I used it. If you search the mailing list, I think there are a couple of
people who have faced similar issues. For example, in my case, it wouldn't work
with over a million distinct user ids. It will require a lot of
Could you change MEMORY_ONLY_SER to MEMORY_AND_DISK_SER_2 and see if this
still happens? It may be because you don't have enough memory to cache the
events.
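If you are creating the stream with actorStream, the storage level is passed
as the third argument - a rough sketch (the actor class and receiver name here
are placeholders):

import akka.actor.Props
import org.apache.spark.storage.StorageLevel

val stream = ssc.actorStream[(String, String)](
  Props[MyEventActor], "my-receiver",
  StorageLevel.MEMORY_AND_DISK_SER_2)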
On Thu, Jan 14, 2016 at 4:06 PM, Lin Zhao wrote:
> Hi,
>
> I'm testing spark streaming with actor receiver. The actor
Hi Shixiong,
I tried this but it still happens. If it helps, it's 1.6.0 and runs on YARN.
Batch duration is 20 seconds.
Some logs seemingly related to block manager:
16/01/15 00:31:25 INFO receiver.BlockGenerator: Pushed block
input-0-1452817873000
16/01/15 00:31:27 INFO storage.MemoryStore:
Hi All,
I am running a Spark program where one of my parts is using Spark as a
scheduler rather than a data management framework. That is, my job can be
described as RDD[String] where the string describes an operation to perform
which may be cheap or expensive (process an object which may have a
Hi Shixiong,
Just figured it out. I was doing a .print() as the output operation, which
seems to stop the batch once 10 records have gone through. I changed it to a
no-op foreachRDD and it works.
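Roughly what I ended up with, with an action inside so the batch is still
fully evaluated:

stream.foreachRDD { rdd =>
  // touch every record so the whole batch is computed, but print nothing
  rdd.foreach(_ => ())
}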
Thanks for jumping in to help me.
From: "Shixiong(Ryan) Zhu"
If you are able to just train the RandomForestClassificationModel from ML
directly instead of training the old model and converting, then that would
be the way to go.
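Something along these lines (a Scala sketch; the `training` DataFrame and the
label/features column names are assumptions on my part):

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(numTrees)
  .setMaxBins(maxBinSize)
  .setMaxDepth(maxTreeDepth)
  .setSeed(seed)
  .setImpurity(impurity)
val model = rf.fit(training)
// featureImportances is computed for the ML model (Spark 1.5+)
println(model.featureImportances)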
On Thu, Jan 14, 2016 at 2:21 PM,
wrote:
> Thanks so much Bryan for your response. Is
Hi, what is the easiest way to configure the Spark webui to bind to
localhost or 127.0.0.1? I intend to use this with ssh socks proxy to
provide a rudimentary "secured access". Unlike hadoop config options,
Spark doesn't allow the user to directly specify the ip addr to bind
services to.
Yeah, it's hard-coded as "0.0.0.0". Could you send a PR to add a
configuration for it?
On Thu, Jan 14, 2016 at 2:51 PM, Zee Chen wrote:
> Hi, what is the easiest way to configure the Spark webui to bind to
> localhost or 127.0.0.1? I intend to use this with ssh socks proxy to
>
Hi Ted,
So unfortunately after looking into the cluster manager that I will be
using for my testing (I'm using a super-computer called XSEDE rather than
AWS), it looks like the cluster does not actually come with HBase installed
(this cluster is becoming somewhat problematic, as it is essentially
Hi,
I'm testing spark streaming with actor receiver. The actor keeps calling
store() to save a pair to Spark.
Once the job is launched, on the UI everything looks good. Millions of events
get through every batch. However, I added logging to the first step and found
that only 20 or 40 events
sure will do.
On Thu, Jan 14, 2016 at 3:19 PM, Shixiong(Ryan) Zhu
wrote:
> Yeah, it's hard-coded as "0.0.0.0". Could you send a PR to add a
> configuration for it?
>
> On Thu, Jan 14, 2016 at 2:51 PM, Zee Chen wrote:
>>
>> Hi, what is the easiest way to
Could you post the codes of MessageRetriever? And by the way, could you
post the screenshot of tasks for a batch and check the input size of these
tasks? Considering there are so many events, there should be a lot of
blocks as well as a lot of tasks.
On Thu, Jan 14, 2016 at 4:34 PM, Lin Zhao
Hi Rachana,
I got the same exception. It is because computing the feature importance
depends on impurity stats, which is not calculated with the old
RandomForestModel in MLlib. Feel free to create a JIRA for this if you
think it is necessary, otherwise I believe this problem will be eventually
Hello Experts,
I am getting started with Hive with Spark as the query engine. I built the
package from sources. I am able to invoke Hive CLI and run queries and see
in Ambari that Spark applications are being created, confirming Hive is using
Spark as the engine.
However other than Hive CLI, I am
We automatically convert types for UDFs defined in Scala, but we can't do
it in Java because the types are erased by the compiler. If you want to
use double you should cast before calling the UDF.
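For example, something like this (the UDF and column names are placeholders):

// in SQL
sqlContext.sql("SELECT myUdf(CAST(x AS DOUBLE)) FROM my_table")

// or with the DataFrame API
import org.apache.spark.sql.functions.{callUDF, col}
df.select(callUDF("myUdf", col("x").cast("double")))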
On Wed, Jan 13, 2016 at 8:10 PM, Raghu Ganti wrote:
> So, when I try
On Thu, Jan 14, 2016 at 10:17 AM, Sanjeev Verma
wrote:
> now it spawns a single executor with 1060M size; I am not able to understand
> why this time it runs the executors with 1G + overhead, not the 2G that I
> specified.
Where are you looking for the memory size for the
Would this go away if the Spark source were compiled against Java 1.8 (since
the problem of type erasure is solved through the proper generics
implementation in Java 1.8)?
On Thu, Jan 14, 2016 at 1:42 PM, Michael Armbrust
wrote:
> We automatically convert types for UDFs
Please reply to the list.
The web ui does not show the total size of the executor's heap. It
shows the amount of memory available for caching data, which is, give
or take, 60% of the heap by default.
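As a rough sanity check only (this is a guess, assuming the pre-1.6 defaults
spark.storage.memoryFraction=0.6 and spark.storage.safetyFraction=0.9, and
that the JVM reports somewhat less than the requested 2g as its max heap):

  ~1963 MB (Runtime.maxMemory for -Xmx2g) * 0.6 * 0.9 ≈ 1060 MB

which would be consistent with the number you are seeing.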
On Thu, Jan 14, 2016 at 11:03 AM, Sanjeev Verma
wrote:
> I am
I am seeing strange behaviour while running Spark in yarn-client mode. I
am observing this on a single-node YARN cluster. In spark-defaults I have
configured the executor memory as 2g and started the spark shell as follows:
bin/spark-shell --master yarn-client
which triggers the 2 executors on
Hi
What is the proper configuration for saving a Parquet partition with a
large number of repeated keys?
In the code below I load 500 million rows of data and partition it on a
column with not so many distinct values.
Using spark-shell with 30g per executor and driver and 3 executor cores
Thanks for your reply, Ted.
Below is the stack dump for all threads:
Thread dump for executor driver
Updated at 2016/01/14 20:35:41
Thread 89: Executor task launch worker-0 (TIMED_WAITING)
sun.misc.Unsafe.park(Native Method)
Thanks for your response.
Our code as below :
public void process(){
logger.info("streaming process start !!!");
SparkConf sparkConf = createSparkConf(this.getClass().getSimpleName());
JavaStreamingContext jsc = this.createJavaStreamingContext(sparkConf);
The Livy build test from master fails with the problem below. I can't track it
down.
YARN shows Livy Spark yarn application as running.
Although attempt to connect to application master shows connection refused:
HTTP ERROR 500
> Problem accessing /proxy/application_1448640910222_0046/. Reason:
>
Hi,
I would like to repartition / coalesce my data so that it is saved into one
Parquet file per partition. I would also like to use the Spark SQL
partitionBy API. So I could do that like this:
df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
Today is my day... Trying to go through where I can pitch in. Let me know if
the below makes sense.
I looked at the Joda Java API source code (1.2.9) and traced that call in the
NPE. It looks like the AssembledChronology class is being used, and the iYears
instance variable is defined as transient.
Hi, I have a special requirement where I need to process the data in one
partition last, after doing a lot of filtering, updating, etc. on a DataFrame.
Currently, to process data in one partition, I am using coalesce(1), which is
killing performance; my jobs hang for hours, even 5-6 hours, and I don't
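For reference, a simplified sketch of what I am doing now versus what I am
considering (filteredDf is just a stand-in for my filtered/updated DataFrame;
my understanding is that coalesce(1) avoids a shuffle and so collapses all the
upstream work into a single task, while repartition(1) adds one shuffle but
keeps the earlier steps parallel):

// current: everything upstream runs in one task
val single = filteredDf.coalesce(1)

// considering: shuffle once, keep the filtering/updating parallel
val single2 = filteredDf.repartition(1)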
My fault, I should have read the documentation more carefully -
http://spark.apache.org/docs/latest/sql-programming-guide.html says precisely
that I need to add these 3 jars to the classpath in case I need them. We
cannot include them in the fat jar, because they are OSGI bundles and require
plugin.xml and
Hi,
I'm trying to set the value of a Hadoop parameter within spark-shell, and
System.setProperty("HADOOP_USER_NAME", "hadoop") seems to not be doing the
trick.
Does anyone know how I can set the hadoop.job.ugi parameter from within
spark-shell?
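For what it's worth, the closest I've found so far is setting it on the Hadoop
Configuration that the SparkContext carries, though I'm not sure it is picked
up for authentication:

sc.hadoopConfiguration.set("hadoop.job.ugi", "hadoop")

Also, my understanding is that HADOOP_USER_NAME is read as an environment
variable when the JVM starts, so exporting it before launching spark-shell
(rather than calling System.setProperty) might be what's needed.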
Cheers
I don't believe that Java 8 got rid of erasure. In fact I think it's
actually worse when you use Java 8 lambdas.
On Thu, Jan 14, 2016 at 10:54 AM, Raghu Ganti wrote:
> Would this go away if the Spark source was compiled against Java 1.8
> (since the problem of type erasure
If you have a second could you post the version of derby that you
installed, the contents of hive-site.xml and the command you use to run
(along with spark version?). I'd like to retry the installation.
On Thu, Jan 7, 2016 at 7:35 AM, Deenar Toraskar
wrote:
> I
Praveen,
Zeppelin uses Spark's REPL.
I'm currently writing an app that is a web service, which is going to run
spark jobs.
So, at the init stage I just create JavaSparkContext and then use it for
all users requests. Web service is stateless. The issue with stateless is
that it's possible to run
Can you capture one or two stack traces of the local master process and
pastebin them ?
Thanks
On Thu, Jan 14, 2016 at 6:01 AM, Kai Wei wrote:
> Hi list,
>
> I ran into an issue which I think could be a bug.
>
> I have a Hive table stored as parquet files. Let's say it's
I had a similar problem a while back and leveraged these Kryo serializers,
https://github.com/magro/kryo-serializers. I had to fall back to version
0.28, but that was a while back. You can add these to the
org.apache.spark.serializer.KryoRegistrator
and then set your registrator in the spark
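Roughly something like this (assuming the de.javakaffee kryo-serializers jar
is on the classpath):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.DateTime
import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // register DateTime with a serializer that rebuilds its transient state
    kryo.register(classOf[DateTime], new JodaDateTimeSerializer())
  }
}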
Hi there,
I'm facing a weird issue when upgrading from Spark 1.4.1 streaming driver
on EMR 3.9 (hence Hadoop 2.4.0) to Spark 1.5.2 on EMR 4.2 (hence Hadoop
2.6.0).
Basically, the very same driver which used to terminate after a timeout as
expected, now does not. In particular, as long as the
Could you try to use "Kryo.setDefaultSerializer" like this:
class YourKryoRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.setDefaultSerializer(classOf[com.esotericsoftware.kryo.serializers.JavaSerializer])
}
}
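and make sure the registrator is actually wired in via the Spark conf, for
example (use the fully-qualified class name of your registrator):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "your.package.YourKryoRegistrator")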
On Thu, Jan 14, 2016 at 12:54 PM, Durgesh
Could you show your code? Did you use `StreamingContext.awaitTermination`?
If so, it will return if any exception happens.
On Wed, Jan 13, 2016 at 11:47 PM, Triones,Deng(vip.com) <
triones.d...@vipshop.com> wrote:
> What’s more, I am running a 7*24 hour job, so I won’t call System.exit()
> by