That's a Hive version issue, not a Hadoop version issue.
On Tue, Oct 7, 2014 at 7:21 AM, Li HM hmx...@gmail.com wrote:
Thanks for the reply.
Please refer to my other post entitled How to make ./bin/spark-sql
work with hive. It has all the errors/exceptions I am getting.
If I understand you
Hi
Shark supported both the HiveServer1 and HiveServer2 thrift interfaces
(using $ bin/shark -service sharkserver[1 or 2]).
SparkSQL seems to support only HiveServer2. I was wondering what is involved
to add support for HiveServer1. Is this something straightforward to do that
I can embark on
Hi All,
Continuing on this discussion... Is there a good reason why the def of
saveAsNewAPIHadoopFiles in
org/apache/spark/streaming/api/java/JavaPairDStream.scala
is defined like this -
def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
Hi Landon
I had a problem very similar to yours, where we had to process around 5
million relatively small files on NFS. After trying various options, we did
something similar to what Matei suggested.
1) take the original path and find the subdirectories under that path and
then parallelize the
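In case it helps, here is a rough Scala sketch of that approach (the root path and the per-file parsing are made up; adapt them to your layout):

// List the subdirectories of the root path on the driver
// (hypothetical path; use whatever filesystem API fits your NFS mount).
val root = new java.io.File("/mnt/nfs/input")
val subDirs = root.listFiles().filter(_.isDirectory).map(_.getAbsolutePath)

// Parallelize the directory names and let each task read its own directory,
// so the driver never has to enumerate millions of individual files itself.
val records = sc.parallelize(subDirs, subDirs.length).flatMap { dir =>
  new java.io.File(dir).listFiles().filter(_.isFile).flatMap { f =>
    val src = scala.io.Source.fromFile(f)
    try src.getLines().toList finally src.close()
  }
}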
Hi,
I have already unsuccessfully asked quite a similar question on Stack Overflow,
particularly here:
http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
I've also unsuccessfully tried some workarounds; the workaround
problem can be
Try to set --total-executor-cores to limit how many total cores it can use.
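For example (master URL, core count, class, and jar are placeholders; as far as I know the flag applies to standalone and Mesos modes):

./bin/spark-submit --master spark://master:7077 \
  --total-executor-cores 16 \
  --class com.example.MyApp my-app.jar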
Thanks Regards,
Meethu M
On Thursday, 2 October 2014 2:39 AM, Akshat Aranya aara...@gmail.com wrote:
I guess one way to do so would be to run 1 worker per node, like say, instead
of running 1 worker and giving
I have found related PRs in the parquet-mr project:
https://github.com/Parquet/parquet-mr/issues/324, however using that version
of the bundle doesn't solve the issue. The issue seems to be related to
package scope in separate class loaders. I am busy looking for a
workaround.
Hello,
I'm trying to read from s3 using a simple spark java app:
-
SparkConf sparkConf = new SparkConf().setAppName("TestApp");
sparkConf.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
Hi all,
My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0, it's
throwing exceptions and tasks are failing.
The code contains some map and filter transformations followed by groupByKey
(reduceByKey in another code ). What I could find out is that the code works
fine
Hi again,
Thank you for your suggestion :)
I've tried to implement this method but I'm stuck trying to union the
payload before creating the graph.
Below is a really simplified snippet of what has worked so far.
// Reading the articles given in JSON format
val articles =
Hi David,
Thanks for the reply and the effort you put into explaining the concepts. Thanks for
the example. It worked.
The build command should be correct. What exact error did you encounter
when trying Spark 1.1 + Hive 0.12 + Hadoop 2.5.0?
On 10/7/14 2:21 PM, Li HM wrote:
Thanks for the reply.
Please refer to my other post entitled How to make ./bin/spark-sql
work with hive. It has all the
Hi, the same problem happens when I try several joins together, such as
'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)'
The error information is as follows:
Hi, in fact, the same problem happens when I try several joins together:
SELECT *
FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)
py4j.protocol.Py4JJavaError: An error occurred while
Hi All
I have one master and one worker on AWS (amazon web service) and am running
spark map reduce code provided on the link
https://spark.apache.org/examples.html
We are using Spark version 1.0.2
Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
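// For reference, the word-count example on that page continues like this:
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")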
Maybe sc.wholeTextFiles() is what you want; you can get the whole text
and parse it by yourself.
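A tiny Scala sketch of that idea (the path and the parseArticle function are hypothetical placeholders):

// wholeTextFiles gives an RDD of (fileName, fileContent) pairs, so each file
// can be handed to a parser as one unit instead of line by line.
val pages = sc.wholeTextFiles("s3n://my-bucket/wiki-dump/")
val parsed = pages.map { case (name, content) => parseArticle(content) }  // parseArticle is your own parser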
On Tue, Oct 7, 2014 at 1:06 AM, jan.zi...@centrum.cz wrote:
Hi,
I have already unsuccessfully asked quite a similar question on Stack Overflow,
particularly here:
Not sure if it's supposed to work. Can you try newAPIHadoopFile(), passing
in the required configuration object?
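Something along these lines, sketched in Scala (the input format, key/value classes, and bucket path are just an illustration, not the exact call for this case):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration()
conf.set("fs.s3n.awsAccessKeyId", "XX")        // placeholders, as in the original post
conf.set("fs.s3n.awsSecretAccessKey", "XX")

val rdd = sc.newAPIHadoopFile("s3n://my-bucket/path",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)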
On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini tomer@gmail.com wrote:
Hello,
I'm trying to read from s3 using a simple spark java app:
-
SparkConf
Reading the Spark Streaming Programming Guide I found a couple of
interesting points. First of all, while talking about receivers, it says:
*If the number of cores allocated to the application is less than or equal
to the number of input DStreams / receivers, then the system will receive
data,
Currently I am not doing anything; if anything changes, I start from scratch.
In general I doubt there are many options to account for schema changes. If you
are reading files using Impala, then it may allow it if the schema changes are
append-only. Otherwise existing Parquet files have to be migrated
Thanks for the input Soumitra.
On Tue, Oct 7, 2014 at 10:24 AM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
Currently I am not doing anything; if anything changes, I start from scratch.
In general I doubt there are many options to account for schema changes.
If you are reading files using
Try using s3n:// instead of s3 (for the credential configuration as well).
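That is, something like this Scala sketch (keys and bucket are placeholders):

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "XX")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "XX")
val data = sc.textFile("s3n://my-bucket/some/path")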
On Tue, Oct 7, 2014 at 9:51 AM, Sunny Khatri sunny.k...@gmail.com wrote:
Not sure if it's supposed to work. Can you try newAPIHadoopFile(), passing
in the required configuration object?
On Tue, Oct 7, 2014 at 4:20 AM,
Not familiar with the scikit SVM implementation (and I assume you are using
LinearSVC). To figure out an optimal decision boundary based on the scores
obtained, you can use an ROC curve, varying your thresholds.
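If the scoring ends up on the Spark side instead, MLlib's BinaryClassificationMetrics can compute that curve directly; a minimal sketch, assuming scoreAndLabels is an RDD[(Double, Double)] of (score, label) pairs:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val roc = metrics.roc()              // (false positive rate, true positive rate) points per threshold
val auc = metrics.areaUnderROC()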
On Tue, Oct 7, 2014 at 12:08 AM, Adamantios Corais
adamantios.cor...@gmail.com wrote:
The issue is that you're using SQLContext instead of HiveContext. SQLContext
implements a smaller subset of the SQL language and so you're getting a SQL
parse error because it doesn't support the syntax you have. Look at how you'd
write this in HiveQL, and then try doing that with HiveContext.
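A minimal sketch of the switch (the table and query are placeholders):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// HiveContext accepts the richer HiveQL syntax that SQLContext's parser rejects
val result = hiveContext.sql("SELECT key, count(*) FROM my_table GROUP BY key")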
You can create a new Configuration object in something like a
mapPartitions method and use that. It will pick up local Hadoop
configuration from the node, but presumably the Spark workers and HDFS
data nodes are colocated in this case, so the machines have the
correct Hadoop config locally.
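A bare-bones sketch of that pattern, assuming pathsRdd is an RDD[String] of HDFS paths (what you do with the Configuration inside is up to you):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

pathsRdd.mapPartitions { paths =>
  // Created on the worker, so it picks up that node's local Hadoop configuration
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  paths.map(p => (p, fs.getFileStatus(new Path(p)).getLen))  // e.g. look up file sizes
}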
On
Thanks Cheng.
Here is the error message after a fresh build.
$ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -DskipTests
clean package
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM
Hi All,
I have the following classes of features:
class A: 15000 features
class B: 170 features
class C: 900 features
class D: 6000 features.
I use linear regression (over sparse data). I get excellent results with low
RMSE (~0.06) for the following combinations of classes:
1. A + B + C
2. B + C + D
3.
BTW, one detail:
When the number of iterations is 100, all weights are zero or below and the indices
are only from set A.
When the number of iterations is 150, I see 30+ non-zero weights (when sorted by
weight) and the indices are distributed across all sets. However, MSE is high (5.xxx)
and the result does
- We set ulimit to 50, but I still get the same "too many open files"
warning.
- I tried setting consolidateFiles to True, but that did not help either.
I am using a Mesos cluster. Does Mesos have any limit on the number of
open files?
thanks
java.lang.NullPointerException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
Hi Steve, what Spark version are you running?
2014-10-07 14:45 GMT-07:00 Steve Lewis lordjoe2...@gmail.com:
java.lang.NullPointerException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at
Did you test different regularization parameters and step sizes? In
the combination that works, I don't see A + D. Did you test that
combination? Are there any linear dependencies between A's columns and
D's columns? -Xiangrui
On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak ssti...@live.com wrote:
Hi All,
Does anyone know if CDH5.1.2 packages the spark streaming kafka connector
under the spark externals project?
--
~
Is it possible to store Spark shuffle files on Tachyon?
I'm probably doing something obviously wrong, but I'm not seeing it.
I have the program below (in a file try1.scala), which is similar but not
identical to the examples.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
Thanks Sean,
Sorry in my earlier question I meant to type CDH5.1.3 not CDH5.1.2
I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3
But for some reason Eclipse complains that the import
org.apache.spark.streaming.kafka cannot be resolved, even though I have
included the
Are you sure the new ulimit has taken effect?
How many cores are you using? How many reducers?
In general, if a node in your cluster has C assigned cores and you run
a job with X reducers, then Spark will open C*X files in parallel and
start writing. Shuffle
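(For example, with 16 cores per node and 1,000 reducers, that is 16 * 1,000 = 16,000 files open on a single node, which easily blows past a default ulimit.)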
Try using spark-submit instead of spark-shell
On Tue, Oct 7, 2014 at 3:47 PM, spr s...@yarcdata.com wrote:
I'm probably doing something obviously wrong, but I'm not seeing it.
I have the program below (in a file try1.scala), which is similar but not
identical to the examples.
import
@SK:
Make sure ulimit has taken effect as Todd mentioned. You can verify via
ulimit -a. Also make sure you have proper kernel parameters set in
/etc/sysctl.conf (MacOSX)
On Tue, Oct 7, 2014 at 3:57 PM, Lisonbee, Todd todd.lison...@intel.com
wrote:
Are you sure the new ulimit has taken effect?
|| Try using spark-submit instead of spark-shell
Two questions:
- What does spark-submit do differently from spark-shell that makes you
think that may be the cause of my difficulty?
- When I try spark-submit it complains about Error: Cannot load main class
from JAR:
Never mind... my bad... made a typo.
looks good.
Thanks,
On Tue, Oct 7, 2014 at 3:57 PM, Abraham Jacob abe.jac...@gmail.com wrote:
Thanks Sean,
Sorry in my earlier question I meant to type CDH5.1.3 not CDH5.1.2
I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3
But for some
Hi
I think I found a bug in the IPython notebook integration. I am not sure how
to report it.
I am running spark-1.1.0-bin-hadoop2.4 on an AWS EC2 cluster. I start the
cluster using the launch script provided by Spark.
I start IPython notebook on my cluster master as follows and use an SSH
tunnel
Hi,
I am using fold(zeroValue)(t1, t2) on the RDD. I noticed that it runs
in parallel on all the partitions and then aggregates the results from the
partitions. My data object is not aggregate-able, and I was wondering if
there's any way to run the fold sequentially. [I am looking to do a foldLeft
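One way to get a strictly sequential foldLeft, at the cost of streaming the data through the driver, is toLocalIterator; a sketch, with zeroValue and op standing in for whatever your fold uses:

// Partitions are pulled to the driver one at a time and folded in order,
// so the combine step never runs in parallel.
val result = rdd.toLocalIterator.foldLeft(zeroValue)(op)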
Hi
I am new to Spark and trying to develop an application that loads data from
Hive. Here is my setup:
* Spark-1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-Phive)
* Executing Spark-shell on a box with 16 GB RAM
* 4 Cores Single Processor
* OpenCSV library (SerDe)
* Hive table
Jianneng,
On Wed, Oct 8, 2014 at 8:44 AM, Jianneng Li jiannen...@berkeley.edu wrote:
I understand that Spark Streaming uses micro-batches to implement
streaming, while traditional streaming systems use the record-at-a-time
processing model. The performance benefit of the former is throughput,
Hi,
On Wed, Oct 8, 2014 at 4:50 AM, Josh J joshjd...@gmail.com wrote:
I have a source which fluctuates in the frequency of streaming tuples. I
would like to process certain batch counts, rather than batch window
durations. Is it possible to either
1) define batch window sizes
Cf.
Here is the hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Hive Execution Parameters -->
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
    <description>controls whether to connect to remote metastore server
      or open a new
I found the reason why reading HBase is too slow. Although each
regionserver serves multiple regions for the table I'm reading, the number
of Spark workers allocated by Yarn is too low. Actually, I could see that
the table has dozens of regions spread over about 20 regionservers, but
only two
You will need to restart your Mesos workers to pick up the new limits as
well.
On Tue, Oct 7, 2014 at 4:02 PM, Sunny Khatri sunny.k...@gmail.com wrote:
@SK:
Make sure ulimit has taken effect as Todd mentioned. You can verify via
ulimit -a. Also make sure you have proper kernel parameters set
Hi Meethu,
I believe you may be hitting a regression in
https://issues.apache.org/jira/browse/SPARK-3633
If you are able, could you please try running a patched version of Spark
1.1.0 that has commit 4fde28c reverted and see if the errors go away?
Posting your results on that bug would be
Hello,
I was interested in testing Parquet V2 with Spark SQL, but noticed after some
investigation that the parquet writer that Spark SQL uses is fixed at V1 here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L350.