date:20141024

Usually when the SparkContext throws an NPE it means that it has been shut
down due to some earlier failure.

On Wed, Oct 22, 2014 at 5:29 PM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> I got java.lang.NullPointerException. Please help!
>
>
> sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity,
> l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit
> 10").collect().foreach(println);
>
> 2014-10-23 08:20:12,024 INFO
> [sparkDriver-akka.actor.default-dispatcher-31] scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Stage 41 (runJob at basicOperators.scala:136)
> finished in 0.086 s
> 2014-10-23 08:20:12,024 INFO  [Result resolver thread-1]
> scheduler.TaskSchedulerImpl (Logging.scala:logInfo(59)) - Removed TaskSet
> 41.0, whose tasks have all completed, from pool
> 2014-10-23 08:20:12,024 INFO  [main] spark.SparkContext
> (Logging.scala:logInfo(59)) - Job finished: runJob at
> basicOperators.scala:136, took 0.090129332 s
> [9001,6,-4584121,17,1997-01-04,N,O]
> [9002,1,-2818574,23,1996-02-16,N,O]
> [9002,2,-2449102,21,1993-12-12,A,F]
> [9002,3,-5810699,26,1994-04-06,A,F]
> [9002,4,-489283,18,1994-11-11,R,F]
> [9002,5,2169683,15,1997-09-14,N,O]
> [9002,6,2405081,4,1992-08-03,R,F]
> [9002,7,3835341,40,1998-04-28,N,O]
> [9003,1,1900071,4,1994-05-05,R,F]
> [9004,1,-2614665,41,1993-06-13,A,F]
>
>
> If "order by L_LINESTATUS” is added then error:
> sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity,
> l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS
> limit 10").collect().foreach(println);
>
> 2014-10-23 08:22:08,524 INFO  [main] parse.ParseDriver
> (ParseDriver.java:parse(179)) - Parsing command: select l_orderkey,
> l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS
> from lineitem order by L_LINESTATUS limit 10
> 2014-10-23 08:22:08,525 INFO  [main] parse.ParseDriver
> (ParseDriver.java:parse(197)) - Parse Completed
> 2014-10-23 08:22:08,526 INFO  [main] metastore.HiveMetaStore
> (HiveMetaStore.java:logInfo(454)) - 0: get_table : db=boc_12 tbl=lineitem
> 2014-10-23 08:22:08,526 INFO  [main] HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(239)) - ugi=hd ip=unknown-ip-addr 
> cmd=get_table
> : db=boc_12 tbl=lineitem
> java.lang.NullPointerException
> at
> org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
> at
> org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
> at org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:63)
> at
> org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68)
> at
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
> at
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
> at
> org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
> at
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at
> org.apache.spark.sql.execution.SparkStrategies$TakeOrdered$.apply(SparkStrategies.scala:191)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
> at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
> at $iwC$$iwC$$iwC$$iwC.(:15)
> at $iwC$$iwC$$iwC.(:20)
> at $iwC$$iwC.(:22)
> at $iwC.(:24)
> at (:26)
> at .(:30)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
> at
> org.apache.spark.repl.SparkIMain$Request.

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai

oh, we just train the model in the standardized space which will help
the convergence of LBFGS. Then we convert the weights to original
space so the whole thing is transparent to users.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Oct 24, 2014 at 3:13 PM, Debasish Das  wrote:
> @dbtsai for condition number what did you use ? Diagonal preconditioning of
> the inverse of B matrix ? But then B matrix keeps on changing...did u
> condition it after every few iterations ?
>
> Will it be possible to put that code in Breeze since it will be very useful
> to condition other solvers as well...
>
> On Fri, Oct 24, 2014 at 3:02 PM, DB Tsai  wrote:
>>
>> We don't have SVMWithLBFGS, but you can check out how we implement
>> LogisticRegressionWithLBFGS, and we also deal with some condition
>> number improving stuff in LogisticRegressionWithLBFGS which improves
>> the performance dramatically.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Fri, Oct 24, 2014 at 2:39 PM, k.tham  wrote:
>> > Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented.
>> > I'll
>> > try it out when I have time. Thanks!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Saving very large data sets as Parquet on S3

2014-10-24 Thread Haoyuan Li

Daniel,

Currently, having Tachyon will at least help on the input part in this case.

Haoyuan

On Fri, Oct 24, 2014 at 2:01 PM, Daniel Mahler  wrote:

> I am trying to convert some json logs to Parquet and save them on S3.
> In principle this is just
>
> import org.apache.spark._
> val sqlContext = new sql.SQLContext(sc)
> val data = sqlContext.jsonFile(s3n://source/path/*/*",10e-8)
> data.registerAsTable("data")
> data.saveAsParquetFile("s3n://target/path)
>
> This works fine for up to about a 10^9 records, but above that I start
> having problems.
> The first problem I encountered is that after the data file get written out
> writing the Parquet summary file fails.
> While I seem to have all the data saved out,
> programs have a huge have a huge start up time
> when processing parquet files without a summary file.
>
> Writing  the summary file appears to primarily depend
> on on the number of partitions being written,
> and relatively independent of the amount of being written.
> Problems start after about a 1000 partitions,
> writing 1 partitions fails even with repartitioned one days worth of
> data.
>
> My data is very finely partitioned, about 16 log files per hour, or 13K
> files per month.
> The file sizes are very uneven, ranging over several orders of magnitude.
> There are several years of data.
> By my calculations this will produce 10s of terabytes of Parquet files.
>
> The first thing I tried to get around this problem
>  was  passing the data through `coalesce(1000, shuffle=false)` before
> writing.
> This works up to about a month worth of data,
> after that coalescing to 1000 partitions produces parquet files larger
> than 5G
> and writing to S3 fails as a result.
> Also coalescing slows processing down by at least a factor of 2.
> I do not understand why this should happen since I use shuffle=false.
> AFAIK coalesce should just be a bookkeeping trick and the original
> partitions should be processed pretty much the same as before, just with
> their outputs concatenated.
>
> The only other option I can think of is to write each month coalesced
> as a separate data set with its own summary file
> and union the RDDs when processing the data,
> but I do not know how much overhead that will introduce.
>
> I am looking for advice on the best way to save this size data in Parquet
> on S3.
> Apart from solving the the summary file issue i am also looking for ways
> to improve performance.
> Would it make sense to write the data to a local hdfs first and push it to
> S3 with `hadoop distcp`?
> Is putting Tachyon in front of either the input or the output S3 likely to
> help?
> If yes which is likely to help more?
>
> I set options on the master as follows
>
> +
> cat <>~/spark/conf/spark-defaults.conf
> spark.serializerorg.apache.spark.serializer.KryoSerializer
> spark.rdd.compress  true
> spark.shuffle.consolidateFiles  true
> spark.akka.frameSize  20
> EOF
>
> copy-dir /root/spark/conf
> spark/sbin/stop-all.sh
> sleep 5
> spark/sbin/start-all.
> ++
>
> Does this make sense? Should I set some other options?
> I have also asked these questions on StackOverflow where I reproduced the
> full error messages:
>
> +
> http://stackoverflow.com/questions/26332542/how-to-save-a-multi-terabyte-schemardd-in-parquet-format-on-s3
> +
> http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark
> +
> http://stackoverflow.com/questions/26291165/spark-sql-unable-to-complete-writing-parquet-data-with-a-large-number-of-shards
>
> thanks
> Daniel
>
>
>


-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das

@dbtsai for condition number what did you use ? Diagonal preconditioning of
the inverse of B matrix ? But then B matrix keeps on changing...did u
condition it after every few iterations ?

Will it be possible to put that code in Breeze since it will be very useful
to condition other solvers as well...

On Fri, Oct 24, 2014 at 3:02 PM, DB Tsai  wrote:

> We don't have SVMWithLBFGS, but you can check out how we implement
> LogisticRegressionWithLBFGS, and we also deal with some condition
> number improving stuff in LogisticRegressionWithLBFGS which improves
> the performance dramatically.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Fri, Oct 24, 2014 at 2:39 PM, k.tham  wrote:
> > Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented.
> I'll
> > try it out when I have time. Thanks!
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai

We don't have SVMWithLBFGS, but you can check out how we implement
LogisticRegressionWithLBFGS, and we also deal with some condition
number improving stuff in LogisticRegressionWithLBFGS which improves
the performance dramatically.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Oct 24, 2014 at 2:39 PM, k.tham  wrote:
> Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented. I'll
> try it out when I have time. Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: docker spark 1.1.0 cluster

2014-10-24 Thread Nicholas Chammas

Oh snap--first I've heard of this repo.

Marek,

We are having a discussion related to this on SPARK-3821
 you may be interested in.

Nick

On Fri, Oct 24, 2014 at 5:50 PM, Marek Wiewiorka 
wrote:

> Hi,
> here you can find some info regarding 1.0:
> https://github.com/amplab/docker-scripts
>
> Marek
>
> 2014-10-24 23:38 GMT+02:00 Josh J :
>
>> Hi,
>>
>> Is there a dockerfiles available which allow to setup a docker spark
>> 1.1.0 cluster?
>>
>> Thanks,
>> Josh
>>
>
>

Re: docker spark 1.1.0 cluster

2014-10-24 Thread Marek Wiewiorka

Hi,
here you can find some info regarding 1.0:
https://github.com/amplab/docker-scripts

Marek

2014-10-24 23:38 GMT+02:00 Josh J :

> Hi,
>
> Is there a dockerfiles available which allow to setup a docker spark 1.1.0
> cluster?
>
> Thanks,
> Josh
>

docker spark 1.1.0 cluster

2014-10-24 Thread Josh J

Hi,

Is there a dockerfiles available which allow to setup a docker spark 1.1.0
cluster?

Thanks,
Josh

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham

Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented. I'll
try it out when I have time. Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark using non-HDFS data on a distributed file system cluster

2014-10-24 Thread matan

Thanks Marcelo,

Let me spin this towards a parallel trajectory then, as the title change
implies. I think I will further read some of the articles at
https://spark.apache.org/research.html but basically, I understand Spark
keeps the data in-memory, and only pulls from hdfs, or at most writes the
final output of a job back to it, rather than depositing intermediary step
outputs to hdfs files like hadoop map reduce would typically do (?).

Just a small departing question then - does it also work with the *Gluster*
or *Ceph* distributed file systems, not just hdfs? reading some of the
documentation I think if my Scala code can read a file from either of
those, and I have a Spark standalone cluster, or one managed by Mesos, then
I am not bound to hdfs nor to hadoop..

I assume that architecture will consume a lot of bandwidth pulling the
inputs from the file system, and not leverage the placement of its
workers/executors along with the data nodes as it does with hadoop/hdfs
(... thus acting as a compute cluster that takes very long to bootstrap an
application). Perhaps however, hadoop's ability to work over GlusterFS or
CephFS would provide that data-locality benefit after all? or is it bound
specifically to the hdfs api of hadoop, for performing a local data pull on
the storage cluster machines?

Thanks,
Matan

On Fri, Oct 24, 2014 at 4:19 AM, Marcelo Vanzin [via Apache Spark User
List]  wrote:

> You assessment is mostly correct. I think the only thing I'd reword is
> the comment about splitting the data, since Spark itself doesn't do
> that, but read on.
>
> On Thu, Oct 23, 2014 at 6:12 PM, matan <[hidden email]
> > wrote:
> > In case I nailed it, how then does it handle a distributed hdfs file?
> does
> > it pull all of the file to/through one Spark server
>
> Well, Spark here just piggybacks on what HDFS already gives you, since
> it's a distributed file system. In HDFS, files are broken into blocks
> and each block is stored in one or more machine. Spark uses Hadoop
> classes that understand this and give you the information about where
> those blocks are.
>
> If there are Spark executors on those machines holding the blocks,
> Spark will try to run tasks on those executors. Otherwise, it will
> assign some other executor to do the computation, and that executor
> will pull that particular block from HDFS over the network.
>
> It can be a lot more complicated than that (since each file format may
> have different ways of partitioning data, or you can create your own
> way, or repartition data, or Spark may give up waiting for the right
> executor, or...), but that's a good first start.
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: [hidden email]
> 
> For additional commands, e-mail: [hidden email]
> 
>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-using-HDFS-data-newb-tp17169p17170.html
>  To unsubscribe from Spark using HDFS data [newb], click here
> 
> .
> NAML
> 
>

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Spark-using-non-HDFS-data-on-a-distributed-file-system-cluster-tp17239.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is SparkSQL + JDBC server a good approach for caching?

This is very experimental and mostly unsupported, but you can start the
JDBC server from within your own programs

by passing it the HiveContext.

On Fri, Oct 24, 2014 at 2:07 PM, ankits  wrote:

> Thanks for your response Michael.
>
> I'm still not clear on all the details - in particular, how do I take a
> temp
> table created from a SchemaRDD and allow it to be queried using the Thrift
> JDBC server? From the Hive guides, it looks like it only supports loading
> data from files, but I want to query tables stored in memory only via JDBC.
> Is that possible?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196p17235.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das

If the SVM is not already migrated to BFGS, that's the first thing you
should try...Basically following LBFGS Logistic Regression come up with
LBFGS based linear SVM...

About integrating TRON in mllib, David already has a version of TRON in
breeze but someone needs to validate it for linear SVM and do experiment to
see if it can improve upon LBFGS based linear SVM...Based on lib-linear
papers, it should but I don't expect substantial difference...

I am validating TRON for use-cases related to this PR (but I need more
features on top of TRON):

https://github.com/apache/spark/pull/2705

On Fri, Oct 24, 2014 at 2:09 PM, k.tham  wrote:

> Just wondering, any update on this? Is there a plan to integrate CJ's work
> with mllib? I'm asking since  SVM impl in MLLib did not give us good
> results
> and we have to resort to training our svm classifier in a serial manner on
> the driver node with liblinear.
>
> Also, it looks like CJ Lin is coming to the bay area in the coming weeks
> (http://www.meetup.com/sfmachinelearning/events/208078582/) might be a
> good
> time to connect with him.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17236.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham

Just wondering, any update on this? Is there a plan to integrate CJ's work
with mllib? I'm asking since  SVM impl in MLLib did not give us good results
and we have to resort to training our svm classifier in a serial manner on
the driver node with liblinear.

Also, it looks like CJ Lin is coming to the bay area in the coming weeks
(http://www.meetup.com/sfmachinelearning/events/208078582/) might be a good
time to connect with him.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17236.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits

Thanks for your response Michael.

I'm still not clear on all the details - in particular, how do I take a temp
table created from a SchemaRDD and allow it to be queried using the Thrift
JDBC server? From the Hive guides, it looks like it only supports loading
data from files, but I want to query tables stored in memory only via JDBC.
Is that possible?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196p17235.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Fwd: Saving very large data sets as Parquet on S3

2014-10-24 Thread Daniel Mahler

I am trying to convert some json logs to Parquet and save them on S3.
In principle this is just

import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile(s3n://source/path/*/*",10e-8)
data.registerAsTable("data")
data.saveAsParquetFile("s3n://target/path)

This works fine for up to about a 10^9 records, but above that I start
having problems.
The first problem I encountered is that after the data file get written out
writing the Parquet summary file fails.
While I seem to have all the data saved out,
programs have a huge have a huge start up time
when processing parquet files without a summary file.

Writing the summary file appears to primarily depend
on on the number of partitions being written,
and relatively independent of the amount of being written.
Problems start after about a 1000 partitions,
writing 1 partitions fails even with repartitioned one days worth of
data.

My data is very finely partitioned, about 16 log files per hour, or 13K
files per month.
The file sizes are very uneven, ranging over several orders of magnitude.
There are several years of data.
By my calculations this will produce 10s of terabytes of Parquet files.

The first thing I tried to get around this problem
was passing the data through `coalesce(1000, shuffle=false)` before
writing.
This works up to about a month worth of data,
after that coalescing to 1000 partitions produces parquet files larger than
5G
and writing to S3 fails as a result.
Also coalescing slows processing down by at least a factor of 2.
I do not understand why this should happen since I use shuffle=false.
AFAIK coalesce should just be a bookkeeping trick and the original
partitions should be processed pretty much the same as before, just with
their outputs concatenated.

The only other option I can think of is to write each month coalesced
as a separate data set with its own summary file
and union the RDDs when processing the data,
but I do not know how much overhead that will introduce.

I am looking for advice on the best way to save this size data in Parquet
on S3.
Apart from solving the the summary file issue i am also looking for ways to
improve performance.
Would it make sense to write the data to a local hdfs first and push it to
S3 with `hadoop distcp`?
Is putting Tachyon in front of either the input or the output S3 likely to
help?
If yes which is likely to help more?

I set options on the master as follows

+
cat <>~/spark/conf/spark-defaults.conf
spark.serializerorg.apache.spark.serializer.KryoSerializer
spark.rdd.compress true
spark.shuffle.consolidateFiles true
spark.akka.frameSize 20
EOF

copy-dir /root/spark/conf
spark/sbin/stop-all.sh
sleep 5
spark/sbin/start-all.
++

Does this make sense? Should I set some other options?
I have also asked these questions on StackOverflow where I reproduced the
full error messages:

+
http://stackoverflow.com/questions/26332542/how-to-save-a-multi-terabyte-schemardd-in-parquet-format-on-s3
+
http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark
+
http://stackoverflow.com/questions/26291165/spark-sql-unable-to-complete-writing-parquet-data-with-a-large-number-of-shards

thanks
Daniel

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Davies Liu

On Fri, Oct 24, 2014 at 1:37 PM, xuhongnever  wrote:
> Thank you very much.
> Changing to groupByKey works, it runs much more faster.
>
> By the way, could you give me some explanation of the following
> configurations, after reading the official explanation, i'm still confused,
> what's the relationship between them? is there any memory overlap between
> them?
>
> *spark.python.worker.memory
> spark.executor.memory
> spark.driver.memory*

spark.driver.memory is used for JVM together with you local python
scripts (called driver),
spark.executor.memory is used for JVM in spark cluster (called slave
or executor),

In local mode, driver and executor share the same JVM, so
spark.driver.memory is used.

spark.python.worker.memory is used for Python worker in executor.
Because of GIL,
pyspark use multiple python process in the executor, one for each task.
spark.python.worker.memory will tell the python worker to when to
spill the data into disk.
It's not hard limit, so the memory used in python worker maybe is
little higher than it.
If you have enough memory in executor, increase spark.python.worker.memory will
let python worker to use more memory during shuffle (like groupBy()),
which will increase
the performance.

> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-is-running-extremely-slow-with-larger-data-set-like-2G-tp17152p17231.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Workaround for SPARK-1931 not compiling

2014-10-24 Thread Ankur Dave

At 2014-10-23 09:48:55 +0530, Arpit Kumar  wrote:
> error: value partitionBy is not a member of
> org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID,
> org.apache.spark.graphx.Edge[ED])]

Since partitionBy is a member of PairRDDFunctions, it sounds like the implicit 
conversion from RDD to PairRDDFunctions is not getting applied. Does it help to 
"import org.apache.spark.SparkContext._" before applying the workaround?

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread xuhongnever

Thank you very much.
Changing to groupByKey works, it runs much more faster.

By the way, could you give me some explanation of the following
configurations, after reading the official explanation, i'm still confused,
what's the relationship between them? is there any memory overlap between
them?

*spark.python.worker.memory
spark.executor.memory
spark.driver.memory*



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-is-running-extremely-slow-with-larger-data-set-like-2G-tp17152p17231.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Function returning multiple Values - problem with using "if-else"

2014-10-24 Thread HARIPRIYA AYYALASOMAYAJULA

Thanks Sean!

On Fri, Oct 24, 2014 at 3:04 PM, Sean Owen  wrote:

> This is just a Scala question really. Use ++
>
> def inc(x:Int, y:Int) = {
>   if (condition) {
> for(i <- 0 to 7) yield(x, y+i)
>   } else {
> (for(k <- 0 to 24-y) yield(x, y+k)) ++ (for(j<- 0 to y-16)
> yield(x+1,j))
>   }
> }
>
> On Fri, Oct 24, 2014 at 8:52 PM, HARIPRIYA AYYALASOMAYAJULA
>  wrote:
> > Hello,
> >
> > My map function will call the following function (inc) which should yield
> > multiple values:
> >
> >
> > def inc(x:Int, y:Int)
> >  ={
> >  if(condition)
> >  {
> >   for(i <- 0 to 7)  yield(x, y+i)
> >  }
> >
> >  else
> >  {
> >  for(k <- 0 to 24-y)  yield(x, y+k)
> >  for(j<- 0 to y-16) yield(x+1,j)
> >
> >  }
> >  }
> >
> > The "if" part  works fine, but in the else part , if should return 2
> sets of
> > values (from the first for loop and then from the second for loop). But,
> > only the values from second for loop (within else) are returned by the
> > function.
> >
> > I tried alternative ways but I am unable to make the above function give
> me
> > the expected result.
> >
> > Can someone help me understand how yield works, and what should be used
> in
> > place of yield to return multiple values when two for loops have to be
> > used.(like the test case in within else condition)  - If it was in JAVA
> or
> > C, we could simply store the values in an array or a list and return the
> > same but I'm still not clear how it works in Scala/Spark.
> >
> > Thank you for your time.
> >
> > --
> > Regards,
> > Haripriya Ayyalasomayajula
> > Graduate Student
> > Department of Computer Science
> > University of Houston
> > Contact : 650-796-7112
>



-- 
Regards,
Haripriya Ayyalasomayajula
Graduate Student
Department of Computer Science
University of Houston
Contact : 650-796-7112

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread Joseph Bradley

Hi Lokesh,
Glad the update fixed the bug.  maxBins is a parameter you can tune based
on your data.  Essentially, larger maxBins is potentially more accurate,
but will run more slowly and use more memory.  maxBins must be <= training
set size; I would say try some small values (4, 8, 16).  If there is a
difference in performance between those, then you can tune it more;
otherwise, just pick one.
Good luck!
Joseph

On Fri, Oct 24, 2014 at 12:54 AM, lokeshkumar  wrote:

> Hi Joseph,
>
> Thanks for the help.
>
> I have tried this DecisionTree example with the latest spark code and it is
> working fine now. But how do we choose the maxBins for this model?
>
> Thanks
> Lokesh
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decision-Tree-ArrayIndexOutOfBounds-Exception-tp16907p17195.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Function returning multiple Values - problem with using "if-else"

2014-10-24 Thread Sean Owen

This is just a Scala question really. Use ++

def inc(x:Int, y:Int) = {
  if (condition) {
for(i <- 0 to 7) yield(x, y+i)
  } else {
(for(k <- 0 to 24-y) yield(x, y+k)) ++ (for(j<- 0 to y-16) yield(x+1,j))
  }
}

On Fri, Oct 24, 2014 at 8:52 PM, HARIPRIYA AYYALASOMAYAJULA
 wrote:
> Hello,
>
> My map function will call the following function (inc) which should yield
> multiple values:
>
>
> def inc(x:Int, y:Int)
>  ={
>  if(condition)
>  {
>   for(i <- 0 to 7)  yield(x, y+i)
>  }
>
>  else
>  {
>  for(k <- 0 to 24-y)  yield(x, y+k)
>  for(j<- 0 to y-16) yield(x+1,j)
>
>  }
>  }
>
> The "if" part  works fine, but in the else part , if should return 2 sets of
> values (from the first for loop and then from the second for loop). But,
> only the values from second for loop (within else) are returned by the
> function.
>
> I tried alternative ways but I am unable to make the above function give me
> the expected result.
>
> Can someone help me understand how yield works, and what should be used in
> place of yield to return multiple values when two for loops have to be
> used.(like the test case in within else condition)  - If it was in JAVA or
> C, we could simply store the values in an array or a list and return the
> same but I'm still not clear how it works in Scala/Spark.
>
> Thank you for your time.
>
> --
> Regards,
> Haripriya Ayyalasomayajula
> Graduate Student
> Department of Computer Science
> University of Houston
> Contact : 650-796-7112

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Function returning multiple Values - problem with using "if-else"

2014-10-24 Thread HARIPRIYA AYYALASOMAYAJULA

Hello,

My map function will call the following function (inc) which should yield
multiple values:


def inc(x:Int, y:Int)
  ={
  if(condition)
  {
   for(i <- 0 to 7)  yield(x, y+i)
  }

  else
  {
  for(k <- 0 to 24-y)  yield(x, y+k)
  for(j<- 0 to y-16) yield(x+1,j)

  }
  }

The "if" part  works fine, but in the else part , if should return 2 sets
of values (from the first for loop and then from the second for loop). But,
only the values from second for loop (within else) are returned by the
function.

I tried alternative ways but I am unable to make the above function give me
the expected result.

Can someone help me understand how yield works, and what should be used in
place of yield to return multiple values when two for loops have to be
used.(like the test case in within else condition)  - If it was in JAVA or
C, we could simply store the values in an array or a list and return the
same but I'm still not clear how it works in Scala/Spark.

Thank you for your time.

-- 
Regards,
Haripriya Ayyalasomayajula
Graduate Student
Department of Computer Science
University of Houston
Contact : 650-796-7112

Re: How to use FlumeInputDStream in spark cluster?

2014-10-24 Thread BigDataUser

I am running FlumeEventCount program in CDH 5.0.1 which has Spark 0.9.0. The
program runs fine in local process as well as standalone cluster mode.
However, the program fails in YARN mode. I see the following error:
INFO scheduler.DAGScheduler: Stage 2 (runJob at
NetworkInputTracker.scala:182) finished in 0.215 s
INFO spark.SparkContext: Job finished: runJob at
NetworkInputTracker.scala:182, took 0.224696381 s
ERROR scheduler.NetworkInputTracker: De-registered receiver for network
stream 0 with message org.jboss.netty.channel.ChannelException: Failed to
bind to: x/xx.xxx.x.xx:41415
Is there a workaround for this ?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p17226.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Is SparkSQL + JDBC server a good approach for caching?

That works perfect. Thanks again Michael

On Fri, Oct 24, 2014 at 3:10 PM, Michael Armbrust 
wrote:

> It won't be transparent, but you can do so something like:
>
> CACHE TABLE newData AS SELECT * FROM allData WHERE date > "..."
>
> and then query newData.
>
> On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood 
> wrote:
>
>> Is there a way to cache certain (or most latest) partitions of certain
>> tables ?
>>
>> On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust > > wrote:
>>
>>> It does have support for caching using either CACHE TABLE  or
>>> CACHE TABLE  AS SELECT 
>>>
>>> On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:
>>>
 I want to set up spark SQL to allow ad hoc querying over the last X
 days of
 processed data, where the data is processed through spark. This would
 also
 have to cache data (in memory only), so the approach I was thinking of
 was
 to build a layer that persists the appropriate RDDs and stores them in
 memory.

 I see spark sql allows ad hoc querying through JDBC though I have never
 used
 that before. Will using JDBC offer any advantages (e.g does it have
 built in
 support for caching?) over rolling my own solution for this use case?

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>

Re: Is SparkSQL + JDBC server a good approach for caching?

It won't be transparent, but you can do so something like:

CACHE TABLE newData AS SELECT * FROM allData WHERE date > "..."

and then query newData.

On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood  wrote:

> Is there a way to cache certain (or most latest) partitions of certain
> tables ?
>
> On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust 
> wrote:
>
>> It does have support for caching using either CACHE TABLE  or
>> CACHE TABLE  AS SELECT 
>>
>> On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:
>>
>>> I want to set up spark SQL to allow ad hoc querying over the last X days
>>> of
>>> processed data, where the data is processed through spark. This would
>>> also
>>> have to cache data (in memory only), so the approach I was thinking of
>>> was
>>> to build a layer that persists the appropriate RDDs and stores them in
>>> memory.
>>>
>>> I see spark sql allows ad hoc querying through JDBC though I have never
>>> used
>>> that before. Will using JDBC offer any advantages (e.g does it have
>>> built in
>>> support for caching?) over rolling my own solution for this use case?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>

Re: Is SparkSQL + JDBC server a good approach for caching?

Is there a way to cache certain (or most latest) partitions of certain
tables ?

On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust 
wrote:

> It does have support for caching using either CACHE TABLE  or
> CACHE TABLE  AS SELECT 
>
> On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:
>
>> I want to set up spark SQL to allow ad hoc querying over the last X days
>> of
>> processed data, where the data is processed through spark. This would also
>> have to cache data (in memory only), so the approach I was thinking of was
>> to build a layer that persists the appropriate RDDs and stores them in
>> memory.
>>
>> I see spark sql allows ad hoc querying through JDBC though I have never
>> used
>> that before. Will using JDBC offer any advantages (e.g does it have built
>> in
>> support for caching?) over rolling my own solution for this use case?
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

Under which user is the program run on slaves?

2014-10-24 Thread jan.zikes

Hi,

I would like to ask under which user is run the Spark program on slaves? My 
Spark is running on top of the Yarn.

The reason I am asking for this is that I need to download data for NLTK 
library and these data are dowloaded for specific python user and I am 
currently struggling with this. 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-problem-with-textblob-from-NLTK-used-in-map-td17211.html

Than you in advance for any ideas. 
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Job cancelled because SparkContext was shut down - failures!

These seem like s3 connection errors for the table data. Wondering, since
we don't see that many failures on hive. I also set the spark.task.maxFailures
= 15.

On Fri, Oct 24, 2014 at 12:15 PM, Sadhan Sood  wrote:

> Hi,
>
> Trying to run a query on spark-sql but it keeps failing with this error on
> the cli ( we are running spark-sql on a yarn cluster):
>
>
> org.apache.spark.SparkException: Job cancelled because SparkContext was
> shut down
>   at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:700)
>   at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:699)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:699)
>   at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1405)
>   at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
>   at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1352)
>   at
> akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>   at
> akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>   at akka.actor.ActorCell.terminate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> On the UI, I see there are connection failures in mapPartition stage:
>
> java.net.SocketTimeoutException: Read timed out
> java.net.SocketInputStream.socketRead0(Native Method)
> java.net.SocketInputStream.read(SocketInputStream.java:150)
> java.net.SocketInputStream.read(SocketInputStream.java:121)
> java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
> sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
> 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
> 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
> org.apache.spark.util.Utils$.fetchFile(Utils.scala:362)
> 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:331)
> 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:329)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:329)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
>
>

Re: PySpark problem with textblob from NLTK used in map

2014-10-24 Thread jan.zikes


Maybe I'll add one more question. I think that the problem is with user, so I 
would like to ask under which user are run Spark jobs on slaves?
__


Hi,

I am trying to implement function for text preprocessing in PySpark. I have amazon EMR 
where I am installing Python dependencies from the bootstrap script. One of these 
dependencies is textblob "python -m textblob.download_corpora". Then I am 
trying to use it locally on all the machines without any problem.

But when I am trying to run it from Spark then I am getting following error:


INFO: File "/home/hadoop/spark/python/pyspark/rdd.py", line 1324, in 
saveAsTextFile
INFO: keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
INFO: File 
"/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
538, in __call__
INFO: File 
"/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
300, in get_return_value
INFO: py4j.protocol.Py4JJavaError: An error occurred while calling 
o54.saveAsTextFile.
INFO: : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
8 in stage 1.0 failed 4 times, most recent failure: Lost task 8.3 in stage 1.0 
(TID 40, ip-172-31-3-125.ec2.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
INFO: File "/home/hadoop/spark/python/pyspark/worker.py", line 79, in main
INFO: serializer.dump_stream(func(split_index, iterator), outfile)
INFO: File "/home/hadoop/spark/python/pyspark/serializers.py", line 127, in 
dump_stream
INFO: for obj in iterator:
INFO: File "/home/hadoop/spark/python/pyspark/rdd.py", line 1316, in func
INFO: for x in iterator:
INFO: File 
"/home/hadoop/pyckage/package_topics/package_topics/preprocessor.py", line 40, 
in make_tokens
INFO: File "./package_topics.zip/package_topics/data_utils.py", line 76, in 
preprocess_text
INFO: for noun_phrase in TextBlob(' '.join(tokens)).noun_phrases
INFO: File "/usr/lib/python2.6/site-packages/textblob/decorators.py", line 24, 
in __get__
INFO: value = obj.__dict__[self.func.__name__] = self.func(obj)
INFO: File "/usr/lib/python2.6/site-packages/textblob/blob.py", line 431, in 
noun_phrases
INFO: for phrase in self.np_extractor.extract(self.raw)
INFO: File "/usr/lib/python2.6/site-packages/textblob/en/np_extractors.py", 
line 138, in extract
INFO: self.train()
INFO: File "/usr/lib/python2.6/site-packages/textblob/decorators.py", line 38, 
in decorated
INFO: raise MissingCorpusError()
INFO: MissingCorpusError:
INFO: Looks like you are missing some required data for this feature.
INFO: 
INFO: To download the necessary data, simply run
INFO: 
INFO: python -m textblob.download_corpora
INFO: 
INFO: or use the NLTK downloader to download the missing data: 
http://nltk.org/data.html
INFO: If this doesn't fix the problem, file an issue at 
https://github.com/sloria/TextBlob/issues.
INFO: 
INFO: 
INFO: org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
INFO: org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
INFO: org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
INFO: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
INFO: org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
INFO: org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
INFO: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
INFO: org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
INFO: org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
INFO: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
INFO: org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
INFO: org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
INFO: org.apache.spark.scheduler.Task.run(Task.scala:54)
INFO: org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
INFO: 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
INFO: 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
INFO: java.lang.Thread.run(Thread.java:745)
INFO: Driver stacktrace:
INFO: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
INFO: at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
INFO: at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
INFO: at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
INFO: at scala.Option.foreach(Option.scala:236)
INFO: at 
o

Re: [Spark SQL] Setting variables

You might be hitting: https://issues.apache.org/jira/browse/SPARK-4037

On Fri, Oct 24, 2014 at 11:32 AM, Yana Kadiyska 
wrote:

> Hi all,
>
> I'm trying to set a pool for a JDBC session. I'm connecting to the thrift
> server via JDBC client.
>
> My installation appears to be good(queries run fine), I can see the pools
> in the UI, but any attempt to set a variable (I tried
> spark.sql.shuffle.partitions and spark.sql.thriftserver.scheduler.pool)
> result in the exception below (trace is from Thriftserver log)
>
>
> Any thoughts on what I'm doing wrong? (I am on master, built today)
>
> SET spark.sql.thriftserver.scheduler.pool=mypool;select count(*) from mytable;
>
>
> ==
>
> 14/10/24 18:17:10 ERROR server.SparkSQLOperationManager: Error executing
> query:
> java.lang.NullPointerException
> at
> org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
> at
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
> at
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:244)
> at
> org.apache.spark.sql.execution.SetCommand.sideEffectResult$lzycompute(commands.scala:64)
> at
> org.apache.spark.sql.execution.SetCommand.sideEffectResult(commands.scala:55)
> at
> org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
> at
> org.apache.spark.sql.execution.SetCommand.execute(commands.scala:51)
> at
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:357)
> at
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:357)
> at
> org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
> at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:104)
> at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:99)
> at
> org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:172)
> at
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
> at
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
> at
> org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
> at
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
> at
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
> at
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
> at
> org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at
> org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
> at
> org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> at
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
> at
> org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
>
>

Re: Is SparkSQL + JDBC server a good approach for caching?

It does have support for caching using either CACHE TABLE  or
CACHE TABLE  AS SELECT 

On Fri, Oct 24, 2014 at 1:05 AM, ankits  wrote:

> I want to set up spark SQL to allow ad hoc querying over the last X days of
> processed data, where the data is processed through spark. This would also
> have to cache data (in memory only), so the approach I was thinking of was
> to build a layer that persists the appropriate RDDs and stores them in
> memory.
>
> I see spark sql allows ad hoc querying through JDBC though I have never
> used
> that before. Will using JDBC offer any advantages (e.g does it have built
> in
> support for caching?) over rolling my own solution for this use case?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

[Spark SQL] Setting variables

2014-10-24 Thread Yana Kadiyska

Hi all,

I'm trying to set a pool for a JDBC session. I'm connecting to the thrift
server via JDBC client.

My installation appears to be good(queries run fine), I can see the pools
in the UI, but any attempt to set a variable (I tried
spark.sql.shuffle.partitions and spark.sql.thriftserver.scheduler.pool)
result in the exception below (trace is from Thriftserver log)


Any thoughts on what I'm doing wrong? (I am on master, built today)

SET spark.sql.thriftserver.scheduler.pool=mypool;select count(*) from mytable;


==

14/10/24 18:17:10 ERROR server.SparkSQLOperationManager: Error executing
query:
java.lang.NullPointerException
at
org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309)
at
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
at
org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:244)
at
org.apache.spark.sql.execution.SetCommand.sideEffectResult$lzycompute(commands.scala:64)
at
org.apache.spark.sql.execution.SetCommand.sideEffectResult(commands.scala:55)
at
org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
at
org.apache.spark.sql.execution.SetCommand.execute(commands.scala:51)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:357)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:357)
at
org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:104)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:99)
at
org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:172)
at
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193)
at
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175)
at
org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150)
at
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207)
at
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
at
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
at
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
at
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
at
org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Re: spark-submit memory too larger

2014-10-24 Thread Sameer Farooqui

That does seem a bit odd. How many Executors are running under this Driver?

Does the spark-submit process start out using ~60GB of memory right away or
does it start out smaller and slowly build up to that high? If so, how long
does it take to get that high?

Also, which version of Spark are you using?

SameerF

On Fri, Oct 24, 2014 at 8:07 AM, marylucy  wrote:

> i used standalone spark，set spark.driver.memory=5g,but spark-submit
> process use 57g memory, is this normal?how to decrease it?

Job cancelled because SparkContext was shut down - failures!

Hi,

Trying to run a query on spark-sql but it keeps failing with this error on
the cli ( we are running spark-sql on a yarn cluster):


org.apache.spark.SparkException: Job cancelled because SparkContext was
shut down
  at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:700)
  at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:699)
  at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
  at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:699)
  at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1405)
  at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
  at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1352)
  at
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
  at
akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
  at akka.actor.ActorCell.terminate(ActorCell.scala:369)
  at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
  at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
  at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

On the UI, I see there are connection failures in mapPartition stage:

java.net.SocketTimeoutException: Read timed out
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:150)
java.net.SocketInputStream.read(SocketInputStream.java:121)
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
java.io.BufferedInputStream.read(BufferedInputStream.java:345)
sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)

sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)

sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:362)

org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:331)

org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:329)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:329)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Re: Measuring execution time

2014-10-24 Thread Reza Zadeh

The Spark UI has timing information. When running locally, it is at
http://localhost:4040
Otherwise the url to the UI is printed out onto the console when you
startup spark shell or run a job.

Reza

On Fri, Oct 24, 2014 at 5:51 AM, shahab  wrote:

> Hi,
>
> I just wonder if there is any built-in function to get the execution time
> for each of the jobs/tasks ? in simple words, how can I find out how much
> time is spent on loading/mapping/filtering/reducing part of a job? I can
> see printout in the logs but since there is no clear presentation of the
> underlying DAG and associated tasks it is hard to find what I am looking
> for.
>
> best,
> /Shahab
>

Re: How can I set the IP a worker use?

2014-10-24 Thread Theodore Si

I found this. So it seems that we should use -h or --host instead of -i 
and --ip.


 -i HOST, --ip IP Hostname to listen on (deprecated, please use 
--host or -h)

  -h HOST, --host HOST Hostname to listen on


在 10/24/2014 3:35 PM, Akhil Das 写道:
Try using the --ip parameter 
 
while starting the worker. like:


spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker --ip 
1.2.3.4 spark://1.2.3.4:7077 


Thanks
Best Regards

On Fri, Oct 24, 2014 at 12:34 PM, Theodore Si > wrote:


Hi all,

I have two network interface card on one node, one is a Eithernet
card, the other Infiniband HCA.
The master has two IP addresses, lets say 1.2.3.4 (for Eithernet
card) and 2.3.4.5 (for HCA).
I can start the master by
export SPARK_MASTER_IP='1.2.3.4';sbin/start-master.sh
to let master listen on 1.2.3.4:7077 

But when I connect the worker to the master by using
spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker
spark://1.2.3.4:7077 

I will get errors, since it is using its HCA card. How can I let
the worker use its Eithernet card?

Thanks

PySpark problem with textblob from NLTK used in map

2014-10-24 Thread jan.zikes


Hi,

I am trying to implement function for text preprocessing in PySpark. I have amazon EMR 
where I am installing Python dependencies from the bootstrap script. One of these 
dependencies is textblob "python -m textblob.download_corpora". Then I am 
trying to use it locally on all the machines without any problem.

But when I am trying to run it from Spark then I am getting following error:


INFO: File "/home/hadoop/spark/python/pyspark/rdd.py", line 1324, in 
saveAsTextFile
INFO: keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
INFO: File 
"/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
538, in __call__
INFO: File 
"/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
300, in get_return_value
INFO: py4j.protocol.Py4JJavaError: An error occurred while calling 
o54.saveAsTextFile.
INFO: : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
8 in stage 1.0 failed 4 times, most recent failure: Lost task 8.3 in stage 1.0 
(TID 40, ip-172-31-3-125.ec2.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
INFO: File "/home/hadoop/spark/python/pyspark/worker.py", line 79, in main
INFO: serializer.dump_stream(func(split_index, iterator), outfile)
INFO: File "/home/hadoop/spark/python/pyspark/serializers.py", line 127, in 
dump_stream
INFO: for obj in iterator:
INFO: File "/home/hadoop/spark/python/pyspark/rdd.py", line 1316, in func
INFO: for x in iterator:
INFO: File 
"/home/hadoop/pyckage/package_topics/package_topics/preprocessor.py", line 40, 
in make_tokens
INFO: File "./package_topics.zip/package_topics/data_utils.py", line 76, in 
preprocess_text
INFO: for noun_phrase in TextBlob(' '.join(tokens)).noun_phrases
INFO: File "/usr/lib/python2.6/site-packages/textblob/decorators.py", line 24, 
in __get__
INFO: value = obj.__dict__[self.func.__name__] = self.func(obj)
INFO: File "/usr/lib/python2.6/site-packages/textblob/blob.py", line 431, in 
noun_phrases
INFO: for phrase in self.np_extractor.extract(self.raw)
INFO: File "/usr/lib/python2.6/site-packages/textblob/en/np_extractors.py", 
line 138, in extract
INFO: self.train()
INFO: File "/usr/lib/python2.6/site-packages/textblob/decorators.py", line 38, 
in decorated
INFO: raise MissingCorpusError()
INFO: MissingCorpusError:
INFO: Looks like you are missing some required data for this feature.
INFO: 
INFO: To download the necessary data, simply run
INFO: 
INFO: python -m textblob.download_corpora
INFO: 
INFO: or use the NLTK downloader to download the missing data: 
http://nltk.org/data.html
INFO: If this doesn't fix the problem, file an issue at 
https://github.com/sloria/TextBlob/issues.
INFO: 
INFO: 
INFO: org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
INFO: org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
INFO: org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
INFO: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
INFO: org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
INFO: org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
INFO: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
INFO: org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
INFO: org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
INFO: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
INFO: org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
INFO: org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
INFO: org.apache.spark.scheduler.Task.run(Task.scala:54)
INFO: org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
INFO: 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
INFO: 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
INFO: java.lang.Thread.run(Thread.java:745)
INFO: Driver stacktrace:
INFO: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
INFO: at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
INFO: at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
INFO: at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
INFO: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
INFO: at scala.Option.foreach(Option.scala:236)
INFO: at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
INFO: at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
INF

spark-submit memory too larger

2014-10-24 Thread marylucy

i used standalone spark，set spark.driver.memory=5g,but spark-submit process use 
57g memory, is this normal?how to decrease it?

Re: Problem packing spark-assembly jar

2014-10-24 Thread Yana Kadiyska

thanks -- that was it. I could swear this had worked for me before and
indeed it's fixed this morning.

On Fri, Oct 24, 2014 at 6:34 AM, Sean Owen  wrote:

> I imagine this is a side effect of the change that was just reverted,
> related to publishing the effective pom? sounds related but I don't
> know.
>
> On Fri, Oct 24, 2014 at 2:22 AM, Yana Kadiyska 
> wrote:
> > Hi folks,
> >
> > I'm trying to deploy the latest from master branch and having some
> trouble
> > with the assembly jar.
> >
> > In the spark-1.1 official distribution(I use cdh version), I see the
> > following jars, where spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar
> > contains a ton of stuff:
> > datanucleus-api-jdo-3.2.1.jar
> > datanucleus-core-3.2.2.jar
> > datanucleus-rdbms-3.2.1.jar
> > spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar
> > spark-examples-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar
> > spark-hive-thriftserver_2.10-1.1.0.jar
> > spark-hive_2.10-1.1.0.jar
> > spark-sql_2.10-1.1.0.jar
> >
> >
> > I tried to create a similar distribution off of master running
> > mvn -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests -Pbigtop-dist
> > package
> > and
> > ./make-distribution.sh -Pbigtop-dist -Phive
> > -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests
> >
> > but in either case all I get in spark-assembly is near empty:
> >
> > spark_official/dist/lib$ jar -tvf
> > spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar
> >
> > META-INF/
> > META-INF/MANIFEST.MF
> > org/
> > org/apache/
> > org/apache/spark/
> > org/apache/spark/unused/
> > org/apache/spark/unused/UnusedStubClass.class
> > META-INF/maven/
> > META-INF/maven/org.spark-project.spark/
> > META-INF/maven/org.spark-project.spark/unused/
> > META-INF/maven/org.spark-project.spark/unused/pom.xml
> > META-INF/maven/org.spark-project.spark/unused/pom.properties
> > META-INF/NOTICE
> >
> > Any advice on how to get spark-core and the rest packaged into the
> assembly
> > jar -- I'd like to have fewer things to copy around.
>

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Aniket Bhatnagar

Just curious... Why would you not store the processed results in regular
relational database? Not sure what you meant by persist the appropriate
RDDs. Did you mean output of your job will be RDDs?

On 24 October 2014 13:35, ankits  wrote:

> I want to set up spark SQL to allow ad hoc querying over the last X days of
> processed data, where the data is processed through spark. This would also
> have to cache data (in memory only), so the approach I was thinking of was
> to build a layer that persists the appropriate RDDs and stores them in
> memory.
>
> I see spark sql allows ad hoc querying through JDBC though I have never
> used
> that before. Will using JDBC offer any advantages (e.g does it have built
> in
> support for caching?) over rolling my own solution for this use case?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Spark doesn't retry task while writing to HDFS

2014-10-24 Thread Aniket Bhatnagar

Hi all

I have written a job that reads data from HBASE and writes to HDFS (fairly
simple). While running the job, I noticed that a few of the tasks failed
with the following error. Quick googling on the error suggests that its an
unexplained error and is perhaps intermittent. What I am curious to know is
why didn't Spark retry writing file to HDFS? It just shows it as failed job
in Spark UI.

Error:
java.io.IOException: All datanodes x.x.x.x: are bad. Aborting...

org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)

org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)

org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)


Thanks,
Aniket

Re: Memory requirement of using Spark

2014-10-24 Thread jian.t

Thanks Akhil.

I searched DISK_AND_MEMORY_SER trying to figure out how it works, and I
cannot find any documentation on that. Do you have a link for that?

If what DISK_AND_MEMORY_SER does is reading and writing to the disk with
some memory caching, does that mean the output will be written to disk for
each join, and then read back into memory for the next join? If so, how it
is more performant than HIVE query model?

Again I am new to this, so I might ask something stupid.

Thanks,
JT




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-requirement-of-using-Spark-tp17177p17204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-24 Thread Marius Soutier

Hi,

I’m running a job whose simple task it is to find files that cannot be read 
(sometimes our gz files are corrupted).

With 1.0.x, this worked perfectly. Since 1.1.0 however, I’m getting an 
exception: 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)

sc.wholeTextFiles(input)
  .foreach { case (fileName, _) =>
try {
  println(s"Scanning $fileName")
  sc.textFile(fileName).take(1)
  println(s"Successfully scanned $fileName")
} catch {
  case t: Throwable => println(s"Failed to process $fileName, reason 
${t.getStackTrace.head}")
}
  }


Also since 1.1.0, the printlns are no longer visible on the console, only in 
the Spark UI worker output.


Thanks for any help
- Marius


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Measuring execution time

2014-10-24 Thread shahab

Hi,

I just wonder if there is any built-in function to get the execution time
for each of the jobs/tasks ? in simple words, how can I find out how much
time is spent on loading/mapping/filtering/reducing part of a job? I can
see printout in the logs but since there is no clear presentation of the
underlying DAG and associated tasks it is hard to find what I am looking
for.

best,
/Shahab

Broadcast failure with variable size of ~ 500mb with "key already cancelled ?"

2014-10-24 Thread htailor

Hi All,

I am relatively new to spark and currently having troubles with broadcasting
large variables ~500mb in size. Th
e broadcast fails with an error shown below and the memory usage on the
hosts also blow up. 

Our hardware consists of 8 hosts (1 x 64gb (driver) and 7 x 32gb (workers))
and we are using Spark 1.1 (Python) via Cloudera CDH 5.2.

We have managed to replicate the error using a test script shown below. I
would be interested to know if anyone has seen this before with broadcasting
or know of a fix.

=== ERROR ==

14/10/24 08:20:04 INFO BlockManager: Found block rdd_11_31 locally
14/10/24 08:20:08 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@fbc6caf
14/10/24 08:20:08 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
14/10/24 08:20:08 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(pigeon3.ldn.ebs.io,55316)
14/10/24 08:20:08 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@fbc6caf
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/24 08:20:13 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
14/10/24 08:20:13 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(pigeon7.ldn.ebs.io,52524)
14/10/24 08:20:15 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
14/10/24 08:20:15 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@3ecfdb7e
14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
14/10/24 08:20:15 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(toppigeon.ldn.ebs.io,34370)
14/10/24 08:20:15 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@3ecfdb7e
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
14/10/24 08:20:19 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
14/10/24 08:20:19 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
14/10/24 08:20:27 ERROR SendingConnection: Exception while reading
SendingConnection to ConnectionManagerId(pigeon8.ldn.ebs.io,48628)
java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/10/24 08:20:27 ERROR SendingConnection: Exception while reading
SendingConnection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996)
java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/10/24 08:20:27 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(pigeon6.ldn.ebs.io,44996)


=== PYTHON SCRIPT ==

#!/usr/bin/pyspark

import subprocess
from random import choice
import string

from pyspark import SparkContext, SparkConf

path_hdfs_broadcast_test = "broadcast_test/general_test_test"

subprocess.Popen(["hdfs", "dfs", "-rm", "-r", path_hdfs_broadcast_test],
stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

sconf = SparkConf().se

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Gerard Maas

There's an issue in the way case classes are handled on the REPL and you
won't be able to use a case class as a key. See:
https://issues.apache.org/jira/browse/SPARK-2620

BTW, case classes already implement equals and hashCode. It's not needed to
implement those again.

Given that you already implement equals and hashCode, try just dropping
"case"  and create a normal class. It might work that way.

-kr, Gerard.

On Fri, Oct 24, 2014 at 11:20 AM, Jaonary Rabarisoa 
wrote:

> In the documentation it's said that we need to override the hashCode and
> equals methods. Without overriding it does't work too. I get this error on
> REPL and stand alone application
>
> On Fri, Oct 24, 2014 at 3:29 AM, Prashant Sharma 
> wrote:
>
>> Are you doing this in REPL ? Then there is a bug filed for this, I just
>> can't recall the bug ID at the moment.
>>
>> Prashant Sharma
>>
>>
>>
>> On Fri, Oct 24, 2014 at 4:07 AM, Niklas Wilcke <
>> 1wil...@informatik.uni-hamburg.de> wrote:
>>
>>>  Hi Jao,
>>>
>>> I don't really know why this doesn't work but I have two hints.
>>> You don't need to override hashCode and equals. The modifier case is
>>> doing that for you. Writing
>>>
>>> case class PersonID(id: String)
>>>
>>> would be enough to get the class you want I think.
>>> If I change the type of the id param to Int it works for me but I don't
>>> know why.
>>>
>>> case class PersonID(id: Int)
>>>
>>> Looks like a strange behavior to me. Have a try.
>>>
>>> Good luck,
>>> Niklas
>>>
>>>
>>> On 23.10.2014 21:52, Jaonary Rabarisoa wrote:
>>>
>>>  Hi all,
>>>
>>>  I have the following case class that I want to use as a key in a
>>> key-value rdd. I defined the equals and hashCode methode but it's not
>>> working. What I'm doing wrong ?
>>>
>>>  *case class PersonID(id: String) {*
>>>
>>> * override def hashCode = id.hashCode*
>>>
>>> * override def equals(other: Any) = other match {*
>>>
>>> * case that: PersonID => this.id  == that.id
>>>  && this.getClass == that.getClass*
>>> * case _ => false*
>>> * }   *
>>> * }   *
>>>
>>>
>>> * val p = sc.parallelize((1 until 10).map(x => (PersonID("1"),x )))*
>>>
>>>
>>>  *p.groupByKey.collect foreach println*
>>>
>>>  *(PersonID(1),CompactBuffer(5))*
>>> *(PersonID(1),CompactBuffer(6))*
>>> *(PersonID(1),CompactBuffer(7))*
>>> *(PersonID(1),CompactBuffer(8, 9))*
>>> *(PersonID(1),CompactBuffer(1))*
>>> *(PersonID(1),CompactBuffer(2))*
>>> *(PersonID(1),CompactBuffer(3))*
>>> *(PersonID(1),CompactBuffer(4))*
>>>
>>>
>>>  Best,
>>>
>>>  Jao
>>>
>>>
>>>
>>
>

Re: Problem packing spark-assembly jar

2014-10-24 Thread Sean Owen

I imagine this is a side effect of the change that was just reverted,
related to publishing the effective pom? sounds related but I don't
know.

On Fri, Oct 24, 2014 at 2:22 AM, Yana Kadiyska  wrote:
> Hi folks,
>
> I'm trying to deploy the latest from master branch and having some trouble
> with the assembly jar.
>
> In the spark-1.1 official distribution(I use cdh version), I see the
> following jars, where spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar
> contains a ton of stuff:
> datanucleus-api-jdo-3.2.1.jar
> datanucleus-core-3.2.2.jar
> datanucleus-rdbms-3.2.1.jar
> spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar
> spark-examples-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar
> spark-hive-thriftserver_2.10-1.1.0.jar
> spark-hive_2.10-1.1.0.jar
> spark-sql_2.10-1.1.0.jar
>
>
> I tried to create a similar distribution off of master running
> mvn -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests -Pbigtop-dist
> package
> and
> ./make-distribution.sh -Pbigtop-dist -Phive
> -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests
>
> but in either case all I get in spark-assembly is near empty:
>
> spark_official/dist/lib$ jar -tvf
> spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar
>
> META-INF/
> META-INF/MANIFEST.MF
> org/
> org/apache/
> org/apache/spark/
> org/apache/spark/unused/
> org/apache/spark/unused/UnusedStubClass.class
> META-INF/maven/
> META-INF/maven/org.spark-project.spark/
> META-INF/maven/org.spark-project.spark/unused/
> META-INF/maven/org.spark-project.spark/unused/pom.xml
> META-INF/maven/org.spark-project.spark/unused/pom.properties
> META-INF/NOTICE
>
> Any advice on how to get spark-core and the rest packaged into the assembly
> jar -- I'd like to have fewer things to copy around.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Jaonary Rabarisoa

In the documentation it's said that we need to override the hashCode and
equals methods. Without overriding it does't work too. I get this error on
REPL and stand alone application

On Fri, Oct 24, 2014 at 3:29 AM, Prashant Sharma 
wrote:

> Are you doing this in REPL ? Then there is a bug filed for this, I just
> can't recall the bug ID at the moment.
>
> Prashant Sharma
>
>
>
> On Fri, Oct 24, 2014 at 4:07 AM, Niklas Wilcke <
> 1wil...@informatik.uni-hamburg.de> wrote:
>
>>  Hi Jao,
>>
>> I don't really know why this doesn't work but I have two hints.
>> You don't need to override hashCode and equals. The modifier case is
>> doing that for you. Writing
>>
>> case class PersonID(id: String)
>>
>> would be enough to get the class you want I think.
>> If I change the type of the id param to Int it works for me but I don't
>> know why.
>>
>> case class PersonID(id: Int)
>>
>> Looks like a strange behavior to me. Have a try.
>>
>> Good luck,
>> Niklas
>>
>>
>> On 23.10.2014 21:52, Jaonary Rabarisoa wrote:
>>
>>  Hi all,
>>
>>  I have the following case class that I want to use as a key in a
>> key-value rdd. I defined the equals and hashCode methode but it's not
>> working. What I'm doing wrong ?
>>
>>  *case class PersonID(id: String) {*
>>
>> * override def hashCode = id.hashCode*
>>
>> * override def equals(other: Any) = other match {*
>>
>> * case that: PersonID => this.id  == that.id
>>  && this.getClass == that.getClass*
>> * case _ => false*
>> * }   *
>> * }   *
>>
>>
>> * val p = sc.parallelize((1 until 10).map(x => (PersonID("1"),x )))*
>>
>>
>>  *p.groupByKey.collect foreach println*
>>
>>  *(PersonID(1),CompactBuffer(5))*
>> *(PersonID(1),CompactBuffer(6))*
>> *(PersonID(1),CompactBuffer(7))*
>> *(PersonID(1),CompactBuffer(8, 9))*
>> *(PersonID(1),CompactBuffer(1))*
>> *(PersonID(1),CompactBuffer(2))*
>> *(PersonID(1),CompactBuffer(3))*
>> *(PersonID(1),CompactBuffer(4))*
>>
>>
>>  Best,
>>
>>  Jao
>>
>>
>>
>

Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits

I want to set up spark SQL to allow ad hoc querying over the last X days of
processed data, where the data is processed through spark. This would also
have to cache data (in memory only), so the approach I was thinking of was
to build a layer that persists the appropriate RDDs and stores them in
memory.

I see spark sql allows ad hoc querying through JDBC though I have never used
that before. Will using JDBC offer any advantages (e.g does it have built in
support for caching?) over rolling my own solution for this use case?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-SparkSQL-JDBC-server-a-good-approach-for-caching-tp17196.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread lokeshkumar

Hi Joseph,

Thanks for the help.

I have tried this DecisionTree example with the latest spark code and it is
working fine now. But how do we choose the maxBins for this model?

Thanks
Lokesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decision-Tree-ArrayIndexOutOfBounds-Exception-tp16907p17195.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: How can I set the IP a worker use?

Try using the --ip parameter

while starting the worker. like:

spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker --ip
1.2.3.4 spark://1.2.3.4:7077

Thanks
Best Regards

On Fri, Oct 24, 2014 at 12:34 PM, Theodore Si  wrote:

>  Hi all,
>
> I have two network interface card on one node, one is a Eithernet card,
> the other Infiniband HCA.
> The master has two IP addresses, lets say 1.2.3.4 (for Eithernet card) and
> 2.3.4.5 (for HCA).
> I can start the master by
> export SPARK_MASTER_IP='1.2.3.4';sbin/start-master.sh
> to let master listen on 1.2.3.4:7077
>
> But when I connect the worker to the master by using
> spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker spark://
> 1.2.3.4:7077
>
> I will get errors, since it is using its HCA card. How can I let the
> worker use its Eithernet card?
>
> Thanks
>

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Davies Liu

On Thu, Oct 23, 2014 at 3:14 PM, xuhongnever  wrote:
> my code is here:
>
> from pyspark import SparkConf, SparkContext
>
> def Undirect(edge):
> vector = edge.strip().split('\t')
> if(vector[0].isdigit()):
> return [(vector[0], vector[1])]
> return []
>
>
> conf = SparkConf()
> conf.setMaster("spark://compute-0-14:7077")
> conf.setAppName("adjacencylist")
> conf.set("spark.executor.memory", "1g")

Use more memory to gain better performance, or spark will keep
spilling the data into disks, that is much slower.
You also could give more memory to Python worker by set
spark.python.worker.memory=1g or 2g

> sc = SparkContext(conf = conf)
>
> file = sc.textFile("file:///home/xzhang/data/soc-LiveJournal1.txt")
> records = file.flatMap(lambda line: Undirect(line)).reduceByKey(lambda a, b:
> a + "\t" + b )

a + "\t" + b will be very slow, if the number of values is large,
groupByKey() will be better than it.

> #print(records.count())
> #records = records.sortByKey()
> records = records.map(lambda line: line[0] + "\t" + line[1])
> records.saveAsTextFile("file:///home/xzhang/data/result")
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-is-running-extremely-slow-with-larger-data-set-like-2G-tp17152p17153.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Memory requirement of using Spark

You can use spark-sql to solve this usecase, and you don't need to have
800G of memory (but of course if you are caching the whole data into
memory, then you would need it.). You can persist the data by setting
DISK_AND_MEMORY_SER property if you don't want to bring whole data into
memory, in this case most of the data would reside on the disk and spark
will utilize it efficiently.

Thanks
Best Regards

On Fri, Oct 24, 2014 at 8:47 AM, jian.t  wrote:

> Hello,
> I am new to Spark. I have a basic question about the memory requirement of
> using Spark.
>
> I need to join multiple data sources between multiple data sets. The join
> is
> not a straightforward join. The logic is more like: first join T1 on column
> A with T2, then for all the records that couldn't find the match in the
> Join, join T1 on column B with T2, and then join on C and son on. I was
> using HIVE, but it requires multiple scans on T1, which turns out slow.
>
> It seems like if I load T1 and T2 in memory using Spark, I could improve
> the
> performance. However, T1 and T2 totally is around 800G. Does that mean I
> need to have 800G memory (I don't have that amount of memory)? Or Spark
> could do something like streaming but then again will the performance
> sacrifice as a result?
>
>
>
> Thanks
> JT
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Memory-requirement-of-using-Spark-tp17177.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: spark is running extremely slow with larger data set, like 2G

Try providing the level of parallelism parameter to your reduceByKey
operation.

Thanks
Best Regards

On Fri, Oct 24, 2014 at 3:44 AM, xuhongnever  wrote:

> my code is here:
>
> from pyspark import SparkConf, SparkContext
>
> def Undirect(edge):
> vector = edge.strip().split('\t')
> if(vector[0].isdigit()):
> return [(vector[0], vector[1])]
> return []
>
>
> conf = SparkConf()
> conf.setMaster("spark://compute-0-14:7077")
> conf.setAppName("adjacencylist")
> conf.set("spark.executor.memory", "1g")
>
> sc = SparkContext(conf = conf)
>
> file = sc.textFile("file:///home/xzhang/data/soc-LiveJournal1.txt")
> records = file.flatMap(lambda line: Undirect(line)).reduceByKey(lambda a,
> b:
> a + "\t" + b )
> #print(records.count())
> #records = records.sortByKey()
> records = records.map(lambda line: line[0] + "\t" + line[1])
> records.saveAsTextFile("file:///home/xzhang/data/result")
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-is-running-extremely-slow-with-larger-data-set-like-2G-tp17152p17153.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

Make sure the guava jar
 is present
in the classpath.

Thanks
Best Regards

On Thu, Oct 23, 2014 at 2:13 PM, Stephen Boesch  wrote:

> After having checked out from master/head the following error occurs when
> attempting to run any test in Intellij
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:648)
>
>
> There appears to be a related issue/JIRA:
>
>
> https://issues.apache.org/jira/browse/SPARK-3217
>
>
> But the conditions described do not apply in my case:
>
>  Did you by any chance do one of the following:
>
>- forget to "clean" after pulling that change
>- mix sbt and mvn built artifacts in the same build
>- set SPARK_PREPEND_CLASSES
>
>
> For reference here is the full stacktrace:
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:648)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.SparkContext.(SparkContext.scala:179)
> at org.apache.spark.SparkContext.(SparkContext.scala:119)
> at org.apache.spark.SparkContext.(SparkContext.scala:134)
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:62)
> at
> org.apache.spark.sql.hbase.JavaHBaseSQLContext$.main(JavaHBaseSQLContext.scala:45)
> at
> org.apache.spark.sql.hbase.JavaHBaseSQLContext.main(JavaHBaseSQLContext.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> Caused by: java.lang.ClassNotFoundException:
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> Exception in thread "delete Spark temp dirs"
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.spark.util.Utils$
> at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)
>
>
>

How can I set the IP a worker use?

2014-10-24 Thread Theodore Si


Hi all,

I have two network interface card on one node, one is a Eithernet card, 
the other Infiniband HCA.
The master has two IP addresses, lets say 1.2.3.4 (for Eithernet card) 
and 2.3.4.5 (for HCA).

I can start the master by
export SPARK_MASTER_IP='1.2.3.4';sbin/start-master.sh
to let master listen on 1.2.3.4:7077

But when I connect the worker to the master by using
spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker 
spark://1.2.3.4:7077


I will get errors, since it is using its HCA card. How can I let the 
worker use its Eithernet card?


Thanks

Re: how to run a dev spark project without fully rebuilding the fat jar ?

You can use the --jars option to submit multiple jars using the
spark-submit, so you can simply build the jar that you have modified.

Thanks
Best Regards

On Thu, Oct 23, 2014 at 11:16 AM, Mohit Jaggi  wrote:

> i think you can give a list of jars - not just one - to spark-submit, so
> build only the one that has changed source code.
>
> On Wed, Oct 22, 2014 at 10:29 PM, Yang  wrote:
>
>> during tests, I often modify my code a little bit  and want to see the
>> result.
>> but spark-submit requires the full fat-jar, which takes quite a lot of
>> time to build.
>>
>> I just need to run in --master local mode. is there a way to run it
>> without rebuilding the fat jar?
>>
>> thanks
>> Yang
>>
>
>

Re: hive timestamp column always returns null

Try doing a *cat -v your_data | head -n3 *and make sure you are not having
any ^M at the end of the lines. Also your 2,3 rows doens't contain any
space in the data.

Thanks
Best Regards

On Thu, Oct 23, 2014 at 9:23 AM, tridib  wrote:

> Hello Experts,
> I created a table using spark-sql CLI. No Hive is installed. I am using
> spark 1.1.0.
>
> create table date_test(my_date timestamp)
> row format delimited
> fields terminated by ' '
> lines terminated by '\n'
> LOCATION '/user/hive/date_test';
>
> The data file has following data:
> 2014-12-11 00:00:00
> 2013-11-11T00:00:00
> 2012-11-11T00:00:00Z
>
> when I query using "select * from date_test" it returns:
> NULL
> NULL
> NULL
>
> Could you please help me to resolve this issue?
>
> Thanks
> Tridib
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/hive-timestamp-column-always-returns-null-tp17079.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Spark: Order by Failed, java.lang.NullPointerException