question about spark streaming

2015-08-10 Thread sequoiadb
hi guys,

I have a question about Spark Streaming.
There’s an application that keeps sending transaction records into the Spark stream at 
about 50k TPS.
Each record represents a sale, including customer id / product id / time / price columns.

The application is required to monitor the change of price for each product. 
For example, if the price of a product increases 10% within 3 minutes, it will 
send an alert to the end user.

The batch interval is required to be 1 second, and the window is somewhere between 
180 and 300 seconds.

The issue is that I have to compare the price of each transaction (about 10k different 
products in total) against the lowest/highest price for the same product over the entire 
past 180 seconds.

That means that every single second I have to loop through 50k transactions and compare 
each price against the prices of the same product across all 180 seconds. So it seems I 
have to partition the calculation by product id, so that each worker only processes a 
certain list of products.

For example, if I can make sure the same product id always goes to the same worker, 
no data needs to be shuffled between workers for each comparison. Otherwise, if every 
transaction has to be compared against RDDs spread across multiple workers, I guess it 
may not be fast enough for the requirement.

Does anyone know how to pin each transaction record to a specific worker node based on 
its product id, in order to avoid massive shuffle operations?

If I simply make the product id the key and the price the value, reduceByKeyAndWindow 
may cause a massive shuffle and slow down the whole throughput. Am I correct?
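
Here is roughly what I have in mind, as a rough, untested sketch. It assumes the incoming 
records are already parsed into a DStream[(String, Double)] of (product id, price) pairs 
on a 1-second batch interval; the partition count (48) and the 10% threshold are only 
placeholders:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def minMaxAlerts(prices: DStream[(String, Double)]): Unit = {
  val partitioner = new HashPartitioner(48)   // placeholder partition count

  // Track (min, max) price per product over a 180s window sliding every 1s.
  // Passing the same partitioner to the windowed reduce should keep a given
  // product id in the same partition, so only the initial keying shuffles data.
  val minMax: DStream[(String, (Double, Double))] = prices
    .mapValues(p => (p, p))
    .reduceByKeyAndWindow(
      (a: (Double, Double), b: (Double, Double)) =>
        (math.min(a._1, b._1), math.max(a._2, b._2)),
      Seconds(180), Seconds(1), partitioner)

  // Naive alert: fires when max >= 110% of min inside the window.
  // (It ignores whether the low came before the high; a real check would need
  // timestamps, and would push to an alerting system instead of printing on
  // the executors.)
  minMax.foreachRDD { rdd =>
    rdd.filter { case (_, (lo, hi)) => hi >= lo * 1.1 }
       .foreach { case (pid, (lo, hi)) =>
         println(s"ALERT: product $pid moved from $lo to $hi within the window") }
  }
}

Would something like this avoid a per-batch shuffle, or does reduceByKeyAndWindow still 
repartition the whole window every second?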

Thanks


question about sparksql caching

2015-05-14 Thread sequoiadb
Hi all,

We are planning to use SparkSQL in a DW (data warehouse) system. There’s a question 
about the caching mechanism of SparkSQL.

For example, suppose I have a query like
sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1").cache()

Is it going to cache the final result, or the raw data of each table used in the SQL?

Since users may run all kinds of SQL queries against those tables, if the caching covers 
only the final result, a brand-new SQL may still take a very long time to scan the 
entire tables.

If this is the case, is there a better way to cache the base tables instead of the 
final result?
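
In other words, is something like the following the intended way to do it? (A rough 
sketch, assuming T1 and T2 are already registered as tables and sqlContext is in scope, 
e.g. in spark-shell.)

// Cache the base tables themselves, so later queries (including brand-new ones)
// read the in-memory columnar copies instead of rescanning the source.
sqlContext.cacheTable("T1")
sqlContext.cacheTable("T2")

// .cache() on a query result caches only that result, not the input tables.
val summary = sqlContext.sql(
  "select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1")
summary.cache()
summary.count()   // caching is lazy; an action materializes the cached data

// Release the memory when the tables are no longer needed.
sqlContext.uncacheTable("T1")
sqlContext.uncacheTable("T2")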

Thanks


what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread sequoiadb
If I run Spark in standalone mode (not YARN mode), is there any tool like Sqoop that is 
able to transfer data from an RDBMS into Spark storage?
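
The only approach I can think of so far is reading straight over JDBC with the jdbc data 
source, roughly like the untested sketch below. The URL, table name, credentials and 
partition bounds are placeholders, and the database’s JDBC driver jar has to be on the 
classpath (e.g. via --jars). Is there anything better, closer to what Sqoop does?

// Rough sketch: load a table from an RDBMS through the jdbc data source.
// Everything in the option map is a placeholder for this example.
val ordersDF = sqlContext.load("jdbc", Map(
  "url"             -> "jdbc:mysql://dbhost:3306/sales?user=spark&password=secret",
  "dbtable"         -> "orders",
  // optional: split the read into 8 parallel partitions on a numeric column
  "partitionColumn" -> "order_id",
  "lowerBound"      -> "1",
  "upperBound"      -> "10000000",
  "numPartitions"   -> "8"))

ordersDF.registerTempTable("orders")   // query it with SparkSQL afterwards
ordersDF.cache()                       // or keep it in memory for reuse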

Thanks


how to specify multiple masters in sbin/start-slaves.sh script?

2015-03-19 Thread sequoiadb
Hey guys,

Not sure if I’m the only one who has run into this. We are building a highly available 
standalone Spark environment, using ZooKeeper with 3 masters in the cluster.
However, sbin/start-slaves.sh calls start-slave.sh for each member of the conf/slaves 
file, and specifies the master using $SPARK_MASTER_IP and $SPARK_MASTER_PORT:

exec $sbin/slaves.sh cd $SPARK_HOME \; $sbin/start-slave.sh 1 spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT

But if I want to specify more than one master node, I have to use the format
spark://host1:port1,host2:port2,host3:port3

In this case, it seems the original sbin/start-slaves.sh can’t do the trick.
Does everyone need to modify the script in order to build an HA cluster, or is there 
something I missed?

Thanks

sparksql native jdbc driver

2015-03-18 Thread sequoiadb
hey guys,

In my understanding, SparkSQL only supports JDBC connections through the Hive Thrift 
server. Is this correct?
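
For reference, the kind of connection I mean is the Thrift server route below (a rough 
sketch; the host, port, table and credentials are placeholders, and hive-jdbc plus its 
dependencies have to be on the client classpath):

import java.sql.DriverManager

// Connect to the Spark SQL Thrift server (started with sbin/start-thriftserver.sh)
// through the standard HiveServer2 JDBC driver. 10000 is the default port.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://thrift-host:10000/default", "user", "")
try {
  // some_table is a placeholder table name
  val rs = conn.createStatement().executeQuery("select count(*) from some_table")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}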

Thanks


building all modules in spark by mvn

2015-03-13 Thread sequoiadb
guys, is there any easier way to build all the modules with mvn?
Right now if I run “mvn package” in the Spark root directory I get:
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  8.327 s]
[INFO] Spark Project Networking ... SKIPPED
[INFO] Spark Project Shuffle Streaming Service  SKIPPED
[INFO] Spark Project Core . SKIPPED
[INFO] Spark Project Bagel  SKIPPED
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming  SKIPPED
[INFO] Spark Project Catalyst . SKIPPED
[INFO] Spark Project SQL .. SKIPPED
[INFO] Spark Project ML Library ... SKIPPED
…

Apparently only the parent project is built and all the other child projects are skipped.
I can get the sparksql/streaming projects built with sbt/sbt, but if I’d like to use mvn 
and don’t want to build each dependent module separately, is there any good way to do it?

Thanks


Unable to stop Worker in standalone mode by sbin/stop-all.sh

2015-03-12 Thread sequoiadb
Checking the script, it seems spark-daemon.sh is unable to stop the worker:
$ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
no org.apache.spark.deploy.worker.Worker to stop
$ ps -elf | grep spark
0 S taoewang 24922 1  0  80   0 - 733878 futex_ Mar12 ?   00:08:54 java 
-cp 
/data/sequoiadb-driver-1.10.jar,/data/spark-sequoiadb-0.0.1-SNAPSHOT.jar::/data/spark/conf:/data/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar
 -XX:MaxPermSize=128m -Dspark.deploy.recoveryMode=ZOOKEEPER 
-Dspark.deploy.zookeeper.url=centos-151:2181,centos-152:2181,centos-153:2181 
-Dspark.deploy.zookeeper.dir=/data/zookeeper 
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker 
spark://centos-151:7077,centos-152:7077,centos-153:7077

In the spark-daemon script, it tries to find $pid in /tmp/:
pid="$SPARK_PID_DIR/spark-$SPARK_IDENT_STRING-$command-$instance.pid"

In my case the pid file is supposed to be: 
/tmp/spark-taoewang-org.apache.spark.deploy.worker.Worker-1.pid

However, when I go through the files in the /tmp directory, I don’t find any such file.
/tmp has mode 777, and I was able to touch a file there with my current account, so it 
shouldn’t be a permission issue.
$ ls -la / | grep tmp
drwxrwxrwx.   6 root root  4096 Mar 13 08:19 tmp

Does anyone have any idea why the pid file didn’t show up?

Thanks
TW


Re: Unable to stop Worker in standalone mode by sbin/stop-all.sh

2015-03-12 Thread sequoiadb
Nope, I can see that the master’s pid file exists but not the worker’s:
$ ls
bitrock_installer.log
hsperfdata_root
hsperfdata_taoewang
omatmp
sbt2435921113715137753.log
spark-taoewang-org.apache.spark.deploy.master.Master-1.pid

 On Mar 13, 2015, at 9:34 AM, Ted Yu yuzhih...@gmail.com wrote:
 
 Does the machine have a cron job that periodically cleans up the /tmp dir?
 
 Cheers
 